Skip to content
HomeBlogSnippets

The Architecture of Distributed Observability

Rowland Adimoha

Rowland Adimoha / June 03, 2026

55 min read


HomeAbout
Rowland Adimoha © 2026
TwitterGitHub
Snippets
The Architecture of Distributed Observability

Observability platforms often fail in inverse proportion to the seriousness of their stated mission. There is no spectacular crash, no smoking gun in a stack trace, no on-call page at three in the morning. There is, instead, a slow erosion of confidence. A latency dashboard becomes "directional rather than accurate." A traces tab acquires a reputation for being slow enough that engineers stop opening it. A logs query that used to return in two seconds quietly begins returning in twelve, and then in forty, and then is silently capped at a thousand rows that the operator did not request and does not realise. The system that was built to be the source of truth becomes the place engineers go to confirm what they already suspect.

This technical analysis is derived from the engineering of a self-hosted, multi-tenant observability platform written in Go: a distributed system whose entire purpose is to make other distributed systems legible. Building such a platform does not begin with a feature list; it begins with a threat model. It must be assumed that the data is hostile (cardinality is unbounded, payloads are malformed, tenants are noisy), that the network is fallible (ClickHouse will stall, NATS will partition, S3 will return 503), and that operators are forgetful (someone, somewhere, will paste a one-line shell script that ships ten thousand spans per second from a debug build into production). This is an examination of how such a system is engineered to survive that reality: the architectural primitives, the trade-offs, and the philosophy of a platform that treats "correctness under load" as its primary primitive.

The Ghost in the Telemetry: Why Observability Platforms Fail Silently

The fundamental tension in observability engineering is between fidelity and economics. A perfectly observable system would record every event, every state transition, every byte of every payload, indexed for arbitrary query at sub-second latency. Such a system would also cost more to operate than the system it observes. Every real observability platform is therefore a sequence of compromises: which data is dropped, which is sampled, which is aggregated, which is held in expensive hot storage, which is exiled to cheap cold storage.

The danger is that these compromises are usually invisible from the dashboard. A 50% sampling rate looks identical to a 100% sampling rate until the moment a specific request needs to be located, at which point one of two things has happened to it and there is no way to tell which.

The Three Pillars and the False Premise of Independence

The dominant framing of observability, made canonical by the OpenTelemetry project and the Distributed Systems Observability booklet (Cindy Sridharan, O'Reilly 2018), divides telemetry into three pillars: metrics, logs, and traces. The framing is useful for organising vendor pitches. It is misleading as an architecture.

The pillars are not independent. A trace is, structurally, a tree of structured logs that happen to share a trace_id. A metric is, in OpenTelemetry's own model, a temporally aggregated sequence of point-in-time observations that could equivalently be expressed as a series of structured log records. The pillars exist as distinct concepts because the storage substrates that won the early market — Prometheus for metrics, Elasticsearch for logs, Jaeger for traces — had incompatible schemas, query languages, and operational profiles. The pillars are a historical accident hardened into vocabulary.

A modern observability platform that takes the pillars literally inherits all of their friction. Three storage tiers must be operated, three query languages must be learned, three sets of correlation IDs must be plumbed through every transport, and three retention policies must be negotiated with finance. The platform examined here makes the opposite choice: a single columnar store (ClickHouse) holds all three signal types in three tables that share tenant_id as the leading sort key, and a single query language (ObserveQL, with compatibility shims for PromQL and LogQL) addresses them uniformly.

Cardinality, the Silent Killer

The single most common cause of an observability platform's economic collapse is uncontrolled label cardinality. A request_duration histogram tagged with customer_id looks reasonable at three customers and catastrophic at three hundred thousand. The Prometheus documentation has warned about this since version 1, with the canonical advice that "every unique combination of key-value label pairs represents a new time series." In practice, the warning is ignored until the storage backend's index outgrows its memory budget and the platform begins to drop scrape attempts faster than it ingests them.

The architectural response is to make cardinality a first-class operational concern. In the platform examined here, every receiver enriches incoming signals with a small, fixed set of attributes (observex.version, observex.ingested_at, tenant-derived metadata) and refuses to inflate the attribute set beyond what the schema expects. Unbounded label space is pushed into ClickHouse Map(String, String) columns rather than into first-class index columns, where its query cost is borne only when queries actually touch it.

sig.Attributes["observex.version"] = "1.0.0"
sig.Attributes["observex.ingested_at"] = time.Now().UTC().Format(time.RFC3339Nano)

The discipline is not glamorous. It consists of refusing pull requests that add new label keys without a written justification, and of building dashboards that show per-tenant cardinality growth alongside per-tenant ingest volume so the operations team sees the second-order metric before the first-order metric becomes a problem.

The Vendor Lock-in Tax

The third hidden failure mode is the slow accretion of vendor-specific assumptions throughout the application code. A team adopts a SaaS APM, instruments its code with the vendor's SDK, names its spans according to the vendor's preferred conventions, embeds the vendor's correlation headers in its outbound HTTP calls, and writes alerting rules in the vendor's bespoke expression language. Five years later, switching costs are measured in person-years.

The OpenTelemetry project exists in significant part to solve this problem. By standardising the wire format (OTLP), the semantic conventions (HTTP attributes, RPC attributes, database attributes), and the SDK surface (Tracer, Meter, Logger), it allows the instrumentation to be portable across backends. The platform examined here is OTLP-native — services/ingest-gateway exposes /v1/traces, /v1/metrics, and /v1/logs accepting OTLP-protobuf with optional gzip — precisely so the instrumentation library used by a customer's application is decoupled from the choice of backend storage. The W3C Trace Context recommendation provides the same portability guarantee at the propagation layer.

The Architecture of a Receiver: Back-Pressure as a First-Class Citizen

The first design decision in any high-throughput ingest system is what happens when the system is asked to accept more than it can durably persist. The naïve answer is "queue it in memory and hope the burst passes." The naïve answer fails the moment the burst does not pass: the queue grows, the garbage collector grows with it, the latency tail balloons, the pod is killed by Kubernetes for OOM, and the queue's contents are lost. The result is worse than refusing the load up front.

The Bounded Channel and the Honest 429

The architectural primitive is a bounded channel. In Go, a buffered channel with a capacity declared at construction time provides exactly the right semantics: writers block (or fail) once the buffer is full, and the buffer size is an explicit, profiled budget rather than a function of available RAM. In the platform examined here, the ingest pipeline begins with a 65,536-slot channel that the public-facing ProcessSignal API treats as a non-blocking try-send:

var ErrOverloaded = errors.New("processing engine overloaded")

When the channel is full, ProcessSignal returns ErrOverloaded immediately. The receiver translates this into an HTTP 429 Too Many Requests (or gRPC RESOURCE_EXHAUSTED) before the TCP receive buffers begin to fill. The client SDK — any compliant OTLP exporter — then applies its retry-with-backoff policy, which is exactly the behaviour the OpenTelemetry SDK specification requires.

The honesty of the 429 is more important than its rarity. A system that silently absorbs every request until it collapses creates an unbounded fault domain: the failure is bigger than any single client can compensate for. A system that returns 429 the moment its bounded budget is exceeded distributes the back-pressure across the entire client population, allowing them to retry, buffer, or shed load locally. The pattern is documented in the Google SRE Workbook as "Load shedding from the inside out" and is the difference between a system that degrades gracefully and one that fails catastrophically.

Pipeline Stages and the Single-Owner Rule

Once a signal is past the bounded ingress, it flows through a sequence of pipeline stages chained by smaller bounded channels:

ProcessSignal → ingestCh(65536) → Decode → Validate → Enrich
            → worker pool (GOMAXPROCS) → sampling → WAL → Backend → actor

Each stage has exactly one responsibility, and each piece of state has exactly one owner. The decode stage rejects payloads that are not well-formed JSON or protobuf. The validate stage rejects signals without a tenant_id. The enrich stage stamps platform metadata. The sampling decision lives exclusively in the engine — the per-tenant actor downstream is forbidden from overriding it. This is the single-owner rule, and its absence is the most common source of bugs in pipelines whose stages share mutable state.

func ValidateStage(ctx context.Context, in <-chan signal.Signal) (<-chan signal.Signal, error) {
    out := make(chan signal.Signal, 1024)
    go func() {
        defer close(out)
        for sig := range in {
            if sig.TenantID == "" {
                observability.SignalsDropped.WithLabelValues(
                    "", string(sig.Type), "no_tenant").Inc()
                continue
            }
            select {
            case out <- sig:
            case <-ctx.Done():
                return
            }
        }
    }()
    return out, nil
}

The cost of a stage is one goroutine and a channel; the benefit is that any stage can be reasoned about, tested, and replaced independently. Composition is by Chain(...stages), not by mutating a shared "signal context" object. Doug McIlroy's 1964 memo on Unix pipes — "We should have some ways of coupling programs like garden hose — screw in another segment when it becomes necessary to massage data in another way" — remains the cleanest articulation of the discipline.

The Durability Contract: Bounded Loss vs Theatre

Every observability platform makes a promise about what happens to a signal once it has been accepted. The promises range from the rigorously specified ("acknowledged signals survive a host crash within a bounded window") to the theatrical ("acknowledgment means we accepted your bytes, what happens after is between us and our garbage collector"). The difference is a write-ahead log.

The Write-Ahead Log and the Five-Millisecond Window

A write-ahead log is the oldest durability primitive in databases: every state change is appended to a sequential log on disk before the in-memory state is updated. If the process crashes, recovery replays the log from the last known good position. The technique was articulated for relational databases by C. Mohan and colleagues in the ARIES paper (ACM TODS, 1992) and has propagated into nearly every storage system that takes durability seriously.

The naïve implementation calls fsync(2) on every write. This is correct but slow: a typical NVMe SSD can sustain roughly five to ten thousand fsync operations per second, and an ingest path that needs to accept tens of thousands of signals per second cannot afford one fsync per signal. The technique used to bridge the gap is group commit: writes are batched in page cache and flushed by a background goroutine on a fixed cadence.

const defaultSyncInterval = 5 * time.Millisecond

The platform's contract becomes: "any signal acknowledged at time T is guaranteed durable to disk by time T + 5ms, modulo disk-side caches." The five-millisecond figure is not a guess. It is the smallest interval at which a modern Linux kernel will reliably issue an SSD flush without dominating the device's queue depth, and it is short enough that even an aggressive crash-and-restart loop loses at most a single flush window of data. The trade is documented, measurable, and respected by the rest of the system: downstream consumers know that the WAL is the source of truth for replay, and that anything before the last flush boundary will survive any process-level failure.

Self-Describing Entries: Recovery Without Trust

Group commit alone is not sufficient. A WAL must also be recoverable: after a crash, a new process must be able to read the log files, locate the last consistent record, and resume writing from the next byte. If the log entries are not self-describing, recovery becomes guesswork, and guesswork in a durability layer produces silent data loss.

The platform's entry format is intentionally pedantic:

+-------+-------+-------+-----------+-----------+
| magic | len   | crc32 | timestamp |  payload  |
| u32   | u32   | u32   |  int64    |  []byte   |
+-------+-------+-------+-----------+-----------+

The magic number lets recovery distinguish "valid entry start" from "zeroed tail of a truncated file." The length lets recovery skip the payload without parsing it. The CRC32 lets recovery detect bit-flips on disk (rare but real — Bairavasundaram et al., FAST '08, An Analysis of Latent Sector Errors in Disk Drives, found uncorrectable bit errors in roughly 3.5% of enterprise drives within their first two years). The timestamp lets the WAL serve as a coarse-grained event log when the structured backend is unavailable.

The recovery procedure becomes a single loop: read header, validate magic, validate CRC, advance offset, repeat. The first failure terminates the scan, and the next write resumes from the last good boundary. The procedure is auditable in the same way a file-system journal is auditable — and for the same reasons.

Multi-Tenant Isolation: Beyond the Bearer Token

A multi-tenant observability platform has a strict obligation: under no circumstances may tenant A's data be visible to tenant B. This is not a UI concern, an API concern, or a billing concern. It is a security concern, and the defences against it must be layered, redundant, and database-enforced.

The KeyStore Seam

Authentication begins at the receiver. Every incoming signal must carry an API key in the Authorization header (for HTTP) or in gRPC metadata (for the gRPC path). The key resolves to a tenant identifier through a KeyStore interface:

type KeyStore interface {
    ValidateKey(key string) (tenantID string, valid bool)
}

The interface has three implementations. The production implementation, PostgresKeyStore, looks up keys in a PostgreSQL table where each key is stored as a BLAKE3 digest with an Argon2id-derived password fingerprint, never as plaintext. A second implementation, MemoryKeyStore, exists for tests. A third, StatelessKeyValidator, derives every key from a single shared secret and is documented as dev-only with a security note bordering on a warning label: "any leak of the secret lets an attacker mint a valid key for every tenant."

The choice between implementations is made by configuration: if OBSERVE_X_POSTGRES_URL is set, the production store is loaded; otherwise the gateway refuses to start in non-development mode. This is a fail-closed default: an operator who forgets to configure Postgres cannot accidentally deploy the dev validator into production, because the gateway will refuse to come up.

PostgreSQL Row-Level Security as a Database Invariant

Application-level tenant filtering is necessary but insufficient. A bug in a single WHERE clause can leak data across tenants in a way that no code review will catch consistently. The defence against this is database-enforced tenant isolation, implemented in the platform's control-plane database via PostgreSQL Row-Level Security:

ALTER TABLE api_keys ENABLE ROW LEVEL SECURITY;
 
CREATE POLICY tenant_isolation ON api_keys
    USING (tenant_id = current_setting('app.tenant_id', true));

Before every transaction, the application sets app.tenant_id via SET LOCAL. After that, even a missing WHERE tenant_id = $1 clause produces zero rows rather than a cross-tenant leak. The mechanism is documented at length in the PostgreSQL Row Security Policies documentation and is the standard pattern for multi-tenant Postgres deployments.

The same principle is replicated in ClickHouse via mandatory tenant_id predicates that the query planner injects from the bearer-token context. The platform's ObserveQL planner explicitly refuses to lower a query that lacks a tenant predicate; the test suite includes negative tests asserting that a hand-crafted attempt to omit tenant_id from a user-supplied query is rejected with an error.

Per-Tenant Actors and the Quarantine

The runtime isolation primitive is the per-tenant actor: a goroutine plus a bounded mailbox that owns the per-tenant CEP (Complex Event Processing) state. The actor model dates to Carl Hewitt's 1973 paper and remains the cleanest way to ensure that a misbehaving tenant cannot starve a well-behaved one of CPU time:

type Actor interface {
    Start(ctx context.Context) error
    Stop() error
    Mailbox() chan<- signal.Signal
}

Each actor's mailbox is bounded at 4,096 slots. A flooding tenant fills its own mailbox and is back-pressured; its neighbours are unaffected. If the actor panics (a bug in a CEP rule, an out-of-bounds attribute access), the supervisor restarts it with exponential backoff. If it panics repeatedly within a sliding window — the default is five restarts in sixty seconds — the supervisor promotes the actor to a quarantine state and refuses to recreate it until a human revokes the quarantine through the operator console.

The quarantine is the difference between a system that gracefully isolates a corrupt tenant and a system that endlessly thrashes a panicking actor at the cost of every other tenant's latency budget. The pattern is borrowed wholesale from Erlang/OTP's supervisor behaviour, documented in Joe Armstrong's Programming Erlang (2nd edition, Pragmatic Bookshelf, 2013), and remains the best-articulated discipline for fault tolerance in long-running systems.

Sampling Without Losing the Signal

The economics of trace storage make full retention infeasible above a certain ingest rate. A system that produces ten million spans per second cannot store ten million spans per second indefinitely; the storage cost would exceed the revenue. The discipline that bridges the gap is sampling: select a small, representative subset of traces to persist and discard the rest.

The trap is that naïve uniform sampling discards exactly the traces that matter most. A 1% uniform sample over a system that experiences a 0.5% error rate will, on average, retain one error trace out of every two hundred — sufficient for a customer-facing complaint to surface a trace whose details have been irretrievably lost.

Tail-Based Sampling and the Mathematics of Z-Score Priority

The architectural response is tail-based sampling: defer the keep/drop decision until enough of the trace is observed to compute a meaningful priority score, then sample preferentially toward high-score traces. The platform's score function combines static attributes (severity, duration, parent-sampled bit) with a per-service rolling z-score:

score = base
      + 10 × latency_zscore
      + 25 if parent_sampled

The z-score is computed against an exponentially weighted moving baseline maintained per service:

func (t *ewmaTracker) zscore(key string, x float64) float64 {
    t.mu.RLock()
    defer t.mu.RUnlock()
    n := t.seen[key]
    if n < 30 {
        return 0
    }
    mean := t.mean[key]
    std := math.Sqrt(t.vari[key])
    if std < 1e-9 {
        return 0
    }
    return (x - mean) / std
}

The thirty-observation floor exists because a single outlier in a cold tracker would otherwise dominate the priority, biasing the sample toward the first request the platform happened to see for that service. The alpha coefficient (0.05) is calibrated so the moving mean reaches half-life in approximately twenty observations, fast enough to follow real workload shifts and slow enough to ignore single-request noise. The mathematics of EWMA control is detailed in Roberts' 1959 paper Control Chart Tests Based on Geometric Moving Averages and remains the standard signal-processing primitive for anomaly detection on streaming data.

Adaptive Rates and Cold-Start Behaviour

A second consideration is what happens at start-up, before the EWMA has seen enough traffic to compute a meaningful baseline. The platform's sampler accepts a base rate (default 10%) that governs uniform sampling for traces whose z-score is uninformative. As the baseline matures, the z-score component progressively dominates, and the sampler transitions from "10% uniform" to "10% baseline plus all anomalies." The transition is smooth, requires no operator intervention, and is observable through a Prometheus counter that exposes the per-tenant ratio of score-driven versus base-rate samples.

The state of every per-service tracker can optionally be persisted to a StateStore (Redis in production, in-memory in tests) on a configurable cadence. A process restart then resumes with the baselines learned by its predecessor, rather than throwing away an hour's worth of statistical adaptation. This is a write-behind strategy: the persistence is best-effort and lives off the hot path, so a Redis outage degrades to "lost baselines until Redis recovers" rather than "ingest stalls."

The Supervisor Pattern: Borrowing from Erlang Without Importing the VM

The defining property of an observability platform is that it is the last line of defence: when everything else is failing, the observability layer must remain visible. This places an unusual reliability requirement on the platform — it must outlive the failures it observes.

The Erlang/OTP ecosystem has been articulating the discipline of let it crash, then supervise for nearly four decades. The premise is that defensive programming inside a process is futile (every defensive check is itself a source of bugs) and that the correct architectural primitive is a supervisor that detects process death, restarts the process from a known-good initial state, and tracks restart frequency to detect pathological loops.

One-for-One Restart and Exponential Backoff

The platform's supervisor implements the OTP one-for-one strategy: when an actor crashes, only that actor is restarted; its siblings are unaffected. The restart is gated by an exponential backoff bounded at thirty seconds:

type Options struct {
    MailboxSize    int
    MaxRestarts    int
    RestartWindow  time.Duration
    BackoffMin     time.Duration
    BackoffMax     time.Duration
    HealthInterval time.Duration
}

The backoff exists to prevent a thrashing crash loop from saturating the CPU with restart attempts. The window-bounded restart count exists to detect crash loops that survive the backoff (a deterministic crash on the first message, for example) and escalate them to quarantine before they consume operational attention indefinitely. The discipline is the same one Charity Majors articulates in Observability Engineering (O'Reilly 2022): the system that fails predictably is more operable than the system that fails creatively.

Quarantine as a Civilised Circuit Breaker

The escalation from "repeatedly restarting" to "quarantined" is a deliberate refusal to confuse persistence with progress. A traditional circuit breaker (Hystrix-style, documented in Michael Nygard's Release It!, 2nd edition, Pragmatic Bookshelf, 2018) opens after a failure threshold and probes periodically to re-close. A quarantine, in contrast, requires human acknowledgment before it lifts. The asymmetry is intentional: an actor that has crashed five times in sixty seconds is exhibiting a bug, not a transient fault, and silently retrying it indefinitely is exactly the behaviour that lets a single buggy tenant degrade a multi-tenant cluster.

The quarantined state is exposed through a Prometheus gauge (observex_actor_quarantined_total) and surfaced in the operator console. The console offers a single button — "Revoke Quarantine" — that requires the operator to acknowledge the prior crash count before re-enabling the actor. The friction is the feature.

The Spillover Side-Car: Durable Buffering at the Margin

Bounded mailboxes provide back-pressure, but back-pressure is a blunt instrument. A tenant that briefly exceeds its mailbox capacity during a downstream stall (a ClickHouse merge storm, a network blip during a redeploy) should not lose data permanently — it should be buffered until the downstream catches up. The platform's answer is the spillover side-car: a NATS JetStream stream that absorbs overflow signals and re-injects them as the mailbox drains.

type Options struct {
    URL        string
    StreamName string
    MaxAge     time.Duration
    MaxBytes   int64
}

JetStream is chosen over Kafka for a specific reason: it ships in a single Go binary, requires no ZooKeeper, and is operationally proportionate to the back-pressure scenario it absorbs. The retention is bounded at one hour and one gigabyte per stream — long enough to absorb a typical operational blip, short enough that the spillover never grows into a parallel storage tier with its own retention policy.

The critical design property is graceful degradation. If JetStream itself is unavailable, the spillover degrades to the prior behaviour: drop on mailbox-full and emit a Prometheus counter. The supervisor keeps running. The spillover package's failure mode is never fatal to the platform — it is an optional improvement to an already-correct system, not a load-bearing component whose loss is a production incident.

The Storage Tier: ClickHouse, MergeTree, and the Cold-Path Decision

The choice of storage substrate determines the economic profile of an observability platform more than any other architectural decision. The two dominant options at the time of writing are LSM-tree-based key-value stores (Cassandra, HBase, ScyllaDB) optimised for write throughput, and columnar OLAP engines (ClickHouse, Druid, Pinot) optimised for analytical query latency. For workloads where queries scan large time ranges across selected columns, the columnar engines win by an order of magnitude.

ClickHouse and the MergeTree Discipline

The platform's storage substrate is ClickHouse, configured with MergeTree family tables. The ORDER BY key on every signal table is (tenant_id, [secondary partition column], timestamp). The ClickHouse documentation describes this choice precisely: "The primary key in MergeTree determines the order of data on disk." A query of the shape WHERE tenant_id = ? AND timestamp > ? reads exactly the granules that contain the requested time range, and the granule is typically 8,192 rows. The result is sub-millisecond execution for the typical operator query, and the result is observable in system.query_log rather than asserted in the documentation.

ClickHouse's secondary optimisation is columnar compression. Each column is stored as a contiguous run of values, compressed with a column-appropriate codec (Delta, DoubleDelta, Gorilla for timestamps; LZ4 or ZSTD for strings). The compression ratio on a typical metrics table is 8–15×; on a typical logs table with repetitive structured payloads it can exceed 30×. The economics are documented at length in Aleksei Milovidov's ClickHouse Architecture talks and in the original Druid paper (Yang et al., SIGMOD 2014).

Hot Disk to S3 via TTL

A column store is not free. Hot SSD storage on a managed Kubernetes node costs roughly $0.10 per gigabyte-month; an equivalent gigabyte in S3 Standard costs roughly $0.023, and in S3 Infrequent Access roughly $0.0125. For a platform whose retention requirement is "ninety days at full fidelity, with the last seven days hot," the cost differential dominates the operational budget.

ClickHouse's response is the multi-disk storage policy. A table is assigned to a policy that names one or more "disks" (in the ClickHouse sense — disk can be a local volume, an S3 bucket, or any backend that implements the disk interface), and a TTL expression controls when data migrates between disks:

CREATE TABLE logs (...)
ENGINE = MergeTree()
ORDER BY (tenant_id, service_name, timestamp)
TTL toDateTime(timestamp) + INTERVAL 7 DAY TO DISK 'cold_s3',
    toDateTime(timestamp) + INTERVAL 90 DAY DELETE
SETTINGS storage_policy = 'hot_cold';

The seven-day boundary keeps the most-queried data on local SSD; the ninety-day boundary keeps the long-tail data on S3 at one-fifth the cost. A separate controller process (services/cold-tier) reads the ClickHouse system.parts table and exports per-table, per-disk part counts as Prometheus gauges, so a stalled S3 lifecycle is observable as a metric rather than as an unexpected end-of-month invoice. The Wikipedia entry on Hierarchical storage management is the foundational reference for this pattern, which predates the cloud by decades.

The Cost-Based Optimiser

A query language is only as useful as its query planner. A naïve translator from ObserveQL to ClickHouse SQL produces correct results but suboptimal ones: a query that filters on a high-cardinality column and groups on a low-cardinality one should be planned differently from its mirror image. The platform ships a cost-based optimiser skeleton in pkg/observeql/cbo.go that examines the query, consults a small set of statistics about the target tables, and reorders predicates to push the most selective filter first.

The optimiser is deliberately minimal. The full cost-based optimisation literature — Selinger et al., Access Path Selection in a Relational Database Management System (SIGMOD 1979), is the canonical reference — addresses query graphs of arbitrary complexity. The platform's query language is constrained enough that a much smaller optimiser captures most of the practical benefit. The discipline is the same one expressed by Rob Pike's rule of thumb in Notes on Programming in C: "Fancy algorithms are slow when n is small, and n is usually small."

The Query Language Question: Native vs Compatible

Every observability platform must answer the question of how operators express queries. The dominant options are SQL (familiar, ubiquitous, but verbose for time-series shapes), PromQL (concise for metrics, awkward for logs and traces), LogQL (pleasant for logs, useless for metrics and traces), and a custom language (consistent across signal types, but every operator must learn it).

ObserveQL and the Allow-Listed Planner

The platform's native language, ObserveQL, is a SQL-shaped query language with two deliberate constraints. First, every query must select from one of three sources (metrics, logs, traces) and may not join across them — cross-signal correlation is performed by the operator via trace_id rather than by the planner via JOIN. Second, every column and function reference is checked against an explicit allow-list:

var allowedColumnsPerSource = map[string]map[string]bool{
    "metrics": {
        "tenant_id":   true,
        "metric_name": true,
        "timestamp":   true,
        "value":       true,
        "labels":      true,
        "received_at": true,
    },
    "logs": {
        "tenant_id":    true,
        "service_name": true,
        "severity":     true,
        "body":         true,
        "attributes":   true,
        "timestamp":    true,
        "trace_id":     true,
        "span_id":      true,
    },
}

The allow-list is the single defence against an attacker smuggling SQL through a column name. The OWASP SQL Injection Prevention Cheat Sheet documents this pattern explicitly: "If validation is the only defence, validation must be exhaustive." Combined with parameter binding for all user-supplied values, the planner produces queries that are SQL-injection-impossible by construction.

The planner's output is ClickHouse SQL with ? placeholders for every value. The executor injects the tenant predicate from the bearer-token context — not from the query string — so a malicious or buggy query cannot omit the tenant filter even if it omits the WHERE tenant_id = ? clause itself.

PromQL and LogQL as Compatibility Shims

A native query language is the right long-term answer, but it imposes a switching cost on operators who arrive with years of PromQL or LogQL muscle memory. The platform's response is to ship compatibility shims as additional endpoints on the query engine:

PromQL  →  /prom/api/v1/{query,query_range}
LogQL   →  /loki/api/v1/{query,query_range}

The shims are hand-rolled recursive-descent parsers that produce an internal AST, followed by a translator that lowers the AST to ClickHouse SQL of a fixed shape:

SELECT toStartOfInterval(timestamp, INTERVAL 30 SECOND) AS t,
       avg(value) AS v
FROM   metrics
WHERE  metric_name = ?
  AND  attributes['service'] = ?
  AND  timestamp BETWEEN ? AND ?
GROUP BY t
ORDER BY t

The translator supports the subset of PromQL most commonly used in dashboards (rate, irate, increase, avg_over_time and friends, sum/avg/max/min/count aggregations, binary scalar operations) and the subset of LogQL used for log filtering and counting. The unsupported tail is rejected with a clear error pointing the operator at the documentation rather than silently producing incorrect results — the cardinal sin of compatibility layers everywhere.

The discipline of fail loud on the unsupported case is what separates a useful compatibility shim from a "looks like it works, sometimes" liability. The PromQL specification is large; the goal of the shim is not to reimplement Prometheus inside ClickHouse but to let an operator copy a dashboard panel from a Prometheus deployment, point it at the platform, and have it work for the cases dashboards actually use. Operators who require the full PromQL semantics — including subqueries, the histogram functions, and the precise extrapolation behaviour at range boundaries — are directed to keep Prometheus alongside the platform.

Parameterised Queries as a Security Invariant

Every value in every translated query is bound as a parameter, not interpolated into the SQL string. The pattern is documented at length in the OWASP Cheat Sheet but bears repeating:

func matcherSQL(m matcher) (string, any, error) {
    col := fmt.Sprintf("attributes['%s']", sqlEscape(m.name))
    switch m.op {
    case matchEq:
        return col + " = ?", m.val, nil
    case matchNeq:
        return col + " != ?", m.val, nil
    case matchRegex:
        if _, err := regexp.Compile(m.val); err != nil {
            return "", nil, fmt.Errorf("bad regex %q: %w", m.val, err)
        }
        return "match(" + col + ", ?)", m.val, nil
    }
}

The only string that touches the SQL itself is the label name in the attributes['<key>'] accessor, and that string is gated through an sqlEscape helper that returns the literal "INVALID" for any character outside [A-Za-z0-9_\-.]. A malicious caller who attempts to inject SQL through a label name produces a query that ClickHouse rejects with a syntax error, not one that exfiltrates a different tenant's data.

The Visualization Layer: Where Observability Platforms End

In the architecture of observability, the visualisation layer is structurally privileged. It is the only surface most operators ever touch. The ingest pipeline is invisible by design. The storage engine is invisible by design. The query planner is, ideally, invisible by design. What remains visible — what defines whether the platform feels like a product or a collection of services — is the rendering surface that converts rows of data into the lines, bars, and waterfalls humans use to reason about systems.

This privilege creates an asymmetric design pressure. Every other component of the platform can be replaced behind a stable interface without users noticing. The visualisation layer cannot. When operators learn the keyboard shortcuts of a particular dashboard, when they internalise the visual language of severity colours, when they bookmark URLs and embed them in runbooks, they are forming muscle memory that is expensive to displace.

The Brand-Fragmentation Tax

The temptation to delegate the entire visualisation layer to Grafana is rational. Grafana is, in raw capability, a more sophisticated visualisation tool than any small team can build. Its plugin ecosystem is the largest in the observability space. Its query editor has decades of accumulated UX refinement. The Grafana Labs documentation correctly notes that the ClickHouse datasource plugin handles metrics, logs, and traces in a unified panel system.

What this analysis misses is the cost of brand fragmentation. When a transactional flow takes the operator from an alert email into the platform's web console, from the console into a dashboard, and from the dashboard into a trace waterfall, every transition between visual languages imposes a small cognitive tax. The keyboard shortcuts change. The error states render differently. The "back" button sometimes returns to the platform and sometimes returns to Grafana, depending on which UI initiated the navigation. The transitions are small individually; in aggregate, across thousands of operator-hours per quarter, they constitute a measurable productivity loss.

The architectural response is not to displace Grafana. Grafana remains the right answer for power users with existing dashboards and PromQL/LogQL muscle memory, and the platform ships provisioned Grafana dashboards alongside its native UI for exactly this audience. The architectural response is to ensure that operators who never want to leave the platform's own console can perform every common workflow without being routed to a third party's domain.

The Native Canvas vs the Imported Framework

The native operator console embeds a four-tab visualisation workbench — Metrics, Logs, Traces, Dashboards — inside a single Go binary that bundles its static assets through embed.FS and ships as one container image. The charting primitive is hand-rolled vanilla JavaScript drawing to HTML5 Canvas:

class ObservexChart {
  constructor(canvas, opts = {}) {
    this.canvas = canvas;
    this.ctx = canvas.getContext("2d");
  }
}

The decision to build rather than buy was contested. The alternative was uPlot, which is the fastest known charting library for time-series rendering in JavaScript and which has been measured to render 100,000-point series in under 16 milliseconds on commodity hardware. The cost of uPlot is not its 40-kilobyte minified bundle; the cost is the build pipeline (which the operator console deliberately does not have), the version-coordination problem (a security advisory in a transitive dependency becomes the platform's problem), and the visual coherence problem (uPlot has its own opinions about typography, padding, and interaction).

The chart's typical panel renders between 200 and 2,000 points. At those sizes, the perceived rendering speed of the hand-rolled implementation and uPlot are identical. The advantage of uPlot's optimisation matters at workloads the platform does not serve. The advantage of the hand-rolled implementation — visual coherence with the rest of the operator console, no external CSS injection, no script-src CSP exemptions, no vendor-update treadmill — is permanent across all workloads.

The principle generalises. The small library beats the large library when the consumer controls the surface area. The condition is essential: a team that needs the full surface of a domain (a production-grade rich-text editor, a fully spec-compliant SQL parser, a complete WebGL terrain renderer) should not reimplement it. A team that needs a known, bounded subset (a time-series chart with tooltips, a Gantt waterfall, a basic trace search) should consider whether the cost of ownership is lower than the cost of dependency. The trade-off is examined in detail in Dan McKinley's Choose Boring Technology essay (2015), which remains the cleanest articulation of the discipline.

Server-Sent Events vs the Push-Bus Temptation

The platform's logs tab supports a live-tail mode that streams new log lines to the browser as they arrive. The instinct in distributed-systems engineering is that this problem is shaped like publish-subscribe: a log line arrives at the ingest gateway, a browser tab somewhere wants to see it, therefore a message bus should connect them.

The instinct is wrong for the workload. The platform's NATS deployment exists to absorb supervisor mailbox overflow; adding a logs.tail.<tenant_id> subject would put NATS in the hot path of every log line. The cost is what NATS becomes once it carries that traffic: a tier of message-bus capacity planning that grows in lockstep with ingest volume, an on-call surface for "the live-tail bus is degraded," and a coupling between the ingest gateway and the query engine that breaks the actor-per-service model the platform was deliberately built around.

The alternative is Server-Sent Events with a one-second poll of ClickHouse, advancing a timestamp > last_seen cursor each tick:

sql := "SELECT timestamp, severity, service_name, body, trace_id, span_id, attributes " +
    "FROM logs WHERE tenant_id = ? AND timestamp > ?"
for {
    select {
    case <-ctx.Done():
        return
    case <-pollTicker.C:
        args[1] = lastSeen
        rows, err := client.Query(ctx, sql, args...)
    }
}

The polling design takes advantage of a ClickHouse property: the ORDER BY key on the logs table is (tenant_id, service_name, timestamp). A query of the shape WHERE tenant_id = ? AND timestamp > ? reads exactly one granule per tenant. The cost of a single poll is one ClickHouse round-trip with sub-millisecond execution. At one thousand simultaneous live-tail connections — more than any single tenant in production realistically generates — the workload is one thousand queries per second, which ClickHouse handles below its idle baseline.

SSE itself is the right transport for unidirectional server-to-client streaming. It runs over plain HTTP, requires no protocol upgrade, integrates with existing CORS and authentication middleware, and survives every reverse proxy and load balancer in the ecosystem. The HTML5 specification ratified SSE in 2009; every browser shipped since 2011 supports it. WebSockets would be the alternative, but WebSockets buy bidirectional communication that the live-tail workload does not need, at the cost of protocol-upgrade machinery that the live-tail workload would have to debug.

The design carries one operationally important header:

w.Header().Set("X-Accel-Buffering", "no")

The X-Accel-Buffering: no header tells nginx (and proxies that respect the convention) not to buffer the response body. Without it, the SSE stream would deliver in batches sized to the proxy's buffer, defeating the live-tail premise. The convention is documented in the nginx wiki and is the kind of small operational detail that distinguishes a working SSE implementation from a frustrating one.

Self-Observability: The Platform That Watches Itself

The discipline of observability applies recursively. A platform that observes other systems must observe itself with at least the same rigour, because the moment the platform's own behaviour becomes opaque, the operator loses confidence in everything it reports about its customers. Brendan Gregg's USE method (Utilisation, Saturation, Errors) and the Google SRE book's Four Golden Signals (Latency, Traffic, Errors, Saturation) are the canonical frameworks; both prescribe that every service should expose enough telemetry for an operator to answer the question "is this service healthy right now?" without reading source code.

Prometheus Metrics as Internal Currency

Every service in the platform exposes a /metrics endpoint in the Prometheus text format. The instrumentation is grouped into a small, stable taxonomy:

MetricTypeLabels
observex_ingest_signals_received_totalCountertenant, type
observex_ingest_signals_dropped_totalCountertenant, type, reason
observex_ingest_pipeline_queue_depthGauge—
observex_wal_write_secondsHistogram—
observex_clickhouse_inflight_batchesGauge—

The reason label on signals_dropped_total carries one of a fixed enumeration — overload, no_tenant, decode, actor_full, sampled_out, wal_error — so a drop in throughput can be attributed to a specific stage of the pipeline within seconds of observing it. The taxonomy follows the discipline articulated in Prometheus: Up & Running (Brazil, O'Reilly 2018): metric names should describe what is being measured, label names should partition the measurement, and label values should be drawn from a bounded set known at design time.

The deliberate constraint on label cardinality applies to the platform's own metrics as much as to its customers'. A signals_received_total counter labelled by tenant_id is acceptable because the tenant set is bounded. The same counter labelled by request_id would be a self-inflicted cardinality bomb.

pprof, Flame Graphs, and Honest Profiling

Beyond metrics, every service optionally exposes Go's net/http/pprof handlers behind a configuration gate (OBSERVE_X_PPROF_ENABLED=true). The gate exists because pprof exposes detailed runtime information — heap allocation sites, goroutine stacks, CPU profiles — that is invaluable to an operator and dangerous to expose to the public internet. The gate defaults to off; an operator must consciously opt in, typically through a sidecar that exposes the pprof port only on the cluster's internal network.

The output of a CPU profile is best visualised as a flame graph (Brendan Gregg's 2011 invention, documented at brendangregg.com/flamegraphs.html). A flame graph shows where time is being spent in the call stack as a horizontal bar chart, with stack depth on the Y axis and time-share on the X axis. Engineers reading a flame graph can identify a hot function in seconds; engineers reading the same data as a textual profile take minutes and frequently reach the wrong conclusion.

The discipline is to profile under realistic load, not in a synthetic micro-benchmark. The platform's tests/benchmarks/ directory contains workload generators that exercise the ingest pipeline at production-shape ratios of metrics to logs to traces, and the profile-driven optimisations that have shipped over the platform's lifetime — buffer pooling in the OTLP decoder, a switch from encoding/json to a code-generated serializer in the WAL replay path — all originated from flame graphs collected under those loads.

The Plugin Surface: WASM as a Capability Boundary

An observability platform that serves many tenants will, sooner or later, be asked to support per-tenant business logic. A specific customer wants to drop signals matching a regular expression. Another wants to enrich logs with a lookup against an internal directory. A third wants to compute a domain-specific score and attach it as an attribute. The naïve answer is to add configuration flags to the platform until it accommodates every request. The naïve answer fails the moment the configuration surface exceeds the platform's ability to test it.

Wazero, ABI Discipline, and the Refusal of Native Plugins

The platform's plugin host is built on Wazero, a pure-Go WebAssembly runtime that requires no CGO, no external WASM compiler, and no native shared library loader. Plugins are compiled to .wasm files (typically from Rust, AssemblyScript, or TinyGo source), uploaded through the tenant API, and executed inside a Wazero sandbox with a deliberately minimal ABI: read the signal, return a transformed signal or a drop decision.

The alternative — native Go plugins via plugin.Open — was rejected on three grounds. First, Go's plugin loader requires that the plugin and the host be built with identical compiler versions and dependency graphs, which makes plugin distribution nearly impossible in practice. Second, native plugins share the host's address space; a plugin crash crashes the host. Third, native plugins can call any function in the host process, which makes least-privilege impossible to enforce.

WASM addresses each. The plugin runs in a sandboxed linear-memory region; it crashes within its sandbox without affecting the host. The host exposes only the ABI functions it chooses; a plugin cannot call into the host's storage layer or filesystem. The compiler-version-coupling problem disappears because WASM is a portable bytecode targeted by multiple front-end languages.

The cost of WASM is overhead: the per-invocation latency is higher than a native function call by roughly an order of magnitude. For an observability plugin processing one signal at a time, the absolute overhead is in the low microseconds — invisible against the surrounding pipeline cost. For a plugin processing every signal in a million-signal-per-second pipeline, the overhead is the dominant cost and the plugin design is structurally wrong. The platform's plugin documentation states this trade-off explicitly: plugins are for tenant-specific enrichment and drop logic, not for hot-path transformations that belong in the platform proper.

Concurrency Discipline: Goroutines, Channels, and the Anti-Pattern Library

Go's concurrency primitives are simultaneously the language's greatest strength and the most reliable source of production bugs. Goroutines are cheap, channels are expressive, and the result is that a poorly-disciplined codebase produces a deadlock per release.

The Bounded Channel as a Universal Primitive

Every channel in the platform's hot path is bounded. The pattern is so universal that the codebase treats an unbounded channel as a code smell warranting review:

ingestCh := make(chan signal.Signal, 65536)
out := make(chan signal.Signal, 1024)
mailbox := make(chan signal.Signal, 4096)

The bound is the budget. The budget is profiled. The budget is enforced. The mathematics is the same throughout: a bounded queue with a known service rate has a bounded latency tail; an unbounded queue has an unbounded latency tail and an unbounded memory footprint. Little's Law (the relationship L = λW between queue length, arrival rate, and waiting time, documented in John Little's 1961 paper) is the underlying discipline; every bounded channel in the codebase is a Little's Law statement made executable.

The Context as a Cancellation Mandate

Every function that can block accepts a context.Context as its first argument, and every blocking operation is guarded by a select on ctx.Done():

select {
case out <- sig:
case <-ctx.Done():
    return
}

The discipline is documented in the Go blog post Go Concurrency Patterns: Context (Sameer Ajmani, 2014) and remains the cleanest articulation of the rule. A function that ignores cancellation is a function that holds resources indefinitely on shutdown; a service composed of such functions cannot be drained gracefully and cannot be tested for shutdown correctness. The platform's lint configuration enforces context propagation through staticcheck's SA1029 and SA4006 checks, catching the most common violations at PR time.

sync.Pool and the Allocation Discipline

The OTLP decoder is the platform's allocation-heaviest hot path: every incoming protobuf message produces a graph of small Go objects that are discarded after a single pass through the pipeline. The garbage collector handles this correctly but pays a measurable cost in P99 latency under load. The mitigation is sync.Pool for the decoder's intermediate buffers:

var decoderBufPool = sync.Pool{
    New: func() any { return make([]byte, 0, 4096) },
}

The pool reduces allocation rate on the hot path by an order of magnitude. The cost is that pool users must reset the buffer before returning it (otherwise a stale length leaks across requests) and must not retain references after Put (otherwise the next consumer mutates state they think they own). The pattern is correct, well-documented in the Go runtime source, and dangerous; the platform's code reviews treat pool changes with the same care as lock-free data structures.

Schema Evolution Without Downtime

A platform with state is a platform that will need to migrate its state. The naïve approach — take a downtime window, run ALTER TABLE, take the application back up — is unacceptable for a multi-tenant platform whose customers expect 24/7 ingest. The discipline is the expand-contract pattern, articulated in Pramod Sadalage and Scott Ambler's Refactoring Databases (Addison-Wesley 2006) and refined by Martin Fowler in his article Evolutionary Database Design.

Expand, Migrate, Contract

The pattern has three phases. Expand adds the new column or table without removing the old; both are written, only the old is read. Migrate backfills data from the old representation to the new and switches reads to the new. Contract removes the old representation only after the new is confirmed correct in production.

For ClickHouse, expand-contract is straightforward because ALTER TABLE ADD COLUMN is a metadata-only operation that does not rewrite existing parts. For PostgreSQL, the operation is more involved — adding a column with a default value rewrites the table — but the same discipline applies. The platform's migration files (pkg/storage/clickhouse/migrations/*.sql and services/tenant-api/store/migrations/*.sql) are numbered, idempotent, and applied automatically on service start, following the pattern documented in Mark Fussell's Refactoring Databases talks.

The Backward-Compatibility Invariant

The discipline beyond expand-contract is every release must accept the previous release's data. A schema change shipped in release N must be readable by release N-1's code, until release N+1 has confirmed N is stable. This protects against the most common failure mode of database migrations: a forward-compatible change that is silently not backward-compatible, observed only when a rollback is required during an incident.

The pattern generalises beyond databases. The platform's wire protocols (OTLP for ingest, NDJSON for query results, Arrow IPC for binary query responses) are versioned with explicit Content-Type parameters, and every reader is tested against the previous major version's output. The discipline is what enables zero-downtime deployment of a system whose ingest path and storage path can roll at different cadences.

The Economics of Boring Technology

The architectural decisions described in this article share a common pattern: at each significant fork, the platform chose the substrate with the longest track record over the substrate with the most ambitious capability. The choices were:

  • Canvas 2D over WebGL for charting.
  • Server-Sent Events over WebSockets for live tail.
  • Imperative DOM over a virtual-DOM framework for table rendering.
  • JSONB over a normalised relational schema for dashboard layouts.
  • Recursive-descent parsers over generator-driven parsers for the query-language shims.
  • PostgreSQL over a bespoke event store for the control plane.
  • ClickHouse over a vendor-specific time-series database for telemetry storage.
  • Go's standard library net/http and encoding/json over higher-performance third-party alternatives in places where the absolute performance difference is invisible against surrounding I/O.

The argument for this pattern is articulated most clearly in Dan McKinley's 2015 essay Choose Boring Technology: "The nice thing about boringness is that the capabilities of such tools are well understood. But more importantly, their failure modes are well understood." Every one of the substrates listed above has been operationally exercised for at least a decade. Every one has a documented failure mode, a documented recovery procedure, and a population of engineers who have debugged it under production stress.

The cost of boring technology is that it is less exciting to write about. The benefit is that the operator who is paged at three in the morning has a non-zero chance of being able to diagnose the failure without first reading the README for an unfamiliar framework. In the long-run economic accounting of an observability platform, the second consideration dominates.

The Innovation-Token Budget

Dan McKinley's essay introduces the concept of an innovation token: the limited supply of novel-technology adoption a team can metabolise per unit time. Every new framework, every new database, every new language consumes a token. The token budget is small — McKinley estimates three for a typical team — and the cost of overdrawing is a system that no single engineer fully understands.

The platform's innovation tokens were deliberately concentrated. The novel-technology adoption sits where the platform's actual differentiation lives: the ObserveQL planner, the EWMA-baseline sampler, the WASM plugin host, the hot-cold storage policy. Every other layer — the HTTP server, the configuration loader, the dependency injection, the build system, the deployment manifests — is conventional Go with conventional Kubernetes, indistinguishable from what a thousand other Go services have done before it.

The result is a system that an engineer joining the team can read end-to-end in a week. The system's interesting parts are interesting because they need to be, not because the framework imposed novelty as a precondition for participation.

Operational Rigor: Runbooks, Postmortems, and the Discipline of Reproducibility

Code is half of the system. The other half is the humans who operate it. In the high-pressure environment of an observability platform — which is, by construction, the system other engineers turn to when their own systems are misbehaving — operational discipline matters at least as much as architectural discipline.

Runbooks as Code

The platform's docs/runbooks/ directory contains codified responses to every documented failure mode: a stalled S3 lifecycle, a tenant whose mailbox is filling faster than the supervisor can drain it, a ClickHouse merge storm that is causing the query latency to balloon. Each runbook is version-controlled, idempotent, and structured so that a fresh-on-call engineer can execute it without prior context.

The discipline is borrowed from the Google SRE book, where it is stated as a principle: "the runbook is the source of truth for response, the dashboard is the source of truth for diagnosis." A runbook that requires undocumented "tribal knowledge" is a bug, not a runbook. A runbook that produces different outcomes depending on which engineer executes it is a bug, not a runbook.

Blameless Postmortems

Every production incident, regardless of size, produces a postmortem written under the blameless discipline first articulated by John Allspaw in his 2012 essay Blameless PostMortems and a Just Culture. The discipline is not "no-one is held accountable." It is "the system that allowed the failure is held accountable, the people who acted within the system are not." The asymmetry produces honest postmortems; teams that punish individuals for failures produce postmortems that obscure root causes.

The platform's postmortem template asks five specific questions: what was the timeline, what was the impact, what was the root cause, what made the impact larger than the root cause alone would have caused, and what concrete changes will reduce the probability or impact of a recurrence. The questions are derived from the Etsy and Honeycomb postmortem templates, both of which are public and remain the cleanest examples of the form.

The Deployment Path: Helm, ArgoCD, and the GitOps Discipline

A platform is only as reliable as its deployment story. A system whose deployment requires manual steps will eventually be deployed inconsistently; a system whose deployment is a single git push is a system that can be rolled back as fast as it was rolled forward.

The platform ships as a Helm chart in deploy/helm/observex/ that templates every service, every ConfigMap, every Secret reference, every NetworkPolicy. The chart is infrastructure-agnostic: it accepts a PostgreSQL DSN, a ClickHouse address, a Redis address, and a NATS URL as values, and trusts the operator to provide those substrates via whichever managed-service or self-hosted path they prefer. The discipline is documented in the Helm Best Practices guide and is the difference between a chart that ships a working system and a chart that ships an opinion.

GitOps and the Reconciliation Loop

The deployment topology beyond Helm is GitOps, the pattern articulated by Weaveworks in 2017 and now embodied in tools like ArgoCD and FluxCD. The premise is that the desired state of the production cluster lives in a Git repository, and a controller running inside the cluster continuously reconciles the actual state to match. A change to production is a pull request to the infrastructure repository; a rollback is a git revert.

The discipline eliminates configuration drift, the slow accretion of manual kubectl edit operations that diverge the cluster from the documented desired state until the documented state is unrecoverable. The platform's deployment documentation makes the policy explicit: no manual changes to production. If a change is worth making, it is worth committing.

The Cryptographic Boundary: TLS, mTLS, and the Identity Surface

The platform's network boundary is hardened with the cryptographic primitives that the modern web has standardised on: TLS for confidentiality, mutual TLS (mTLS) for peer authentication, and short-lived bearer tokens for authorisation. The choices are deliberate and the implementation discipline is uniform.

TLS by Default, mTLS for Peer Authentication

Every external receiver (HTTP, gRPC) supports TLS via configurable certificate and key paths. Setting OBSERVE_X_TLS_CA_FILE additionally enables mTLS, requiring every client to present a certificate signed by the configured CA. The pattern is the standard one for high-security ingest paths: a leaked API key is a survivable failure (rotate the key, the system recovers), but a leaked API key plus a missing certificate is an exploitable vulnerability.

The internal service-to-service paths use mTLS exclusively, with certificates issued by an internal CA and rotated on a fixed cadence. The pattern is described in detail in the Zero Trust Networks book (Gilman and Barth, O'Reilly 2017) and is the modern default for inter-service security in container-orchestrated environments.

Bearer Tokens and the Key-Hash Discipline

API keys are stored as BLAKE3 digests with an Argon2id-derived fingerprint. The raw key is shown to the operator exactly once at issuance time, and never again. A leaked database backup leaks digests, not keys; an attacker who steals the digest cannot use it to authenticate.

The BLAKE3 choice over SHA-256 is performance-motivated (BLAKE3 is roughly five times faster for the small inputs typical of key validation) and security-equivalent. The Argon2id choice over PBKDF2 or scrypt is the OWASP recommendation as of the latest revision of the Password Storage Cheat Sheet. The defaults — memory cost 64 MB, time cost 3, parallelism 4 — match the OWASP guidance for interactive authentication paths.

Federation and the Distant Future of Cross-Region Observability

The architecture described in this article is a single-region deployment. A multi-region observability platform is harder, and the platform's federation work — a DuckDB-backed executor over S3-exported Parquet files — is the foundation rather than the conclusion.

The principle is federate at the storage layer, not at the query layer. The naïve cross-region approach is to ship every query to every region and merge the results in the gateway. The principled approach is to materialise the data once, in a format that any executor can read, and let each region's executor scan the slice it owns. Parquet is the right materialisation format because it is columnar, compressed, schema-evolution-aware, and supported by every modern analytical engine. DuckDB is the right executor because it can read Parquet directly from S3 without first materialising it to local disk, and because it ships as a single binary with no operational overhead.

The pattern is the same one that powers AWS Athena, Google BigQuery Federation, and Trino: separate the compute from the storage, and let the compute scale independently of the data. The platform's implementation is deliberately minimal — the goal is to validate the pattern, not to ship a cross-region query optimiser — but the substrates are chosen so the pattern can grow without architectural rework.

The Anomaly Detection Surface: ML Without the Hype

Anomaly detection in observability is the field with the largest gap between promise and delivery. The vendor pitch is invariable: "we apply machine learning to find anomalies you would miss." The operational reality is invariable: the ML model produces false positives that the operator silences until the genuine anomaly arrives and is silenced with the false ones.

The platform's approach is deliberately conservative. The default anomaly detector is a rolling z-score: for each metric, maintain an exponentially weighted moving mean and variance, and flag observations whose z-score exceeds a configurable threshold. The implementation is twenty lines of Go; the false-positive rate is bounded and tuneable; the operator can reason about why a specific point was flagged.

For tenants who want more sophisticated detection, the platform ships an ml-anomaly-detector service with a Predictor interface that accepts a feature vector and returns a score. The default predictor is the z-score; an ONNX predictor (gated behind a -tags onnx build flag to avoid the binary-size cost in deployments that do not use it) accepts a pre-trained model and produces inference at sub-millisecond latency per observation.

The discipline is to prefer the model whose output the operator can reason about. A z-score is interpretable in the same vocabulary the operator already uses for statistical reasoning. A deep neural network is interpretable only via post-hoc tooling (SHAP values, LIME, integrated gradients) that adds operational complexity without proportional improvement in detection quality for the typical observability workload. The trade-off is examined in Cynthia Rudin's 2019 paper Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead, which remains the cleanest articulation of why interpretable models should be the default in operational contexts.

The Anti-Architecture: What Was Deliberately Not Built

A coherent architecture is defined as much by what it refuses as by what it includes. The platform deliberately does not include several patterns that appear in competing observability systems, and the absence is intentional in each case.

There is no client-side instrumentation library. The platform speaks OTLP, and OTLP is what every application emits via the OpenTelemetry SDKs that already exist for every major language. Building a competing SDK would be a make-work project whose only effect would be to fragment the ecosystem.

There is no built-in alerting expression language. Alert rules are expressed in PromQL or ObserveQL and evaluated by the alert-manager service against the platform's existing query engine. The alternative — a bespoke alert DSL — would duplicate query-engine functionality in a less general form, and would force operators to learn a third query syntax on top of the two they already know.

There is no proprietary dashboard format. Dashboards are stored as JSON in PostgreSQL JSONB columns, structured similarly enough to Grafana's dashboard JSON that import-and-export between the platform's native UI and Grafana is a matter of remapping panel queries, not of rewriting the dashboard from scratch.

There is no in-house container orchestrator. The platform deploys on Kubernetes, and operators who do not run Kubernetes are directed to Docker Compose for development and to the published Helm chart for production. Building an orchestrator would be a multi-year project orthogonal to the platform's actual value proposition.

The pattern is concentrate the differentiation, commoditise everything else. Each refusal preserves engineering attention for the layers where the platform's choices genuinely matter, and each refusal trusts the broader ecosystem to provide the layers it has already commoditised.

The Long Time Horizon: Engineering for Decade-Scale Operation

The systems that survive are the systems that are still legible to their successors. A platform built on substrates that are stable across decades — POSIX file APIs, HTTP, SQL, the Go language and its standard library, PostgreSQL, ClickHouse — will be operable by engineers who join the project years after its original authors have moved on. A platform built on substrates that turn over every two years — the JavaScript framework du jour, the cloud provider's proprietary database, the in-house RPC protocol — will require a continuous translation effort to remain operable.

The choice is rarely framed explicitly. It manifests as a series of small decisions: which library to import, which abstraction to introduce, which version of a dependency to pin. The cumulative effect of those decisions is what determines whether the system can be picked up by a new engineer in a week or in a quarter. The discipline is the same one Brian Kernighan and Rob Pike articulate in The Practice of Programming (Addison-Wesley 1999): simplicity is the price of correctness.

Conclusion: Observability as Honesty

In the end, the goal of an observability platform is honesty. The systems it observes will fail; the question is whether the platform's report of the failure is a faithful account or a comforting fiction. Every architectural decision examined here — bounded back-pressure, group-committed durability, multi-tenant isolation enforced at the database layer, sampling that prioritises anomalies, supervisors that quarantine rather than retry, storage that distinguishes hot from cold, query languages that fail loud on the unsupported case, visualisation that does not delegate the operator's eye to a third party — is a vote for honesty over convenience.

A platform that drops signals silently is dishonest. A platform that returns query results without an indication that those results are sampled is dishonest. A platform that delegates its visualisation layer to a vendor it does not control is dishonest about who owns the operator's attention. A platform whose own behaviour is opaque to its operators is dishonest about whether its reports about its customers can be trusted.

The architecture is the discipline. The discipline is the architecture. The system that survives is the one that admits its compromises in the source code, in the documentation, in the error messages, and in the metrics it exposes about itself — and that is operable by engineers who join the project long after its first authors have moved on.


References

  • OpenTelemetry Specification — https://opentelemetry.io/docs/specs/otel/
  • Sridharan, C. — Distributed Systems Observability (O'Reilly, 2018) — https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/
  • Majors, Fong-Jones, Miranda — Observability Engineering (O'Reilly, 2022) — https://www.oreilly.com/library/view/observability-engineering/9781492076438/
  • Beyer, Jones, Petoff, Murphy — Site Reliability Engineering (Google / O'Reilly, 2016) — https://sre.google/sre-book/table-of-contents/
  • McKinley, D. — Choose Boring Technology (2015) — https://boringtechnology.club/