Logging in the enterprise

From scattered log files to searchable truth. What a robust log system has to deliver when a single citizen request travels through six services — and the gap between a tidy architecture sketch and a stack you can actually work with during an incident.

Why logging matters — and why it gets underestimated

Logs are the cheapest insurance against the most expensive moment in software: an incident whose cause nobody can reconstruct. Good logs cut the path from "alert" to "cause understood" from hours to minutes — bad ones stretch it to days.

In regulated industries a second layer comes on top: supervisors, data-protection officers and auditors want evidence. Not "we log that somehow", but who accessed which record at which time, which decision was taken, who approved it. Here logs aren't a nice-to-have; they are mandatory evidence.

And yet, in projects, logging is often treated as the thing you "also take care of at the end". The first production incident lays the result bare: unstructured free-text output, scattered across ten hosts, no end-to-end request ID, no clock sync, no clear log levels. What remains is guesswork.

Debugging

Stack traces and state at the point of failure — not a re-creation on a developer's laptop two days later.

Compliance

Audit trail for GDPR, ISO 27001, BAIT, OZG: who accessed what, when, with what justification, from which system.

Performance

Make response times, slow queries and load spikes visible before they turn into a ticket at the service desk.

Security

Spot unusual access patterns, failed logins, brute-force attempts and privilege escalations early.

From monolith to microservice — why old-style logging falls short

In a monolith, logging was trivial. One application, one log file, one grep, and the error was cornered. With microservices that simplicity disappears, and pragmatic shortcuts stop working.

Example: a citizen request "book an appointment at the city office" typically traverses six services today — API gateway, auth service, appointment service, calendar backend, notification, audit log. Each service has its own log file, its own hosts, its own containers, its own timestamps with potentially different time zones. When the request gets stuck at step five with a 502, "grep all logs at once" is no longer an option — even if you had shell access to all ten containers.

Three consequences fall out of this, for any distributed system:

Logs must be centralised. Files scattered across ten containers are worthless during an incident. A central collection point — on-prem or as a service — is no longer optional.
Logs must be structured. Free text cannot be filtered reliably. JSON is the de facto standard, because every field stays queryable without parser acrobatics.
Logs must be correlatable. Without an end-to-end request or trace ID that runs through every service, you get fragments, not a picture.

Rule of thumb

If the answer to "how do I find out what happened on this specific request?" takes longer than two minutes, the log system is not production-ready — no matter how pretty the dashboard.

The three pillars of observability

Logging is only part of the picture. Observability — the ability to reconstruct a system's internal state from its external signals — rests on three pillars that complement, not replace, each other.

Logs document concrete events: "user X started request Y at 14:32, it ended with error code Z". They are read reactively, when something needs clarifying.

Metrics are aggregated time series: "p95 response time over the last five minutes", "current CPU usage", "queue depth". They work proactively — thresholds trigger alerts long before anyone opens a log. Tools: Prometheus, Grafana Mimir, Datadog Metrics.

Traces follow a single request across service boundaries. They show not only that something was slow, but which service contributed which share of the total latency. Tools: OpenTelemetry (the standard for instrumentation), Jaeger or Tempo (for storage and visualisation).

Build only one pillar, and you fly blind during an incident: metrics show that something is off; logs show what happened; traces show where the bottleneck sits. A mature setup connects all three — ideally so you can click from an alert into the relevant trace, and from there land on the matching log entry.

Anatomy of a solid log entry and stack layout

A good log entry is a structured message, not a narrative. Format: JSON. Mandatory fields without which the entry is worthless in an incident:

{
  "timestamp": "2026-05-13T14:32:18.421Z",
  "level":     "error",
  "service":   "appointment-service",
  "host":      "appointment-pod-7c4b9",
  "requestID": "8a3f1c2e-94df-4ef8-9f98-9fbe9b9831c9",
  "userID":    "u-h8a9c2",
  "message":   "Database timeout in loadSlots",
  "context": {
    "service_type": "id_card",
    "duration_ms": 5023
  },
  "stack":     "Error: connect ETIMEDOUT 10.0.4.12:5432..."
}

timestamp — ISO 8601 with milliseconds, always in UTC. Local time is a mess in a distributed system.
level — debug · info · warn · error. More levels just confuse. Production typically runs at info or higher.
service and host — where the entry was emitted. In a container world, add the pod name or container ID.
requestID — a UUID propagated through every service. The single most valuable lever during an incident (see the practice section).
userID — pseudonymised (hash, opaque ID). Never the email address in plain text (see GDPR).
message and context — the actual message plus structured metadata. Ten fields too many beats two fields too few — storage is cheaper than the next incident.
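
A minimal sketch of how a producer could assemble such an entry in plain Node.js, without a logging library. The helper names (`buildLogEntry`, `isEnabled`, `log`) and the `LOG_LEVEL` environment variable are assumptions for illustration; the field names follow the example above.

```javascript
const os = require('node:os');

// Severity order for the four levels named above.
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

// Hypothetical helper: assembles one entry with the mandatory fields.
function buildLogEntry({ level, service, requestID, userID, message, context = {}, error }) {
  if (!Object.hasOwn(LEVELS, level)) {
    throw new Error(`unknown log level: ${level}`);
  }
  const entry = {
    timestamp: new Date().toISOString(), // ISO 8601 with milliseconds, always UTC ("Z")
    level,
    service,
    host: os.hostname(),                 // inside a Kubernetes pod this is the pod name by default
    requestID,
    userID,                              // expected to be pseudonymised before it gets here
    message,
    context,
  };
  if (error) entry.stack = error.stack;
  return entry;
}

// True when an entry at `level` passes the configured minimum level.
function isEnabled(level, minLevel) {
  return LEVELS[level] >= LEVELS[minLevel];
}

// Writes one JSON object per line to stdout, where the collector picks it up.
// LOG_LEVEL is an assumed variable name, not a standard.
function log(fields, minLevel = process.env.LOG_LEVEL ?? 'info') {
  if (!isEnabled(fields.level, minLevel)) return false;
  process.stdout.write(JSON.stringify(buildLogEntry(fields)) + '\n');
  return true;
}
```

In practice a library such as Pino or Winston takes over this job; the sketch only shows which fields have to end up in every line.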

The stack — four layers

A central log stack consists of four layers, regardless of the specific tools:

Producer — the application
Every application writes structured JSON logs to stdout or a file. Libraries: Winston or Pino (Node.js), Logback or Log4j2 (Java), structlog (Python), Serilog (.NET). For HTTP access logs in Express, Morgan complements a logger such as Winston — Morgan captures incoming requests, Winston handles everything domain-related.
Collector — the shipper
Collects logs from all hosts and containers, then parses, filters and enriches them. Logstash (in the ELK stack), Promtail (in the Loki stack), Fluent Bit or Vector (vendor-neutral). On Kubernetes the collector typically runs as a DaemonSet: one pod per node, tailing the log files of all pods on that node.
Storage — the searchable archive
Elasticsearch indexes every field; powerful, but memory- and storage-hungry. Grafana Loki indexes only the labels (e.g. service, level) and stores the actual log text compressed — cheaper to operate, but with less ad-hoc full-text search. Cloud alternatives: Datadog, Splunk, AWS CloudWatch Logs.
Visualisation — the frontend
Kibana is the frontend for Elasticsearch. Grafana is more universal — it can query Loki, Elasticsearch, Prometheus and many more sources side by side. In setups that want logs, metrics and traces on the same screen, Grafana is usually the better choice, simply because all three pillars live in one UI.
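
To make the collector layer less abstract, here is a sketch of the parse, filter and enrich logic in plain JavaScript. It is not how Logstash, Fluent Bit or Vector are configured (those use their own config formats); `processLine` and the `nodeMeta` fields are hypothetical names that only illustrate what happens to each line.

```javascript
// Hypothetical collector step: parse one raw line, drop what is below the
// minimum level, and enrich the rest with metadata about where it came from.
function processLine(rawLine, nodeMeta, minLevel = 'info') {
  const order = ['debug', 'info', 'warn', 'error'];

  let entry;
  try {
    entry = JSON.parse(rawLine);
  } catch {
    entry = null;
  }
  // Non-JSON output (e.g. from a legacy library) is wrapped rather than lost.
  if (entry === null || typeof entry !== 'object' || Array.isArray(entry)) {
    entry = { level: 'info', message: rawLine, parseError: true };
  }

  // Filter: entries with a known level below the minimum are dropped.
  // Entries with an unknown level are kept, so nothing disappears silently.
  const rank = order.indexOf(entry.level);
  if (rank !== -1 && rank < order.indexOf(minLevel)) {
    return null;
  }

  // Enrich: add fields the application itself cannot know reliably.
  return { ...entry, node: nodeMeta.node, cluster: nodeMeta.cluster };
}
```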

Stack comparison: ELK vs. Grafana Loki

Two open-source stacks have established themselves. Which one fits depends mainly on search requirements, volume and budget.

ELK stack — Elasticsearch, Logstash, Kibana

since 2010 · Elastic

When to use it

When ad-hoc full-text search over large log volumes is business-critical — e.g. security forensics, compliance research, fraud analysis. ELK shines when you don't know in advance what you'll be searching for.

How it works

Logstash receives logs (e.g. via TCP or Beats), parses, filters and writes them to Elasticsearch. There every field is indexed — full-text search, aggregations and KQL queries are the payoff. Kibana provides the web frontend with dashboards, discover view and alerting.
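
Because every field is indexed, a lookup by field is a plain filter. The sketch below builds a search body in Elasticsearch's Query DSL for one request ID. The helper name and defaults are hypothetical, and the `requestID.keyword` sub-field assumes Elasticsearch's default dynamic mapping for strings; with an explicit `keyword` mapping the field would simply be `requestID`.

```javascript
// Hypothetical helper: search body for "all entries of one request, oldest first".
// The body would be sent as JSON to the _search endpoint of the log index.
function buildRequestSearch(requestID, { from = 'now-24h', to = 'now', size = 500 } = {}) {
  return {
    size,
    sort: [{ timestamp: 'asc' }],
    query: {
      bool: {
        // filter clauses match exactly and skip relevance scoring
        filter: [
          { term: { 'requestID.keyword': requestID } },
          { range: { timestamp: { gte: from, lte: to } } },
        ],
      },
    },
  };
}
```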

Strengths

powerful full-text search · rich aggregations · broad plugin ecosystem · standard in many enterprise environments

Weaknesses

high RAM and storage footprint · complex operations (cluster management, index lifecycle) · licensing questions for advanced features (SIEM, ML)

Grafana Loki + Promtail + Grafana

since 2018 · Grafana Labs

When to use it

When logs are queried mostly along a few well-known dimensions (service, level, time window), and full-text search usually happens within those filters. Loki is the cost-effective choice for mid-sized stacks that build their observability around Grafana.

How it works

Promtail tails local log files, tags each entry with labels (service=appointment, level=error) and ships it to Loki. Loki indexes only the labels; the log text itself is stored, compressed, in an object store (S3, MinIO). Queries use LogQL, Loki's query language.
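
The split between indexed labels and unindexed text shows up directly in a LogQL query: the label selector in braces narrows the streams, and the `|=` line filter then scans the remaining text. A sketch that builds such a query and the URL for Loki's `query_range` HTTP endpoint; the helper name, base URL and label values are assumptions.

```javascript
// Hypothetical helper: builds a LogQL query from labels plus an optional
// substring filter, and the matching URL for Loki's query_range endpoint.
function buildLokiQuery(baseUrl, labels, contains, { start, end, limit = 100 }) {
  // Label selector, e.g. {service="appointment", level="error"}
  const selector = Object.entries(labels)
    .map(([key, value]) => `${key}=${JSON.stringify(value)}`)
    .join(', ');

  let logql = `{${selector}}`;
  if (contains) {
    logql += ` |= ${JSON.stringify(contains)}`; // line filter on the unindexed text
  }

  const url = new URL('/loki/api/v1/query_range', baseUrl);
  url.searchParams.set('query', logql);
  // Timestamps are sent as Unix epoch in nanoseconds.
  url.searchParams.set('start', String(BigInt(start.getTime()) * 1000000n));
  url.searchParams.set('end', String(BigInt(end.getTime()) * 1000000n));
  url.searchParams.set('limit', String(limit));

  return { logql, url: url.toString() };
}
```

The narrower the label selector, the less text Loki has to scan, which is why the choice of labels matters more here than in Elasticsearch.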

Strengths

much cheaper to operate · same Grafana UI for metrics, traces and logs · simpler operations profile · Kubernetes-native

Weaknesses

full-text search beyond labels is slower · less mature reporting and SIEM features · younger ecosystem

Cloud and SaaS alternatives

For teams without the capacity to run a stack themselves, Datadog, Splunk Cloud, Sumo Logic, AWS CloudWatch Logs or Azure Monitor deliver storage and visualisation as a service. The trade-off: higher per-GB running cost and less sovereignty over the data — which can be a problem for sensitive logs (personal data, banking data). In tightly regulated environments we always weigh the deployment model against the relevant data classification before recommending the SaaS path.

Practice: correlation, GDPR, retention

Three disciplines that are easy to forget at setup time, but make all the difference once the system runs.

Request correlation via trace ID

The one technique that really carries weight in a distributed system: each request gets a unique ID at the entry point, and that ID travels through every service. Three steps:

Generate the ID at the entry point
The API gateway generates a UUID for every incoming request and puts it into the request context (HTTP header X-Request-ID or MDC in Java). If the caller already provided an ID, it's adopted — so traces extend across system boundaries.

Propagate the ID

Every internal call (HTTP, queue, gRPC) carries the ID in a header or message property. OpenTelemetry, Spring Cloud Sleuth or Micrometer Tracing do this automatically once they're wired in — which dramatically reduces the manual effort.

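const { randomUUID } = require('node:crypto');

// API gateway: adopt the caller's ID if present, otherwise generate one at entry
app.use((req, res, next) => {
  req.requestID = req.headers['x-request-id'] ?? randomUUID();
  next();
});

// Inside a route handler: pass the ID along on the call to the persistence service.
// `http` is assumed to be an axios-style client, ORDER_URL the target service's URL.
const newOrder = await http.post(ORDER_URL, req.body, {
  headers: { 'x-request-id': req.requestID }
});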

Write the ID into every log entry
Every logger reads the ID from the context and writes it as a requestID field into each JSON entry. In Kibana or Grafana, a search requestID:"8a3f…" then surfaces every entry of that request across all services in chronological order — the most expensive question in operations becomes a matter of seconds.
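
In Node.js, one way to let the logger read the ID "from the context" is `AsyncLocalStorage` from the standard library, roughly the counterpart to MDC in Java. A minimal sketch; `requestContextMiddleware` and `logWithContext` are hypothetical names, and a real setup would hand the ID to the logging library instead of writing to stdout directly.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

// One store per request; it follows the request through callbacks and awaits.
const requestContext = new AsyncLocalStorage();

// Express-style middleware: opens the context, adopting the caller's ID if present.
function requestContextMiddleware(req, res, next) {
  const requestID = req.headers['x-request-id'] ?? randomUUID();
  requestContext.run({ requestID }, next);
}

// Hypothetical logger: reads the ID from the context, so no caller passes it by hand.
function logWithContext(level, message, context = {}) {
  const store = requestContext.getStore();
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    requestID: store?.requestID ?? null, // null outside a request (startup, cron jobs)
    message,
    context,
  };
  process.stdout.write(JSON.stringify(entry) + '\n');
  return entry;
}
```

With the middleware registered via `app.use(requestContextMiddleware)`, any `logWithContext(...)` call further down the handler chain carries the same `requestID` without extra parameters.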

GDPR — what is allowed in logs

Logs are a data-processing activity under GDPR. Whoever writes personal data into logs has to justify it (purpose limitation), be able to delete it (right to be forgotten) and keep it to the minimum (data minimisation). Concrete rules that hold up in regulated projects:

No plaintext names, no passwords, no payment details in logs. Ever. Not even in the error message "wrong password for user@company.com" — that's exactly the entry an auditor will pick on.
Pseudonymise user IDs. An opaque hash or an internal ID, not the email address. A separate, access-controlled mapping table resolves the pseudonym during an incident — under four-eyes review.
Audit each field for necessity. Date of birth, address, IBAN belong in a log only when a concrete purpose (e.g. auditing a booking) requires them — and only in the service that owns that purpose.
Define retention explicitly. 30 days for debug logs, 90 days for access logs, one year for security-relevant audit logs is a typical frame. The exact numbers depend on legal retention requirements and the data-protection officer — what matters is that they are defined, not implicit.
Encryption in transit and at rest. TLS from producer to storage; encryption at rest. In a cloud setup the question of who controls the keys is part of the decision.
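
Two of these rules can be enforced in code before an entry is written. The sketch below pseudonymises a user identifier with a keyed hash (HMAC-SHA256, one possible choice for the "opaque hash") and masks sensitive fields in the context object. The function names, the `u-` prefix and the list of sensitive keys are assumptions for illustration.

```javascript
const { createHmac } = require('node:crypto');

// Keyed hash: the same user always maps to the same pseudonym, but without the
// secret key nobody can recompute the pseudonym from a guessed email address.
function pseudonymise(userIdentifier, secretKey) {
  const digest = createHmac('sha256', secretKey).update(userIdentifier).digest('hex');
  return `u-${digest.slice(0, 16)}`;
}

// Illustrative list only; the real one comes out of the per-field necessity audit.
const SENSITIVE_KEYS = new Set(['password', 'email', 'name', 'iban', 'cardNumber']);

// Returns a copy with sensitive fields masked, including nested objects and arrays.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        [key, SENSITIVE_KEYS.has(key) ? '[REDACTED]' : redact(inner)]
      )
    );
  }
  return value;
}
```

Running `redact` centrally in the logger, rather than trusting every call site, is what keeps a stray `email` field out of the logs. The HMAC key itself belongs in a secret store with the same access control as the mapping table.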

Volume and cost — the underestimated lever

Logs grow. A single microservice with a modest 50 requests per second and a 1 KB JSON entry per request produces roughly 4.3 GB per day, about 130 GB per month. Multiply by ten services — and the log stack is suddenly the most expensive line item in the cluster. Three levers, in order of impact:

Use log levels strictly. debug only in development, info only for events that matter in audits. Production usually runs at info and up; levels can be set per service via environment variables, without a code change or new build.
Sampling for high-volume events. If a health check fires five times per second, 1 in 100 is plenty. For errors: never sample — every error counts.
Tiered retention. Hot (searchable): 7 days. Warm (compressed, slower): 30 days. Cold (object archive, e.g. S3 Glacier): as legally required. That pushes 80 % of the volume into the cheapest storage tier without losing analysability.
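
The arithmetic and the levers above fit into a few lines, which makes it easy to plug in your own numbers. A sketch with hypothetical function names; 1 KB is taken as 1000 bytes, and the tier boundaries are the example values from the list.

```javascript
// Daily log volume in GB (decimal) for a request rate and an average entry size.
function estimateDailyGB(requestsPerSecond, bytesPerEntry) {
  const SECONDS_PER_DAY = 86_400;
  return (requestsPerSecond * bytesPerEntry * SECONDS_PER_DAY) / 1e9;
}

// Sampler for one high-volume event type: keeps 1 in `rate`, but never drops errors.
function createSampler(rate) {
  let counter = 0;
  return function shouldKeep(level) {
    if (level === 'error') return true;
    counter += 1;
    return (counter - 1) % rate === 0; // keeps the 1st, (rate + 1)th, ... event
  };
}

// Storage tier for an entry of a given age, using the example boundaries above.
function storageTier(ageDays) {
  if (ageDays <= 7) return 'hot';
  if (ageDays <= 30) return 'warm';
  return 'cold';
}

const perServiceGB = estimateDailyGB(50, 1000); // 4.32 GB per day, about 130 GB per month
```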

Recommendation

In projects we typically start with Grafana Loki + Promtail + Grafana, plus OpenTelemetry for trace instrumentation — cheap to run, Kubernetes-native, and one UI shows logs, metrics and traces. Where full-text search or SIEM features are required, we run Elasticsearch alongside for selected log categories — rather than forcing one stack to do everything, which usually ends up too expensive for one half and too thin for the other.

Log audit or stack consolidation?

We work with your team on your log architecture — structure, correlation, GDPR conformance, cost profile. The outcome: a concrete action plan, tuned to the size and maturity of your platform.

