Why logging matters — and why it gets underestimated
Logs are the cheapest insurance against the most expensive moment in software: an incident whose cause nobody can reconstruct. Good logs cut the path from "alert" to "cause understood" from hours to minutes — bad ones stretch it to days.
In regulated industries a second layer comes on top: supervisors, data-protection officers and auditors want evidence. Not "we log that somehow", but who accessed which record at which time, which decision was taken, who approved it. Logs here aren't nice-to-have, they are mandatory evidence.
And yet, in projects, logging is often treated as something you "also take care of at the end". The first production incident lays the result bare: unstructured free-text output, scattered across ten hosts, no end-to-end request ID, no clock sync, no clear log levels. What remains is guesswork.
Debugging
Stack traces and state at the point of failure — not a re-creation on a developer's laptop two days later.
Compliance
Audit trail for GDPR, ISO 27001, BAIT, OZG: who accessed what, when, with what justification, from which system.
Performance
Make response times, slow queries and load spikes visible before they turn into a ticket at the service desk.
Security
Spot unusual access patterns, failed logins, brute-force attempts and privilege escalations early.
From monolith to microservice — why old-style logging falls short
In a monolith, logging was trivial: one application, one log file, one grep, and the error was cornered. With microservices that simplicity disappears, and ad-hoc pragmatism no longer gets you to the cause.
Example: a citizen request "book an appointment at the city office" typically traverses six services today — API gateway, auth service, appointment service, calendar backend, notification, audit log. Each service has its own log file, its own hosts, its own containers, its own timestamps with potentially different time zones. When the request gets stuck at step five with a 502, "grep all logs at once" is no longer an option — even if you had shell access to all ten containers.
Three consequences fall out of this, for any distributed system:
- Logs must be centralised. Files scattered across ten containers are worthless during an incident. A central collection point — on-prem or as a service — is no longer optional.
- Logs must be structured. Free text is not filterable. JSON is the standard, because it stays queryable without parser acrobatics.
- Logs must be correlatable. Without an end-to-end request or trace ID that runs through every service, you get fragments, not a picture.
Rule of thumb
If the answer to "how do I find out what happened on this specific request?" takes longer than two minutes, the log system is not production-ready, no matter how pretty the dashboard.
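To make the correlation requirement concrete, here is a minimal sketch in Node.js. The sample entries, the IDs `req-1` and `req-2`, and the function name `timelineFor` are illustrative assumptions, not output from a real system. It shows why a shared request ID plus UTC timestamps is enough to rebuild one request's path from entries that arrived out of order from several services.

```javascript
// Hypothetical sample: entries collected centrally from three services, out of order.
const entries = [
  { timestamp: '2026-05-13T14:32:18.421Z', service: 'appointment-service', requestID: 'req-1', level: 'error', message: 'Database timeout in loadSlots' },
  { timestamp: '2026-05-13T14:32:13.350Z', service: 'api-gateway', requestID: 'req-1', level: 'info', message: 'Request received' },
  { timestamp: '2026-05-13T14:32:13.398Z', service: 'auth-service', requestID: 'req-1', level: 'info', message: 'Token validated' },
  { timestamp: '2026-05-13T14:32:14.002Z', service: 'api-gateway', requestID: 'req-2', level: 'info', message: 'Request received' },
];

// Return all entries of one request in chronological order.
// ISO 8601 timestamps in UTC with a fixed format sort correctly as plain strings.
function timelineFor(requestID, allEntries) {
  return allEntries
    .filter((e) => e.requestID === requestID) // filter() copies, so the input stays untouched
    .sort((a, b) => (a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0));
}

for (const e of timelineFor('req-1', entries)) {
  console.log(`${e.timestamp} ${e.service} [${e.level}] ${e.message}`);
}
```

Without the shared `requestID` the filter has nothing to match on, and with local timestamps in mixed time zones the string sort would silently produce the wrong order.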
The three pillars of observability
Logging is only part of the picture. Observability — the ability to reconstruct a system's internal state from its external signals — rests on three pillars that complement, not replace, each other.
Logs document concrete events: "user X started request Y at 14:32, it ended with error code Z". They are read reactively, when something needs clarifying.
Metrics are aggregated time series: "p95 response time over the last five minutes", "current CPU usage", "queue depth". They work proactively — thresholds trigger alerts long before anyone opens a log. Tools: Prometheus, Grafana Mimir, Datadog Metrics.
Traces follow a single request across service boundaries. They show not only that something was slow, but which service contributed which share of the total latency. Tools: OpenTelemetry (the standard for instrumentation), Jaeger or Tempo (for storage and visualisation).
Build only one pillar, and you fly blind during an incident: metrics show that something is off; logs show what happened; traces show where the bottleneck sits. A mature setup connects all three — ideally so you can click from an alert into the relevant trace, and from there land on the matching log entry.
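The difference between a log entry and a metric can be sketched in a few lines: a metric such as "p95 response time" is an aggregate over many individual events. The example below uses the nearest-rank method, which is one common percentile definition; metrics backends may interpolate or use histograms instead. The sample durations are invented for illustration.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b); // numeric sort on a copy
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical response times in milliseconds from one five-minute window.
const durationsMs = [120, 95, 110, 5023, 130, 101, 99, 140, 105, 115];

console.log('p50:', percentile(durationsMs, 50)); // 110
console.log('p95:', percentile(durationsMs, 95)); // 5023
```

The median looks healthy while p95 exposes the one slow request. The metric tells you that something is off; the log entry with `duration_ms: 5023` tells you what it was.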
Anatomy of a solid log entry and stack layout
A good log entry is a structured message, not a narrative. Format: JSON. Mandatory fields without which the entry is worthless in an incident:
{
  "timestamp": "2026-05-13T14:32:18.421Z",
  "level": "error",
  "service": "appointment-service",
  "host": "appointment-pod-7c4b9",
  "requestID": "8a3f1c2e-94df-4ef8-9f98-9fbe9b9831c9",
  "userID": "u-h8a9c2",
  "message": "Database timeout in loadSlots",
  "context": {
    "service_type": "id_card",
    "duration_ms": 5023
  },
  "stack": "Error: connect ETIMEDOUT 10.0.4.12:5432..."
}
- timestamp: ISO 8601 with milliseconds, always in UTC. Local time is a mess in a distributed system.
- level: debug · info · warn · error. More levels just confuse. Production typically runs at info or higher.
- service and host: where the entry was emitted. In a container world, add the pod name or container ID.
- requestID: a UUID propagated through every service. The single most valuable lever during an incident (see the practice section).
- userID: pseudonymised (hash, opaque ID). Never the email address in plain text (see GDPR).
- message and context: the actual message plus structured metadata. Ten fields too many beats two fields too few; storage is cheaper than the next incident.
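As a rough illustration of how an application could produce entries of this shape, here is a dependency-free sketch. In a real project a library such as Pino or Winston would normally take this role; the function name `createLogger`, the `write` option and the environment variable `LOG_LEVEL` are assumptions made for this example.

```javascript
const os = require('node:os');

const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

// Returns a logger that writes one JSON object per line.
// `write` defaults to stdout and can be swapped out, e.g. in tests.
function createLogger({ service, minLevel = 'info', write = (line) => process.stdout.write(line + '\n') }) {
  const threshold = LEVELS[minLevel] ?? LEVELS.info; // unknown level name: fall back to info

  function log(level, message, fields = {}) {
    if (LEVELS[level] < threshold) return;
    // Well-known fields go to the top level, everything else into `context`.
    const { requestID, userID, error, ...context } = fields;
    write(JSON.stringify({
      timestamp: new Date().toISOString(), // ISO 8601, UTC, millisecond precision
      level,
      service,
      host: os.hostname(),
      requestID,
      userID,
      message,
      context,
      stack: error instanceof Error ? error.stack : undefined, // dropped by JSON.stringify when undefined
    }));
  }

  return {
    debug: (message, fields) => log('debug', message, fields),
    info: (message, fields) => log('info', message, fields),
    warn: (message, fields) => log('warn', message, fields),
    error: (message, fields) => log('error', message, fields),
  };
}

// LOG_LEVEL is an assumed environment variable name; unset means "info".
const logger = createLogger({ service: 'appointment-service', minLevel: process.env.LOG_LEVEL });
logger.error('Database timeout in loadSlots', {
  requestID: '8a3f1c2e-94df-4ef8-9f98-9fbe9b9831c9',
  userID: 'u-h8a9c2',
  service_type: 'id_card',
  duration_ms: 5023,
  error: new Error('connect ETIMEDOUT'),
});
```

Writing to stdout keeps the application unaware of where logs end up; shipping them is the collector's job.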
The stack — four layers
A central log stack consists of four layers, regardless of the specific tools:
Producer — the application
Every application writes structured JSON logs to stdout or a file. Libraries: Winston or Pino (Node.js), Logback or Log4j2 (Java), structlog (Python), Serilog (.NET). For HTTP access logs in Express, Morgan complements a logger such as Winston — Morgan captures incoming requests, Winston handles everything domain-related.
Collector — the shipper
Collects logs from all hosts and containers, parses, filters and enriches them. Logstash (in the ELK stack), Promtail (in the Loki stack), Fluent Bit or Vector (vendor-neutral). On Kubernetes the collector typically runs as a DaemonSet: one collector pod per node, tailing the logs of all pods on that node.
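What "parse, filter, enrich" means can be shown without any specific shipper. The sketch below is not how Logstash, Promtail, Fluent Bit or Vector are configured (each has its own pipeline syntax); it only mirrors the three steps in plain Node.js. The function name `processLine` and the metadata values `worker-3` and `prod-eu` are made up for the example.

```javascript
// Parse one raw line, decide whether to ship it, and enrich what remains.
// Returns null for lines that should not be shipped.
function processLine(rawLine, nodeMeta) {
  let entry;
  try {
    entry = JSON.parse(rawLine);
  } catch {
    entry = null;
  }
  // Parse: anything that is not a JSON object is wrapped so the text is not lost.
  if (entry === null || typeof entry !== 'object' || Array.isArray(entry)) {
    entry = { level: 'info', message: rawLine, unparsed: true };
  }
  // Filter: debug output is dropped before it reaches storage.
  if (entry.level === 'debug') return null;
  // Enrich: record where the line was collected.
  return { ...entry, node: nodeMeta.node, cluster: nodeMeta.cluster };
}

// Hypothetical node metadata and raw lines as a collector might see them.
const nodeMeta = { node: 'worker-3', cluster: 'prod-eu' };
const rawLines = [
  '{"level":"error","service":"appointment-service","message":"Database timeout in loadSlots"}',
  '{"level":"debug","service":"appointment-service","message":"cache miss"}',
  'legacy free-text line from an old component',
];

const shipped = rawLines
  .map((line) => processLine(line, nodeMeta))
  .filter((entry) => entry !== null);
console.log(shipped);
```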
Storage — the searchable archive
Elasticsearch indexes every field; powerful, but memory- and storage-hungry. Grafana Loki indexes only the labels (e.g. service, level) and stores the actual log text compressed — cheaper to operate, but with less ad-hoc full-text search. Cloud alternatives: Datadog, Splunk, AWS CloudWatch Logs.
Visualisation — the frontend
Kibana is the frontend for Elasticsearch. Grafana is more universal — it can query Loki, Elasticsearch, Prometheus and many more sources side by side. In setups that want logs, metrics and traces on the same screen, Grafana is usually the better choice, simply because all three pillars live in one UI.
Stack comparison: ELK vs. Grafana Loki
Two open-source stacks have established themselves. Which one fits depends mainly on search requirements, volume and budget.
ELK stack — Elasticsearch, Logstash, Kibana
since 2010 · Elastic
When to use it
When ad-hoc full-text search over large log volumes is business-critical — e.g. security forensics, compliance research, fraud analysis. ELK shines when you don't know in advance what you'll be searching for.
How it works
Logstash receives logs (e.g. via TCP or Beats), parses, filters and writes them to Elasticsearch. There every field is indexed — full-text search, aggregations and KQL queries are the payoff. Kibana provides the web frontend with dashboards, discover view and alerting.
Strengths
- powerful full-text search
- rich aggregations
- broad plugin ecosystem
- standard in many enterprise environments
Weaknesses
- high RAM and storage footprint
- complex operations (cluster management, index lifecycle)
- licensing questions for advanced features (SIEM, ML)
Grafana Loki + Promtail + Grafana
since 2018 · Grafana Labs
When to use it
When logs are queried mostly along a few well-known dimensions (service, level, time window), and full-text search usually happens within those filters. Loki is the cost-effective choice for mid-sized stacks that build their observability around Grafana.
How it works
Promtail tails local log files, tags each entry with labels (service=appointment, level=error) and ships it to Loki. Loki indexes only the labels; the log text itself is stored, compressed, in an object store (S3, MinIO). Queries use LogQL, Loki's query language.
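The cost difference follows from this design, and a toy model makes it visible. The class below is not Loki's implementation; it only imitates the principle "index the labels, scan the text". In LogQL the equivalent query would look roughly like `{service="appointment", level="error"} |= "timeout"`: a label selector followed by a line filter. The class and method names are invented for the sketch.

```javascript
// Toy model of a label-indexed log store (not Loki's actual implementation).
class LabelIndexedStore {
  constructor() {
    // key: canonical label string, value: { labels, lines }
    this.streams = new Map();
  }

  push(labels, line) {
    const key = Object.keys(labels).sort().map((k) => `${k}=${labels[k]}`).join(',');
    if (!this.streams.has(key)) this.streams.set(key, { labels, lines: [] });
    this.streams.get(key).lines.push(line);
  }

  // Step 1: select streams by label (cheap, few distinct label sets).
  // Step 2: scan only the selected streams for the substring (brute force).
  query(selector, contains = '') {
    const result = [];
    for (const { labels, lines } of this.streams.values()) {
      const selected = Object.entries(selector).every(([k, v]) => labels[k] === v);
      if (!selected) continue;
      for (const line of lines) {
        if (line.includes(contains)) result.push(line);
      }
    }
    return result;
  }
}

const store = new LabelIndexedStore();
store.push({ service: 'appointment', level: 'error' }, 'Database timeout in loadSlots');
store.push({ service: 'appointment', level: 'error' }, 'Slot already taken');
store.push({ service: 'appointment', level: 'info' }, 'timeout setting loaded');
store.push({ service: 'auth', level: 'error' }, 'Token validation timeout');

console.log(store.query({ service: 'appointment', level: 'error' }, 'timeout'));
```

A query with a narrow label selector touches little data. A query with an empty selector degenerates into a scan of everything, which is the weakness listed further down.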
Strengths
- much cheaper to operate
- same Grafana UI for metrics, traces and logs
- simpler operations profile
- Kubernetes-native
Weaknesses
- full-text search beyond labels is slower
- less mature reporting and SIEM features
- younger ecosystem
Cloud and SaaS alternatives
For teams without the capacity to run a stack themselves, Datadog, Splunk Cloud, Sumo Logic, AWS CloudWatch Logs or Azure Monitor deliver storage and visualisation as a service. The trade-off: higher per-GB running cost and less sovereignty over the data — which can be a problem for sensitive logs (personal data, banking data). In tightly regulated environments we always weigh the deployment model against the relevant data classification before recommending the SaaS path.
Practice: correlation, GDPR, retention
Three disciplines that are easy to forget at setup time, but make all the difference once the system runs.
Request correlation via trace ID
The one technique that really carries weight in a distributed system: each request gets a unique ID at the entry point, and that ID travels through every service. Three steps:
Generate the ID at the entry point
The API gateway generates a UUID for every incoming request and puts it into the request context (HTTP header X-Request-ID or MDC in Java). If the caller already provided an ID, it's adopted — so traces extend across system boundaries.
Propagate the ID
Every internal call (HTTP, queue, gRPC) carries the ID in a header or message property. OpenTelemetry, Spring Cloud Sleuth or Micrometer Tracing do this automatically once they're wired in — which dramatically reduces the manual effort.
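const { randomUUID } = require('node:crypto');
// `app` (Express), `http` (an axios-style client) and ORDER_URL are assumed to be defined elsewhere.
// Entry point: adopt the caller's ID if present, otherwise generate one.
app.use((req, res, next) => {
  req.requestID = req.headers['x-request-id'] ?? randomUUID();
  next();
});
// Propagation: forward the ID on every downstream call made while handling `req`.
async function createOrder(req) { // hypothetical helper, called from a route handler
  return http.post(ORDER_URL, req.body, {
    headers: { 'x-request-id': req.requestID }
  });
}
Write the ID into every log entry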
Every logger reads the ID from the context and writes it as a requestID field into each JSON entry. In Kibana or Grafana, a search requestID:"8a3f…" then surfaces every entry of that request across all services in chronological order — the most expensive question in operations becomes a matter of seconds.
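One way to let the logger "read the ID from the context" in Node.js is `AsyncLocalStorage` from `node:async_hooks`, which plays a role similar to MDC in Java. The following is a sketch under that assumption; the helper names `withRequestID`, `logInfo` and `loadSlots` are illustrative, and a logging library may offer its own mechanism.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

const requestContext = new AsyncLocalStorage();

// Run `handler` with a request ID bound to the current async call chain.
function withRequestID(incomingID, handler) {
  return requestContext.run({ requestID: incomingID ?? randomUUID() }, handler);
}

// The logger picks the ID up from the context; callers never pass it along.
function logInfo(message, write = console.log) {
  const store = requestContext.getStore(); // undefined outside of a request
  write(JSON.stringify({
    timestamp: new Date().toISOString(),
    level: 'info',
    requestID: store ? store.requestID : undefined,
    message,
  }));
}

// Hypothetical business function; it knows nothing about request IDs.
async function loadSlots() {
  await new Promise((resolve) => setTimeout(resolve, 10));
  logInfo('slots loaded');
}

withRequestID('8a3f1c2e-94df-4ef8-9f98-9fbe9b9831c9', async () => {
  logInfo('request started');
  await loadSlots();
});
```

In an Express application the `requestContext.run(...)` call would typically wrap `next()` inside the entry middleware, so that every handler and every log call further down the chain sees the same ID.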
GDPR — what is allowed in logs
Logs are a data-processing activity under GDPR. Whoever writes personal data into logs has to justify it (purpose limitation), be able to delete it (right to erasure, the "right to be forgotten") and keep it to the minimum (data minimisation). Concrete rules that hold up in regulated projects:
- No plaintext names, no passwords, no payment details in logs. Ever. Not even in the error message "wrong password for user@company.com" — that's exactly the entry an auditor will pick on.
- Pseudonymise user IDs. An opaque hash or an internal ID, not the email address. A separate, access-controlled mapping table resolves the pseudonym during an incident — under four-eyes review.
- Audit each field for necessity. Date of birth, address, IBAN belong in a log only when a concrete purpose (e.g. auditing a booking) requires them — and only in the service that owns that purpose.
- Define retention explicitly. 30 days for debug logs, 90 days for access logs, one year for security-relevant audit logs is a typical frame. The exact numbers depend on legal retention requirements and the data-protection officer — what matters is that they are defined, not implicit.
- Encryption in transit and at rest. TLS from producer to storage; encryption at rest. In a cloud setup the question of who controls the keys is part of the decision.
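Two of these rules lend themselves to a short sketch: pseudonymising the user identifier and keeping email addresses out of messages. The example assumes a keyed hash (HMAC-SHA-256) as the pseudonymisation method, which is one option among several; the `u-` prefix, the 16-character truncation, the placeholder text and the key value are arbitrary choices for illustration. Since a keyed hash cannot be reversed, the separate mapping table mentioned above is still needed to resolve a pseudonym, and pseudonymised data generally still counts as personal data.

```javascript
const { createHmac } = require('node:crypto');

// Keyed hash: the same input always yields the same pseudonym, but without
// the key nobody can recompute it from a guessed email address.
function pseudonymise(userIdentifier, secretKey) {
  const normalised = userIdentifier.trim().toLowerCase();
  const digest = createHmac('sha256', secretKey).update(normalised).digest('hex');
  return `u-${digest.slice(0, 16)}`;
}

// Deliberately simple pattern: a safety net, not a complete email parser.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function redactEmails(message) {
  return message.replace(EMAIL_PATTERN, '[redacted-email]');
}

// Placeholder key for the example only; a real key belongs in a secret store.
const demoKey = 'example-key-do-not-use';

console.log(pseudonymise('user@company.com', demoKey));
console.log(redactEmails('wrong password for user@company.com'));
```

Redaction at the logger is a last line of defence. The cleaner fix is not to put the address into the message in the first place.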
Volume and cost — the underestimated lever
Logs grow. A single microservice with a modest 50 requests per second and a 1 KB JSON entry per request produces roughly 4.3 GB per day, about 130 GB per month. Multiply by ten services — and the log stack is suddenly the most expensive line item in the cluster. Three levers, in order of impact:
- Use log levels strictly. debug only in development, info only for events that matter in audits. Production typically runs at info and up, so info has to stay sparse; levels can be switched per service via environment variables, without a new build.
- Sampling for high-volume events. If a health check fires five times per second, 1 in 100 is plenty. For errors: never sample, every error counts.
- Tiered retention. Hot (searchable): 7 days. Warm (compressed, slower): 30 days. Cold (object archive, e.g. S3 Glacier): as legally required. That pushes 80 % of the volume into the cheapest storage tier without losing analysability.
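The volume estimate and the sampling lever can both be checked with a few lines of code. The sketch assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and a 30-day month; the function names and the counter-based sampling strategy are illustrative choices, not the behaviour of a particular logging library.

```javascript
// Log volume from request rate and entry size, in decimal GB.
function estimateVolumeGB(requestsPerSecond, bytesPerEntry) {
  const perDay = (requestsPerSecond * bytesPerEntry * 86400) / 1e9;
  return { perDay, perMonth: perDay * 30 };
}

const { perDay, perMonth } = estimateVolumeGB(50, 1000);
console.log(perDay.toFixed(2), 'GB/day');     // 4.32
console.log(perMonth.toFixed(1), 'GB/month'); // 129.6

// Keep one in N entries per message text; errors always pass.
function createSampler(keepOneIn) {
  const counters = new Map();
  return function shouldLog(entry) {
    if (entry.level === 'error') return true; // never sample errors
    const seen = (counters.get(entry.message) ?? 0) + 1;
    counters.set(entry.message, seen);
    return (seen - 1) % keepOneIn === 0; // keeps the 1st, (N+1)th, (2N+1)th ...
  };
}

// 100 seconds of a health check firing five times per second.
const shouldLog = createSampler(100);
let kept = 0;
for (let i = 0; i < 500; i++) {
  if (shouldLog({ level: 'info', message: 'health check ok' })) kept++;
}
console.log(kept, 'of 500 health-check entries kept'); // 5
```

Counting per message text keeps the sketch short. A production sampler would more likely key on an explicit event name, so that variable parts of the message do not defeat the counter.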
Recommendation
In projects we typically start with Grafana Loki + Promtail + Grafana, plus OpenTelemetry for trace instrumentation: cheap to run, Kubernetes-native, and one UI shows logs, metrics and traces. Where full-text search or SIEM features are required, we run Elasticsearch alongside for selected log categories rather than forcing one stack to do everything, which usually ends up too expensive for one half and too thin for the other.