
Server monitoring with Gatus

When an endpoint stops responding, a certificate is about to expire, or an API suddenly returns an empty body, you want to know before your users do. Gatus is a lightweight open-source tool built for exactly that job: no agents on the target systems, no time-series database, no cloud lock-in. A single binary, one YAML file, a built-in status page.

What it actually does — active probing, not agent telemetry

Gatus is an open-source tool written in Go for synthetic endpoint monitoring. It checks at regular intervals, from the outside, whether a defined endpoint is reachable, returns the right answer, and behaves as expected — and it turns those checks into a status page that can be shared internally or publicly. When something goes wrong, it fires alerts through the usual channels.

The proper framing matters: Gatus is not infrastructure monitoring. It does not replace Zabbix or Prometheus. It does not collect CPU, RAM, or disk metrics from the host, it does not scrape metrics endpoints, and it does not write time series. What Gatus delivers is a time-aggregated "responds — does not respond" per endpoint, enriched with response time, selected body content, and certificate details.

That deliberate restraint is the actual strength. If you operate 200 services across three data centres and want to know whether the customer-facing APIs are reachable from the outside, Gatus gives you a production-ready system in hours. Answering the same question with a classical monitoring stack takes days to weeks. The tool sits squarely in that gap, between the heavyweight infrastructure suites (Zabbix, Nagios, Icinga) and the pure visualisation platforms (Grafana).

Licence and maturity are quietly solid: Apache 2.0, actively maintained, regular releases, a growing user base. No commercial backing, but no hobby repo either — the kind of project you can use in regulated industries without a second thought.

Architecture and data flow

Gatus runs as a single Go binary. No agent, no separate collector, no configuration server. All configuration lives in one YAML file, read at startup and reloaded automatically when it changes. Three components work together.

A scheduler runs a probe loop for every configured endpoint on its configured interval. A conditions engine evaluates the endpoint's response against the declared expectations — status code, body content, response time, certificate lifetime. A storage backend keeps the history of results: in memory (volatile), in a SQLite file (the common choice for single instances), or in a PostgreSQL database (for highly available setups).

On the output side, three consumers sit on equal footing: the built-in web UI, which doubles as the status page; a REST API for external integrations; and the alerting fan-out to Slack, Microsoft Teams, PagerDuty, Opsgenie, email, webhooks, and a dozen further channels. The figure below shows the flow at a glance — from the configuration on the left, through the active probes in the middle, to aggregation and display on the right.

Figure 1 — Data flow in Gatus: a YAML configuration drives the scheduler; the scheduler actively probes the monitored endpoints; responses pass through the conditions engine, are aggregated in storage, and feed the web UI, the REST API, and the alerting channels. All core components live in one Go binary.

Three properties of this architecture matter most in practice. First: everything runs in one process. No deployment diagram with five components, no Helm chart with twelve sub-charts. You start one binary, mount a YAML file, and the system runs — as a Docker container, in Kubernetes with a plain Deployment, behind a reverse proxy, or directly on a small Linux VM.

Second: probing is active and external. Gatus queries the endpoint rather than being fed by an agent on the target system. Two consequences follow. You measure exactly what your users experience — network, TLS handshake, reverse proxy, all included. And you do not have to install software on the monitored system. For external third-party systems — government APIs, banking web services, payment providers — that is the only viable option anyway.

Third: configuration is code. The YAML file lives in a Git repository, is changed via pull request, validated in CI, and rolled out to the running system via GitOps. Reviewable, versioned, reproducible. Anyone working in critical infrastructure or regulated industries recognises the value of that property immediately.
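The single-process deployment described above can be sketched as a Compose file. This is a minimal sketch, not a hardened setup: the image name and the /config mount point follow the Gatus README as far as we know it, while the host directory ./config, the published port, and the restart policy are assumptions to adapt.

```yaml
# docker-compose.yml (sketch; host paths and the image tag are assumptions)
services:
  gatus:
    image: twinproduction/gatus:latest
    ports:
      - "8080:8080"        # web UI, status page, and REST API
    volumes:
      - ./config:/config   # expects ./config/config.yaml on the host
    restart: unless-stopped
```

Because the whole configuration sits in the mounted directory, the same folder can be the Git-tracked artefact that pull requests change.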

Probes and the conditions DSL

Gatus ships a broad set of probe types that cover most real-world cases: HTTP and HTTPS, TCP, UDP, ICMP (ping), DNS, TLS and STARTTLS certificate checks, SSH, gRPC, WebSocket. That selection covers most of what counts as a "checkable endpoint" in an enterprise environment — from a public REST API to an internal LDAP server.

More interesting than the raw probe list is the conditions DSL. Instead of checking only whether an endpoint returns HTTP 200, you can formulate structured expectations.

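endpoints:
  - name: citizen-portal-status
    url: https://portal.exampletown.gov/api/health
    interval: 60s
    conditions:
      - "[STATUS] == 200"                    # HTTP status code
      - "[RESPONSE_TIME] < 800"              # response time in milliseconds
      - "[BODY].status == UP"                # JSON field "status"
      - "[BODY].database.status == UP"       # nested JSON field
      - "[CERTIFICATE_EXPIRATION] > 240h"    # more than 10 days of validity left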

What happens here is more than a classical liveness check. The endpoint is checked for the expected HTTP status code ([STATUS]), for a response within an acceptable time ([RESPONSE_TIME], in milliseconds), for the correct domain values in the returned JSON body ([BODY]…), and for a TLS certificate that remains valid long enough: [CERTIFICATE_EXPIRATION] > 240h means more than ten days remaining. If any of these conditions fails, the endpoint is treated as not healthy.

This DSL replaces a whole class of separate tools. Certificate-expiry monitoring, health-check validation, simple functional tests against production APIs: all in one system, with one syntax. For more complex cases, helper functions are available: len(...) for length checks, pat(...) for pattern matches, and has(...) to test whether a field exists in the response.
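A short sketch of the helper functions in use. The endpoint name, the URL path, and the JSON fields (services, version, errors) are hypothetical and only illustrate the syntax.

```yaml
endpoints:
  - name: citizen-portal-services          # hypothetical endpoint
    url: https://portal.exampletown.gov/api/services
    interval: 5m
    conditions:
      - "[STATUS] == 200"
      - "len([BODY].services) > 0"         # array must not be empty
      - "[BODY].version == pat(2.*)"       # wildcard match on a string field
      - "has([BODY].errors) == false"      # field must be absent
```

Checks like these catch the "HTTP 200 with an empty or wrong body" case that a plain status-code check lets through.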

A second example shows that pure network probes get the same convenience:

endpoints:
  - name: keycloak-ldap-bridge
    url: tcp://ldap.intranet.local:636
    interval: 5m
    conditions:
      - "[CONNECTED] == true"
      - "[RESPONSE_TIME] < 1500"

  - name: portal-certificate
    url: https://portal.exampletown.gov
    interval: 1h
    conditions:
      - "[CERTIFICATE_EXPIRATION] > 336h"

In the first case, Gatus leaves the HTTP world and opens a plain TCP connection to the LDAPS port (636). The condition syntax stays the same, with [CONNECTED] taking the place of [STATUS]. In the second case, an hourly run is enough to check just the remaining certificate lifetime; an expiring certificate becomes visible two weeks in advance, long before a user sees the browser warning.
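Other probe types follow the same pattern. The sketch below shows a DNS, an ICMP, and a STARTTLS check. All host names are hypothetical, and the resolver address 8.8.8.8 is just a convenient public example.

```yaml
endpoints:
  - name: portal-dns
    url: "8.8.8.8"                          # DNS server to query (assumed resolver)
    interval: 5m
    dns:
      query-name: "portal.exampletown.gov"
      query-type: "A"
    conditions:
      - "[DNS_RCODE] == NOERROR"            # the name must resolve

  - name: gateway-ping
    url: "icmp://gateway.intranet.local"    # hypothetical host
    interval: 1m
    conditions:
      - "[CONNECTED] == true"

  - name: mail-starttls
    url: "starttls://mail.exampletown.gov:587"   # hypothetical mail server
    interval: 1h
    conditions:
      - "[CONNECTED] == true"
      - "[CERTIFICATE_EXPIRATION] > 336h"   # two weeks of validity left
```

Only the URL scheme and, for DNS, the extra dns block change from one probe type to the next.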

Practical note

Organise endpoints into groups — one group for "External interfaces", one for "Internal back-office services", one for "Citizen portal". The status page becomes readable that way, and you can later decide which group is visible publicly and which is not. Groups are a single configuration line in Gatus, not a refactoring exercise.
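As a sketch, grouping is one group key per endpoint. The endpoint names and URLs below are hypothetical.

```yaml
endpoints:
  - name: payment-provider
    group: "External interfaces"
    url: https://api.payments.example.com/health    # hypothetical third-party API
    interval: 2m
    conditions:
      - "[STATUS] == 200"

  - name: document-archive
    group: "Internal back-office services"
    url: https://archive.intranet.local/health      # hypothetical internal service
    interval: 2m
    conditions:
      - "[STATUS] == 200"
```

Endpoints that share the same group value are listed together on the status page.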

Alerting, storage, and the status page

Three areas where Gatus goes beyond pure probing — and where its pragmatic defaults pay off.

Alerting

For each endpoint and each channel, two thresholds are configured: a failure-threshold (e.g. alert only after three consecutive failures) and a success-threshold (the number of consecutive successes before the recovery notification is sent). That mechanism reliably suppresses the infamous flapping: a short network hiccup will not turn into a midnight text message. The channels range from Slack, Microsoft Teams, Mattermost, and Discord through email (SMTP), PagerDuty, Opsgenie, Telegram, and Matrix to Twilio (SMS), Pushover, Gotify, Ntfy, and generic webhooks. Generic webhooks open the door to n8n, Zapier, or custom endpoints, for example to automatically open a ticket in an internal ITSM.
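A minimal sketch of the threshold mechanism with Slack as the channel. The webhook URL is a placeholder and the threshold values are illustrative, not a recommendation.

```yaml
alerting:
  slack:
    webhook-url: "https://hooks.slack.com/services/REPLACE/WITH/YOURS"   # placeholder

endpoints:
  - name: citizen-portal-status
    url: https://portal.exampletown.gov/api/health
    interval: 60s
    conditions:
      - "[STATUS] == 200"
    alerts:
      - type: slack
        failure-threshold: 3       # alert after three consecutive failures
        success-threshold: 2       # resolve after two consecutive successes
        send-on-resolved: true     # also send the recovery notification
        description: "Citizen portal health check is failing"
```

With a 60s interval and a failure-threshold of 3, an outage has to persist across three consecutive checks before anyone is notified.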

Storage

Three backends are available. For quick evaluation in development, the in-memory store is enough; after a restart, the history is gone. For single-instance production setups, SQLite is the usual recommendation: a file that lives next to the binary, with no separate database server. If you need high availability or multiple replicas, you attach PostgreSQL. Storage choice is a short YAML block, not an architectural decision.
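A sketch of the storage block, assuming a writable /data directory for SQLite. The PostgreSQL connection string in the commented alternative uses a hypothetical host and placeholder credentials.

```yaml
# Single instance: SQLite file on a persistent volume (path is an assumption)
storage:
  type: sqlite
  path: /data/gatus.db

# Alternative for replicated setups (host and credentials are placeholders):
# storage:
#   type: postgres
#   path: "postgres://gatus:CHANGE_ME@db.intranet.local:5432/gatus?sslmode=require"
```

Omitting the storage block altogether leaves Gatus on the volatile in-memory store.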

Status page

The bundled web UI doubles as the status page. Endpoints can be organised into groups and marked visible or hidden per group. A public status page of the kind SaaS vendors run at status.… is a configuration question, not a separate product. For regulated industries, that matters: the status page runs self-hosted, on your own infrastructure, with no data flowing to a third-party vendor. Anyone running a status page in the public sector or in finance knows the otherwise tedious data-protection discussions with US-based SaaS providers.
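A sketch of basic status-page branding plus per-endpoint display options. It assumes the ui keys as we understand them from the Gatus README, so check them against the version you run. The titles, endpoint name, and URL are hypothetical.

```yaml
ui:
  title: "Exampletown service status"     # browser tab title
  header: "Exampletown service status"    # heading shown on the page

endpoints:
  - name: back-office-erp                 # hypothetical internal service
    group: "Internal back-office services"
    url: https://erp.intranet.local/health
    interval: 2m
    conditions:
      - "[STATUS] == 200"
    ui:
      hide-hostname: true                 # keep internal host names off the page
      hide-url: true                      # keep internal URLs off the page
```

Hiding host names and URLs keeps internal topology out of view when the same instance serves a wider audience.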

Gatus, Zabbix, and Grafana — three tools, three questions

Whoever sees Gatus for the first time almost always asks how it relates to the established heavyweights. The clean answer: the three tools answer different questions and are complementary, not competitors.

Zabbix answers the question: "How are my servers, networks, and services doing from the inside?" It is primarily agent-based. On each monitored system, a Zabbix agent pushes metrics to the central server (or, alternatively, the server polls the agent). It covers CPU load, RAM, disk usage, SNMP data from network hardware, and log file contents, with trigger logic in a dedicated expression language. The system stores metrics in its own database, has a complete web front end, and can auto-discover hosts in large networks. Zabbix is powerful, and the effort matches the power. A production rollout is measured in weeks, not hours.

Grafana answers the question: "How do I visualise and correlate metrics, logs, and traces that other systems collect?" It is primarily a visualisation platform. Its data sources are Prometheus (metrics), Loki (logs), Tempo (traces), InfluxDB, or Elasticsearch. Grafana itself does not probe; it paints dashboards from what other components have collected. (Grafana Synthetic Monitoring and k6 Cloud are separate products that add the probing layer, usually as a cloud service.)

Gatus answers the question: "Are my endpoints reachable and correct from a user's perspective — and where do I show that?" It probes actively, from the outside, against defined URLs and hosts. It has no time-series view, no correlation dashboards, no host agents. It has, instead, a status page, a declarative DSL, and a footprint that runs on a Raspberry Pi.

Side by side — Gatus, Zabbix, Grafana stack

Gatus

Main purpose: endpoint health, status page
Data collection: active external probe
Data model: up/down + response time
Configuration: YAML, declarative
Time-to-value: hours
Footprint: typically < 50 MB RAM

Zabbix

Main purpose: host and network metrics
Data collection: agent push / server pull
Data model: time series, triggers
Configuration: web UI, templates, macros
Time-to-value: weeks
Footprint: multiple GB as full stack

Grafana (+ Prometheus / Loki)

Main purpose: visualisation, correlation
Data collection: pull from metrics endpoints
Data model: time series + logs + traces
Configuration: YAML (Prometheus), UI (Grafana)
Time-to-value: days
Footprint: multiple GB as full stack

A combined setup is the right answer in many projects: Zabbix or Prometheus collects infrastructure metrics, Grafana visualises them, and Gatus answers the simple outside-in question "is it up?" and runs the public status page. The three tools overlap only at the narrow point of the HTTP health check, and there the winner is whichever is faster to deploy and ships a status page along with it.
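One way to wire such a combined setup together, sketched under assumptions: Gatus can expose its own results at /metrics when metrics is enabled, and Prometheus can scrape them so the probe outcomes also show up in Grafana. The target host name is hypothetical, and the two fragments belong in two separate files.

```yaml
# --- Gatus config.yaml (fragment) ---
metrics: true                   # expose probe results at /metrics

# --- prometheus.yml (fragment, separate file) ---
scrape_configs:
  - job_name: gatus
    metrics_path: /metrics
    static_configs:
      - targets: ["gatus.intranet.local:8080"]   # hypothetical host
```

Gatus still does the probing and hosts the status page; Prometheus and Grafana only consume the results.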

When Gatus fits — and when it doesn't

Gatus is not a universal tool, and that is precisely what makes it useful. The following criteria help place the tool in a concrete project.

Fits when …

a concrete set of endpoints — your own APIs, third-party services, TLS certificates, DNS entries — has to be monitored from a user's perspective, without installing an agent on each target system;
a public or internal status page is desired, without an external SaaS provider and without building a custom front end;
configuration should live as code in a repository — GitOps, pull-request reviews, CI validation;
a small, regulated stack is needed that can be operated self-hosted, with no data going to third-party providers;
time-to-value is more important than maximum feature depth: a running system within hours, a consolidated setup within a week.

Less suitable when …

you need deep host telemetry — CPU profiles, RAM trends, disk capacity planning, network flows. That is Zabbix or Prometheus territory.
you want to correlate metrics over time, with dashboards, drill-downs, and SLO calculations. That is Grafana's domain.
you need a multi-region view — Gatus probes from one network location. Answering "how does my API look from Tokyo, Frankfurt, and São Paulo?" needs multiple instances plus aggregation, or a purpose-built SaaS product.
you plan very large setups with several thousand endpoints on short intervals. The scaling limit is not formally documented, so beyond a few hundred endpoints it pays to measure load before committing.

In most of the regulated industries we work in — municipal citizen portals, insurance APIs, banking interfaces — Gatus fits precisely the gap left between Zabbix and Grafana: the outside view of the services your users actually experience. In the second part of this series we revisit the same territory with Zabbix and Grafana — the heavyweight alternative for deeper infrastructure visibility and for setups where probing is only one part of the story.
