Server monitoring with Zabbix and Grafana

The heavyweight counterpart to lean active probing: an agent-based server stack that collects infrastructure metrics in depth, retains them for years, and surfaces them in a visualisation platform that has long become the de-facto standard. A practical guide covering architecture, communication, step-by-step installation, templates, scaling, and security — the planned second part following our article on Gatus.

Positioning — the heavyweight counterpart

In the first part of this series we introduced Gatus as the tool that checks from the outside whether an endpoint responds, whether a certificate is still valid, and whether a JSON body contains the right fields. A single Go binary, one YAML file, up and running within an hour. What Gatus deliberately does not deliver (host metrics, long-term time series, correlation dashboards, SNMP discovery) is exactly the gap filled by the combination of Zabbix and Grafana.

Zabbix answers the question "how are my servers, networks, and services doing from the inside?" It is a full infrastructure-monitoring system with its own database, its own front end, its own trigger language, agent-based data collection, SNMP support, and auto-discovery for large networks. Grafana complements it with a visualisation layer that has long become a standard — heterogeneous data sources displayed in a consistent dashboard, with alerting, annotations, variables, and drill-downs.

The two tools are not strictly coupled. Zabbix ships its own web front end, in which problems, triggers, and hosts can be viewed alongside time-series graphs. If you only use Zabbix, the bundled front end is enough. The moment data from multiple sources — Zabbix, Prometheus, PostgreSQL, an in-house reporting warehouse — has to flow into a shared view, Grafana becomes the natural meeting point. That combination is the focus of this article.

An honest warning up front: the stack is powerful, but it is also substantial. A production rollout is measured in days to weeks, not hours. The templating model takes getting used to, the trigger syntax has its quirks, and without proper database sizing the history tables grow out of control within a few months. Used well, the stack delivers a depth of telemetry that an endpoint checker, by design, cannot match.

Components and their roles

Before describing the installation, an overview of the components involved is useful — what they do, what they do not do, and what role they play in the bigger picture.

Zabbix Server

The central process. It receives data from agents and proxies, evaluates triggers, writes values into the database, and executes actions (escalations, notifications). The JSON-RPC API is served by the front end rather than by the server process itself. A single server is enough for several thousand monitored hosts; beyond that, proxies or an HA setup come into play (see "Scaling").

Database

Zabbix requires a relational database: MySQL/MariaDB, PostgreSQL, or Oracle (Oracle support is deprecated as of Zabbix 7.0). PostgreSQL is the default recommendation in most new projects, because Zabbix officially supports the TimescaleDB extension, which stores history in time-partitioned, compressed chunks and is far more efficient for long retention than a plain table. The database stores configuration, history (raw values), and trends (hourly minimum, average, and maximum per item).

Zabbix Frontend

A PHP front end running inside an Nginx or Apache web server, working against the database. It is the classical operator interface: host management, triggers, templates, maps, reports. If you work exclusively in Grafana, you still cannot drop the front end — it remains the tool with which the monitoring configuration is maintained.

Zabbix Agent

A small process runs on each monitored host (Zabbix Agent 2 in the current generation, written in Go) and collects metrics: CPU load, memory, disk I/O, network counters, file-system usage, process lists, log files. The agent supports two modes: passive (the server polls the agent at each item's update interval) and active (the agent fetches its item list from the server and sends the values on its own). Active is the default recommendation: robust against firewall topologies and easier on the server.

Zabbix Proxy

Optional, but in larger or geographically distributed setups almost always present. A proxy collects data from agents in one network segment and forwards them in bulk to the central server. Three typical use cases: a site with limited bandwidth (bundling saves traffic), a DMZ without direct server access (the proxy is the only component allowed to reach inwards), or load offload (the proxy relieves the server of polling overhead).

Grafana

A standalone application that uses the Zabbix front end (via its JSON-RPC API) or, directly, the Zabbix database as a data source. It renders dashboards, adds annotations, defines alert rules, and sends notifications through the usual channels (Slack, Teams, email, PagerDuty). Grafana is written in Go and TypeScript, runs as a single process, and its installation is markedly leaner than the Zabbix server's.

Communication and ports

The most important step before any installation is understanding who speaks to whom on which port. In setups with firewalls, DMZs, or strict network segmentation, the rollout fails more often due to a missing firewall rule than to a configuration mistake in the tool itself.

Figure 1 — Components and communication paths: Zabbix agents send their values over TCP 10050/10051 either directly to the server or via a proxy. The server writes to PostgreSQL and exposes a JSON-RPC API used by the front end and Grafana. Browser access runs over HTTPS.

Which ports are actually needed

TCP 10050 — server/proxy → agent (passive mode). The server or proxy opens the connection and requests each value. On the host, the agent must be reachable as a TCP listener on 10050.
TCP 10051 — agent → server/proxy (active mode) and proxy → server. The agent initiates the connection itself. The better choice in restrictive networks, because the agent connects "outward" and does not have to be reachable "from outside".
TCP 5432 — Zabbix server → PostgreSQL (default Postgres port). With MySQL/MariaDB, TCP 3306 instead.
TCP 443 — browser → front end and browser → Grafana, each over HTTPS. The common pattern is a reverse proxy (Nginx, Traefik) with TLS termination in front of both.
TCP 443 — Grafana → Zabbix front end (JSON-RPC API). The same HTTPS endpoint that operators use to access the front end.
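
Before installing anything, each path in the list above can be probed from the machine that will initiate the connection. The following Python sketch is not part of Zabbix; the helper name tcp_reachable is made up for illustration. It only confirms that a TCP handshake succeeds, not that the expected service is answering behind the port.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and performs the handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Run on an agent host, tcp_reachable("zabbix.intern.example.com", 10051) checks the active-mode path; run on the server, tcp_reachable("db-prod-01.intern.example.com", 10050) checks the passive-mode path. Both host names are the example names used in this article.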

Active vs. passive — the most important design decision

In new setups, practically all agents operate in active mode. Three reasons: (a) the firewall only needs to allow outbound connections to the server, not inbound connections to every host; (b) the server is relieved of the polling overhead; (c) active items keep collecting values while the server is briefly unreachable, because the agent buffers locally and replays the queue. Passive items have their place when a specific value is to be queried ad hoc, or when server-pull logic is preferred for organisational reasons, but as a default for 95 % of hosts, active is the right choice.
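
The buffering argument is easiest to see in a toy model. The Python sketch below is not the agent's actual implementation; the class name ActiveBuffer and its methods are hypothetical. It assumes a bounded queue that discards the oldest value when full and delivers values in order once the server answers again.

```python
from collections import deque


class ActiveBuffer:
    """Toy model of an agent-side send buffer for active checks."""

    def __init__(self, capacity: int):
        # When the queue is full, the oldest value is discarded.
        self._queue = deque(maxlen=capacity)

    def collect(self, key: str, value: float, clock: int) -> None:
        """Store one measured value together with its timestamp."""
        self._queue.append((key, value, clock))

    def flush(self, send) -> int:
        """Deliver queued values in order; stop at the first failure."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break  # server unreachable: keep the remaining values
            self._queue.popleft()
            sent += 1
        return sent
```

The capacity is the design choice that matters: it decides how long an outage the agent can bridge before values are lost.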

Installation, step by step

The following guide describes a production-grade single-server installation on Ubuntu 24.04 LTS with Zabbix 7.x, PostgreSQL as the database, and Grafana 11.x as the visualisation platform. All commands run as root or with sudo. Before starting, host name, FQDN, and time synchronisation (NTP) should be in place.

Database and Zabbix server

Step 1 — add the repository and install the packages. Zabbix maintains its own package sources for every supported distribution; the project website has a generator that produces the right commands for any version. For Ubuntu 24.04 with Zabbix 7.0:

wget https://repo.zabbix.com/zabbix/7.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_latest+ubuntu24.04_all.deb
dpkg -i zabbix-release_latest+ubuntu24.04_all.deb
apt update

apt install -y zabbix-server-pgsql zabbix-frontend-php \
                zabbix-nginx-conf zabbix-sql-scripts \
                zabbix-agent2 postgresql

Step 2 — create the PostgreSQL database and user. As the Postgres system user:

sudo -u postgres createuser --pwprompt zabbix
sudo -u postgres createdb -O zabbix zabbix

Step 3 — import the initial schema. The Zabbix package ships a compressed SQL dump containing the table structure and the default templates:

zcat /usr/share/zabbix-sql-scripts/postgresql/server.sql.gz \
  | sudo -u zabbix psql zabbix

Step 4 — optional, but strongly recommended for more than a couple of dozen hosts: enable the TimescaleDB extension and convert the history and trends tables. TimescaleDB is a Postgres extension that turns these tables into hypertables, with automatic partitioning by time and compression of older values. For large setups, the database size can shrink by a factor of five to ten. The timescaledb-2-postgresql-16 package comes from Timescale's own apt repository, which has to be added beforehand:

apt install -y timescaledb-2-postgresql-16
echo "shared_preload_libraries = 'timescaledb'" \
  >> /etc/postgresql/16/main/postgresql.conf
systemctl restart postgresql

sudo -u postgres psql zabbix \
  -c "CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;"

cat /usr/share/zabbix-sql-scripts/postgresql/timescaledb/schema.sql \
  | sudo -u zabbix psql zabbix

Step 5 — configure the Zabbix server. The configuration file lives at /etc/zabbix/zabbix_server.conf. At minimum, set the database password:

DBHost=localhost
DBName=zabbix
DBUser=zabbix
DBPassword=<the password you just set>

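Step 6 — configure the front end. The package ships a prepared Nginx server block at /etc/zabbix/nginx.conf in which the listen and server_name directives are commented out. Uncomment and adjust both (the patterns below tolerate the whitespace padding in the shipped file), then start the web server:

sed -i -E 's|^#[[:space:]]*listen[[:space:]]+8080;|        listen 80;|' \
       /etc/zabbix/nginx.conf
sed -i -E 's|^#[[:space:]]*server_name[[:space:]]+example\.com;|        server_name zabbix.intern.example.com;|' \
       /etc/zabbix/nginx.conf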

systemctl restart zabbix-server zabbix-agent2 nginx php8.3-fpm
systemctl enable  zabbix-server zabbix-agent2 nginx php8.3-fpm

Step 7 — open the front end in the browser and walk through the setup wizard. On first visit, Zabbix checks the PHP settings, asks for the database connection details, and writes the configuration to /etc/zabbix/web/zabbix.conf.php. The default operator account is Admin with the password zabbix; change it right after the first login.

Practical note

Before going into production, put the front end behind a reverse proxy with TLS termination (Nginx, Traefik, Caddy). The bundled configuration runs over plain HTTP by default, which is fine for internal tests but never for an operator login with passwords sent through a browser.

Zabbix agent on the monitored hosts

On every host that should be monitored, a dedicated agent process runs. On an Ubuntu 24.04 host (for other releases, pick the matching release package):

wget https://repo.zabbix.com/zabbix/7.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_latest+ubuntu24.04_all.deb
dpkg -i zabbix-release_latest+ubuntu24.04_all.deb
apt update
apt install -y zabbix-agent2

Adjust the configuration file /etc/zabbix/zabbix_agent2.conf — the three most important parameters:

Server=zabbix.intern.example.com
ServerActive=zabbix.intern.example.com
Hostname=db-prod-01.intern.example.com

Server allows inbound connections from the named host (passive mode). ServerActive is the peer for active mode — the agent connects outbound to this server. Hostname must match the host name configured in the front end exactly (case-sensitive), otherwise active items will not be associated.

Start the agent and enable it at boot:

systemctl restart zabbix-agent2
systemctl enable  zabbix-agent2

# Smoke test from the Zabbix server (requires the zabbix-get package):
zabbix_get -s db-prod-01.intern.example.com -k system.uname

The host then has to be created in the front end either manually or via auto-registration. Auto-registration is the more elegant option in larger setups: the agent reports to the server, which assigns it to a host group and a template based on a host-metadata string. That saves maintenance effort as soon as several dozen hosts are involved.
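
The matching logic behind auto-registration can be pictured as a small rule table. On the agent side, the string comes from the HostMetadata parameter in zabbix_agent2.conf; on the server side, the rules are auto-registration actions defined in the front end. The Python sketch below only models that idea. The Rule class, the assign function, and the group and template names are illustrative assumptions, not Zabbix code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hypothetical auto-registration action."""
    metadata_contains: str  # condition: substring of the agent's host metadata
    host_group: str         # operation: add the host to this group
    template: str           # operation: link this template


# Illustrative rules; group and template names are examples only.
RULES = [
    Rule("postgres", "Databases", "PostgreSQL by Zabbix agent 2"),
    Rule("linux", "Linux servers", "Linux by Zabbix agent active"),
]


def assign(host_metadata: str, rules=RULES):
    """Return (group, template) pairs for every rule whose condition matches."""
    return [(r.host_group, r.template)
            for r in rules if r.metadata_contains in host_metadata]
```

An agent configured with the metadata string "linux postgres prod" would match both rules here, while a host reporting "windows" would match none and stay unassigned.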

Grafana and the data source

Grafana is also installed from the official repository:

apt install -y apt-transport-https software-properties-common wget
mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key \
  | gpg --dearmor | tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] \
      https://apt.grafana.com stable main" \
  > /etc/apt/sources.list.d/grafana.list

apt update
apt install -y grafana

systemctl enable --now grafana-server

Grafana then listens on TCP 3000. First login with admin / admin, change the password immediately. Then install the Zabbix plugin — the official plugin alexanderzobnin-zabbix-app is fetched through the Grafana CLI:

grafana-cli plugins install alexanderzobnin-zabbix-app
systemctl restart grafana-server

Enable the plugin in the Grafana UI (Administration → Plugins and data → Plugins → Zabbix → Enable; older Grafana versions list it under Configuration → Plugins) and configure it as a data source: URL of the Zabbix front end including /api_jsonrpc.php, plus user name and password of a Zabbix user with read-only rights (ideally a service account created specifically for Grafana, not the admin). A test-connection click shows immediately whether the API is reachable and authentication works.
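
What the data source does under the hood is an HTTP POST with a JSON-RPC 2.0 body. The Python sketch below shows the shape of such a request so that connection problems can be reproduced outside Grafana. It assumes an API token has been created for the read-only service user and is sent as a Bearer header, which recent Zabbix releases accept; the function names build_request and call are made up for illustration.

```python
import json
import urllib.request


def build_request(api_url: str, token: str, method: str, params: dict,
                  request_id: int = 1) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC 2.0 request for the Zabbix API."""
    payload = {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": request_id,
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json-rpc",
            # Assumption: token-based auth for a read-only service user.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def call(request: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON-RPC response."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A request for the host list would be built with build_request("https://zabbix.intern.example.com/api_jsonrpc.php", token, "host.get", {"output": ["hostid", "host"]}) and sent with call(...). A response containing a "result" key means URL, TLS, and credentials are all in order.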

Behind a reverse proxy, Grafana should know its own URL prefix so internal links are correct. In /etc/grafana/grafana.ini:

[server]
domain = monitoring.intern.example.com
root_url = https://monitoring.intern.example.com/grafana/
serve_from_sub_path = true

Templates, items, and triggers

The actual monitoring configuration plays out across three concepts that Zabbix separates cleanly: items describe what is measured; triggers describe when the measurement counts as a problem; templates bundle both into reusable packages.

Items — individual measurements

An item is one concrete measurement on a host. Zabbix has well over a dozen item types; the most important in practice are Zabbix Agent, SNMP Agent, HTTP Agent (for REST APIs), Database Monitor (direct SQL query against a database), and Calculated (derived from other items). Every item carries a key, a Zabbix-specific notation that describes function and parameters:

system.cpu.util[,user]            # CPU utilisation in user mode
vfs.fs.size[/,pused]              # usage of / in percent
net.if.in[eth0]                   # incoming bytes on eth0
proc.num[postgres]                # number of Postgres processes
log[/var/log/syslog,error,,,skip] # new log lines containing "error"

Sampling frequency is configurable per item; typical values are 30 seconds for CPU/RAM items and 5 minutes for disk usage. Polling every item every 10 seconds shows up as runaway database growth within the first three months.
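
How much the interval matters becomes obvious with a back-of-the-envelope estimate. The Python sketch below assumes roughly 90 bytes per stored numeric value, a rule of thumb that varies with database engine and value type, and a hypothetical fleet of 200 hosts with 100 items each. The function name history_bytes is made up for illustration.

```python
def history_bytes(items: int, interval_s: float, retention_days: int,
                  bytes_per_value: int = 90) -> float:
    """Estimate raw history size: one stored value per item per interval."""
    values_per_second = items / interval_s
    return values_per_second * 86_400 * retention_days * bytes_per_value


GIB = 1024 ** 3
# Hypothetical fleet: 200 hosts x 100 items, 30 days of raw history.
at_10s = history_bytes(20_000, 10, 30) / GIB
at_60s = history_bytes(20_000, 60, 30) / GIB
print(f"10 s interval: {at_10s:.0f} GiB, 60 s interval: {at_60s:.0f} GiB")
```

Under these assumptions the 10-second variant needs six times the storage of the 60-second one, before indexes and without compression. The estimate ignores trends and text or log items, so it is a lower bound rather than a sizing plan.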

Triggers — problem detection

A trigger is a boolean expression over one or more items. Once the expression turns true, the host enters the "in problem" state. The trigger syntax supports functions such as last(), avg(), min(), max(), change(), each with a time window. Two typical examples:

last(/db-prod-01/vfs.fs.size[/,pused]) > 85
# / is more than 85% used

avg(/db-prod-01/system.cpu.util[,user],5m) > 80
  and avg(/db-prod-01/system.cpu.load[all,avg5],5m) > 4
# 5-minute CPU average above 80% AND load average above 4

Triggers carry a severity level (Not classified, Information, Warning, Average, High, Disaster) and can have dependencies: if the other triggers on a host are made dependent on its "host unreachable" trigger, a downed machine raises one problem instead of a hundred follow-up alarms.
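
The semantics of last() and avg() with a time window can be modelled in a few lines. The Python sketch below is a simplified illustration, not Zabbix's evaluation engine: samples are hand-written (timestamp, value) pairs, and the handling of missing data is ignored entirely.

```python
from statistics import mean


def last(samples):
    """Most recent value; samples are (unix_time, value), oldest first."""
    return samples[-1][1]


def avg(samples, window_s, now):
    """Mean of all values whose timestamp lies in the last window_s seconds."""
    recent = [v for t, v in samples if now - window_s < t <= now]
    return mean(recent)


now = 1_000
cpu_user = [(now - 240, 85.0), (now - 180, 90.0), (now - 120, 70.0),
            (now - 60, 95.0), (now, 88.0)]
load_avg5 = [(now - 240, 3.0), (now - 120, 5.0), (now, 5.5)]
disk_pused = [(now - 300, 84.0), (now, 86.5)]

# First example: the latest disk usage value decides on its own.
disk_problem = last(disk_pused) > 85
# Second example: both five-minute averages must exceed their threshold.
cpu_problem = avg(cpu_user, 300, now) > 80 and avg(load_avg5, 300, now) > 4
```

With these sample values both expressions are true. The single CPU reading of 70 does not clear the second one, because the window average still sits above 80; that damping is the reason to prefer avg() over last() for noisy metrics.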

Templates — reuse

A template is a collection of items, triggers, graphs, and discovery rules that can be attached to many hosts. Zabbix ships with a few hundred templates out of the box — for Linux and Windows hosts, for databases (Postgres, MySQL, Oracle, MSSQL), for web servers (Nginx, Apache), for containers (Docker), for network hardware (Cisco, Juniper, HP, MikroTik), and for many application stacks. These templates are usable as a starting point but in most setups should be forked into your own variants, because the default thresholds rarely match your operating reality.

Low-level discovery

One of the more powerful features: one item returns a list, and Zabbix automatically creates matching items, triggers, and graphs for each list entry. The classical example — file-system discovery. The agent reports all mounted file systems, and Zabbix creates a "usage in percent" item with a fitting trigger per file system. The same principle works for network interfaces, SNMP tables (e.g. switch ports), JMX beans, database tables, or custom script outputs. Anyone who understands LLD writes substantially less configuration boilerplate.
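
For custom discovery, the script behind the discovery item only has to print a JSON array in which every object carries the LLD macros. The Python sketch below shows that output format with hard-coded sample mounts; the function name lld_filesystems is hypothetical, and a real script would read the mounted file systems from the operating system.

```python
import json


def lld_filesystems(mounts):
    """Turn (mount_point, fs_type) pairs into low-level discovery JSON."""
    rows = [{"{#FSNAME}": path, "{#FSTYPE}": fstype}
            for path, fstype in mounts]
    return json.dumps(rows)


# Hard-coded sample input for illustration.
print(lld_filesystems([("/", "ext4"), ("/var/lib/postgresql", "xfs")]))
```

An item prototype with the key vfs.fs.size[{#FSNAME},pused] then expands into one concrete item per discovered mount point.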

Practical note

Version templates and their forks as code in a Git repository. Zabbix can export and import templates as YAML, XML, or JSON, which lets you review changes, keep test and production environments in sync, and roll back cleanly. Maintaining configuration exclusively through the UI loses this property immediately.

Dashboards in Grafana

With the Zabbix plugin and a configured data source, Grafana dashboards can be built directly on top of Zabbix items. Three patterns illustrate the spectrum.

The host-overview dashboard

A row of panels with the four classical metrics (CPU, RAM, disk, network) as stat panels or time series. A Grafana variable $host selects the displayed host; the dropdown is populated from Zabbix host groups. A single dashboard then shows the same perspective for every monitored server. This approach is the natural replacement for the "one graph per server" mania of older monitoring tools.

The service dashboard

One dashboard per business application — say "citizen portal" — that combines items from multiple hosts and components in a single view: API response time, Postgres connection count, front-end memory usage, number of open Zabbix problems. Dashboards like this often end up on the wall of the operations centre. They only work when the underlying items are named consistently — the discipline you invest in templating pays off directly here.

The SLO dashboard

A dashboard that visualises an SLO definition (Service Level Objective): availability over the last 30 days, error-budget burn, latency percentiles. Dashboards of this kind often outgrow Zabbix as a single source and combine multiple data sources: Zabbix for infrastructure, Loki or Elasticsearch for application logs, perhaps Prometheus for application metrics. This is exactly where Grafana proves its value as a shared meeting point.

Alerting: Zabbix or Grafana?

Both tools can fire alerts. In practice, the following split has proven itself: Zabbix alerts on infrastructure events (host down, disk full, Postgres connections exhausted) — that is, where the data originate. Grafana alerts on composite conditions derived from multiple data sources, and on SLO violations that are not modelled as Zabbix items. Enabling both without a clean separation produces double alerting and pager fatigue.

Scaling and high availability

A single Zabbix server handles several thousand hosts in practice, provided the database is properly sized. Beyond that, additional measures become necessary. Three levers are the relevant ones.

Proxies in larger or distributed setups

A Zabbix proxy is a dedicated process that monitors a subset of hosts and delivers the collected data in bulk to the server. Three typical use cases:

Geographically distributed sites. A remote site with limited bandwidth gets its own proxy. The proxy collects data locally and sends them compressed and bundled to the central system, which cuts traffic substantially compared to individual agent connections.
DMZ setup. DMZ servers are not allowed to reach the internal Zabbix server. A proxy in the DMZ is the only component allowed to reach inward — the agents only talk to the proxy.
Load offload. With thousands of items at intervals of a few seconds, the server process itself becomes the bottleneck. Multiple proxies, each handling part of the host population, relieve the server process so it only needs to evaluate triggers and write to the database.

Database partitioning with TimescaleDB

The history tables are the largest tables in any serious Zabbix setup, because they hold raw values per item across the retention period. Without partitioning, they become a bottleneck quickly: every DELETE pass of the housekeeping task can then take hours. TimescaleDB solves that by splitting each table into time-based chunks. Chunks that age get compressed (a factor of ten is typical), and chunks that are no longer needed can be dropped in milliseconds instead of being deleted row by row.
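
Why dropping a chunk is so much cheaper than deleting rows can be shown with a small in-memory model. The Python sketch below is not TimescaleDB code; the function names and the one-day chunk size are assumptions for illustration. Rows are grouped by time bucket, and expiry removes whole buckets without touching individual rows.

```python
from collections import defaultdict

DAY = 86_400


def chunk_start(ts: int, chunk_s: int = DAY) -> int:
    """Start of the time chunk that a timestamp falls into."""
    return ts - ts % chunk_s


def insert(chunks, ts, value, chunk_s=DAY):
    """Append a row to its chunk; chunks maps chunk start to a row list."""
    chunks[chunk_start(ts, chunk_s)].append((ts, value))


def drop_expired(chunks, now, retention_s, chunk_s=DAY):
    """Drop every chunk that lies entirely before the retention cutoff."""
    cutoff = now - retention_s
    expired = [start for start in chunks if start + chunk_s <= cutoff]
    for start in expired:
        del chunks[start]  # one operation per chunk, not per row
    return len(expired)
```

The cost of drop_expired grows with the number of chunks, not with the number of rows inside them, which is the property that makes housekeeping cheap on a partitioned table.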

HA mode of the Zabbix server

Since version 6.0, Zabbix supports a native HA mode: multiple server instances share the same database and determine, through heartbeats stored there, which one is currently active. If the active server fails, a standby takes over once the failover delay has expired (one minute by default, configurable). This HA mode does not replace database HA: Postgres has to be made highly available separately, via replication or a cluster manager like Patroni. But it solves the problem of a crashed server process cleanly, without requiring external cluster software like Pacemaker.
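
The takeover decision can be pictured as a heartbeat comparison. The Python sketch below is a deliberately simplified model and not how the Zabbix server implements it; the Node class, the elect function, and the 60-second delay are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_seen: float      # time of the node's last heartbeat in the database
    active: bool = False


def elect(nodes, now, failover_delay=60.0):
    """Return the node that should be active, given heartbeat timestamps."""
    current = next((n for n in nodes if n.active), None)
    if current is not None and now - current.last_seen <= failover_delay:
        return current  # the active node is still healthy
    # Active node missing or stale: promote the freshest live standby.
    live = [n for n in nodes
            if not n.active and now - n.last_seen <= failover_delay]
    if not live:
        return None
    if current is not None:
        current.active = False
    winner = max(live, key=lambda n: n.last_seen)
    winner.active = True
    return winner
```

The model also shows why database HA is a separate problem: every node reads the heartbeats from the same database, so if that database is gone, no election can take place at all.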

Security, retention, and operations

The operational strength of the stack shows in the fact that it can run stably for years — provided a few quirks are kept in mind during setup.

TLS between server, proxy, and agent

Agent-to-server communication is unencrypted by default. In fully segregated networks that may be acceptable, but in any setup that touches public or shared infrastructure even partially, TLS needs to be enabled. Zabbix supports two modes: PSK (pre-shared key — a symmetric secret per host or per group) and certificates (asymmetric authentication via X.509). PSK is much easier to maintain and sufficient for 90 % of setups.

# On the agent host:
openssl rand -hex 32 > /etc/zabbix/zabbix_agent2.psk
chmod 0600 /etc/zabbix/zabbix_agent2.psk
chown zabbix:zabbix /etc/zabbix/zabbix_agent2.psk

# In /etc/zabbix/zabbix_agent2.conf:
TLSConnect=psk
TLSAccept=psk
TLSPSKIdentity=db-prod-01
TLSPSKFile=/etc/zabbix/zabbix_agent2.psk

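The matching PSK identity and key are entered for the host in the front end (host configuration, Encryption tab); mass update applies them to many hosts at once. Once the agent has been restarted, connections without a correct PSK are rejected by the server.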
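
When keys are generated by a provisioning script instead of by hand, a format check before rollout catches most mistakes. The Python sketch below assumes the limits documented for Zabbix PSKs, 32 to 512 hexadecimal digits, and adds an even-length check on the assumption that the key must encode whole bytes. The function names new_psk and valid_psk are hypothetical.

```python
import re
import secrets

_HEX = re.compile(r"[0-9a-fA-F]+")


def new_psk(num_bytes: int = 32) -> str:
    """Generate a random PSK as a hex string (32 bytes = 256 bits)."""
    return secrets.token_hex(num_bytes)


def valid_psk(psk: str) -> bool:
    """Check the format: hex digits only, 32 to 512 of them, even count."""
    return (_HEX.fullmatch(psk) is not None
            and 32 <= len(psk) <= 512
            and len(psk) % 2 == 0)
```

new_psk() produces the same kind of value as the openssl command above; valid_psk() can run in a deployment pipeline before the key file is written to the host.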

Authentication in the front end

The Zabbix front end supports LDAP and SAML integration as well as HTTP auth. In enterprise setups, one of these mechanisms should be enabled so that user accounts do not live in Zabbix's own storage but are sourced from the central identity provider (Active Directory, Keycloak, Entra ID). Multi-factor authentication is supported in the front end from version 7 onwards.

The same applies to Grafana login: configure OAuth/OIDC, disable local users. Both platforms are operator tools — login discipline must meet industry standards.

Retention and housekeeping

Two knobs govern how long data is kept: history (raw values, typically 7–30 days) and trends (hourly minimum, average, and maximum values, typically 365 days to 5 years). Both are configured per item, with defaults normally set in the template; the global housekeeping settings can override them. The housekeeping task runs inside the server process and deletes expired rows. With TimescaleDB, this step becomes a simple chunk drop and is effectively free in runtime terms.
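
What survives once raw history has expired is one trend record per item and clock hour. The Python sketch below illustrates that aggregation; it is a model of the idea, not Zabbix's implementation, and the function name to_trends is made up.

```python
from collections import defaultdict

HOUR = 3_600


def to_trends(samples):
    """Aggregate (unix_time, value) samples into one record per clock hour."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % HOUR].append(value)
    return {
        hour: {
            "num": len(vals),
            "min": min(vals),
            "avg": sum(vals) / len(vals),
            "max": max(vals),
        }
        for hour, vals in sorted(buckets.items())
    }
```

The trade-off is visible in the record itself: a spike still shows up in the hourly maximum, but its exact time and duration within the hour are gone. Items whose individual values matter for audits therefore need a longer history period, not just long trends.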

If compliance requirements affect the retention period — BAIT, KRITIS, or ISO 27001 typically expect one year for audit-relevant events — set the history retention for the affected items accordingly and document that decision in an internal service register. Zabbix does not tell you on its own which items count as "important".

Backups

The most important piece first: the database. A pg_dump of a mid-sized Zabbix database takes minutes to hours depending on size. For 24/7 setups, incremental Postgres backups with pgBackRest or WAL archiving are the better fit. The configuration under /etc/zabbix and the exported templates from the Git repository cover the rest — the server process itself is stateless.

When the stack fits — and when it doesn't

Zabbix and Grafana form a powerful, established stack — but they are not the right answer to every monitoring question. An honest placement at the end.

Fits when …

deep host and network telemetry is needed — CPU, RAM, disk, SNMP, log files, process lists — on your own infrastructure, agent-based;
a large number of hosts or a distributed topology with DMZ or remote-site setup has to be monitored, and proxies are welcome as the natural scaling mechanism;
templates and auto-discovery should shrink the configuration workload to a manageable size;
a time-series history over years has to be retained — for capacity planning, audits, or compliance;
a shared visualisation across Zabbix, Prometheus, Loki, and application databases should converge in a single tool.

Less suitable when …

you simply want to know whether a few public APIs are reachable from a user's perspective and you want to run a status page — there Gatus is productive in a fraction of the time;
you need a fully SaaS-driven monitoring solution without operating your own infrastructure — Datadog, New Relic, or Dynatrace are the obvious answers;
you run a cloud-native setup with container orchestration in which Prometheus and its ecosystem (Alertmanager, Thanos, Cortex) is already the standard — Zabbix can be integrated, but it is not the idiomatic choice;
you cannot bring the operational discipline required for a server stack with its own database, its own updates, and its own sizing — the tool punishes neglect more than a single-binary tool like Gatus.

In most of the regulated industries we work in, Zabbix is still the right tool for the infrastructure view — combined with Grafana for visualisation and, typically, complemented by a lightweight endpoint monitor such as Gatus for the outside view and the public status page. The three tools are complementary; each answers a different question. Anyone who masters all three has the full monitoring toolbox at hand — from "is the API up?" through "is Postgres burning through its connections?" to "how has the average CPU load developed over the last twelve months?"
