
Grafana, Prometheus and Loki: Self-Hosted Monitoring Stack


Monitoring is often neglected in SMB environments: either it is missing entirely, or an expensive SaaS subscription grows more expensive every year as hosts and logs pile up. Yet a full observability stack with Grafana, Prometheus and Loki runs on a single Linux VM, covering metrics, logs, alerting and dashboards. In this article we show what such a stack looks like in 2026, how much storage to plan for 30-day retention, and which pitfalls we have run into in customer projects.

The idea behind it is simple: Prometheus collects metrics (CPU, RAM, disk, network, SMART, SNMP), Loki collects logs (syslog, journald, container logs), Grafana visualises both and triggers alerts. Everything as containers, everything versioned, everything reproducible.

Architecture and sizing of the monitoring VM

For a typical SMB customer with 20 to 50 monitored hosts (Proxmox nodes, TrueNAS, OPNsense, switches, Windows servers) a single VM is completely sufficient. We recommend a Debian 12 or Ubuntu 24.04 LTS VM on the Proxmox cluster with the following specs:

| Component | Sizing | Note |
|---|---|---|
| vCPU | 4 cores | enough for 50 targets at 15s scrape |
| RAM | 8 GB | Prometheus 3 GB, Loki 2 GB, Grafana 1 GB |
| Boot disk | 32 GB | OS, Docker, compose files |
| Data disk | 200-500 GB | TSDB plus Loki chunks, see storage sizing |
| Network | 1 GbE | more than enough |

The data disk is deliberately attached as a separate virtual drive so that a VM snapshot does not bloat with TSDB content. On the storage layer the data disk ideally lives on a ZFS pool with SSDs — the Prometheus TSDB is random-write-heavy and does not like spinning disks.

Docker Compose layout

We bundle the entire stack in a single compose file under /opt/observability/. The advantage: updates, backups and restore all flow through a single path. Configuration files are bind-mounted, the data lives on the separate data disk under /var/lib/observability/.

services:
  prometheus:
    image: prom/prometheus:v3.2.1
    volumes:
      - ./prometheus:/etc/prometheus:ro
      # Data directory must be writable by the container user (nobody, UID 65534)
      - /var/lib/observability/prometheus:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      # If both limits are set, whichever is reached first triggers cleanup
      - --storage.tsdb.retention.time=30d
      - --storage.tsdb.retention.size=120GB
    restart: unless-stopped

  loki:
    image: grafana/loki:3.4.1
    volumes:
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml:ro
      # Data directory must be writable by the Loki user (UID 10001)
      - /var/lib/observability/loki:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.5.0
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      # Data directory must be writable by the Grafana user (UID 472)
      - /var/lib/observability/grafana:/var/lib/grafana
    environment:
      # The __FILE suffix makes Grafana read the value from the secret file
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin
    secrets:
      - grafana_admin
    restart: unless-stopped

  snmp-exporter:
    image: prom/snmp-exporter:v0.28.0
    volumes:
      - ./snmp:/etc/snmp_exporter:ro
    restart: unless-stopped

secrets:
  grafana_admin:
    file: ./secrets/grafana_admin.txt

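Important: neither Prometheus nor Loki gets a ports: entry that exposes it to the outside. Interactive access goes exclusively through Grafana, and Grafana itself sits behind a reverse proxy (Caddy or Traefik) with a Let’s Encrypt certificate and basic auth or OIDC. The one exception is log ingestion: promtail agents on other hosts must be able to reach Loki's push endpoint on port 3100, so make that port reachable from the internal network only, either by publishing it on an internal interface or by routing it through the reverse proxy.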
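The promtail configuration later in this article pushes to port 3100 on the monitoring VM. One way to make that reachable without opening Loki to the world is a small override file next to the main compose file. This is only a sketch: the address 10.0.10.5 is a placeholder for the VM's IP on an internal management network and is not taken from the setup above.

```yaml
# docker-compose.override.yml (sketch; the IP address is a placeholder)
services:
  loki:
    ports:
      # Publish the push endpoint on the internal management IP only,
      # not on all interfaces
      - "10.0.10.5:3100:3100"
```

Docker Compose merges an override file of this name automatically, so the main compose file stays unchanged. A firewall rule that limits port 3100 to the monitored hosts is still advisable.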

Prometheus scrape configuration in practice

The prometheus.yml is the heart of the stack. We recommend splitting it into logical job groups rather than dumping everything into a flat list. For a typical customer it looks like this:

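global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    site: neuburg-hq

scrape_configs:
  # Host metrics; targets are maintained in separate files (file-based discovery)
  - job_name: node
    file_sd_configs:
      - files: ['/etc/prometheus/targets/node-*.yml']

  # Proxmox VE metrics, collected through a PVE exporter acting as a proxy
  - job_name: proxmox-pve
    metrics_path: /pve
    static_configs:
      - targets: [pve01.intern, pve02.intern, pve03.intern]
    params:
      module: [default]
    relabel_configs:
      # Same proxy pattern as the SNMP job below: pass the node as ?target=
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Assumption: the exporter runs as a container named pve-exporter on
      # port 9221; it is not part of the compose file above
      - target_label: __address__
        replacement: pve-exporter:9221

  # Switches are queried via SNMP through the snmp-exporter container
  - job_name: snmp-switches
    static_configs:
      - targets: [sw-core.intern, sw-acc01.intern, sw-acc02.intern]
    metrics_path: /snmp
    params:
      module: [if_mib]
    relabel_configs:
      # Pass the switch as ?target= and keep it as the instance label
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # The actual scrape goes to the exporter, not to the switch
      - target_label: __address__
        replacement: snmp-exporter:9116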
The split via file_sd_configs has a huge benefit: new hosts can be added without restarting Prometheus. Appending an entry to the YAML file is enough, because Prometheus watches the target files and applies changes as soon as they are written; as a fallback it also re-reads them at the refresh_interval (5 minutes by default). For SNMP monitoring of switches we use the if_mib module of snmp-exporter, which delivers interface counters, error counters and link status. More on network integration in our article on OPNsense.
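For illustration, a target file matched by the node-*.yml pattern could look like the following sketch. The file name, host names, port 9100 and the role label are assumptions chosen for the example, not values from a real setup.

```yaml
# /etc/prometheus/targets/node-linux.yml (hypothetical file name)
- targets:
    - srv-app01.intern:9100   # hypothetical hosts, exporter port assumed
    - srv-db01.intern:9100
  labels:
    role: linux-server        # extra label attached to every target in this group
```

Grouping targets by a label such as role keeps dashboards and alert rules filterable without touching prometheus.yml itself.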

Loki and promtail for log aggregation

Loki is the third pillar. Unlike ELK, Loki does not index the log content, only the labels, which makes storage consumption roughly an order of magnitude smaller. For most SMB use cases (audit logs, auth logs, container logs) that is perfectly adequate. A small promtail agent runs on each monitored host and ships journald entries and selected files to Loki.

A lean promtail-config.yml on a Linux host looks like this:

server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://monitoring.intern:3100/loki/api/v1/push
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
        # ${HOSTNAME} is only expanded if promtail runs with -config.expand-env=true
        host: ${HOSTNAME}
    relabel_configs:
      - source_labels: [__journal__systemd_unit]
        target_label: unit

On TrueNAS systems, middlewared logs and SMB audit logs can also be picked up by promtail. This pays off as soon as a customer needs an auditable trail of who accessed what. Details on the storage platform on our TrueNAS page.

Storage sizing for 30-day retention

The question we hear most often: “How big does the data disk have to be?” The answer depends on the number of metrics and the log volume. From our projects, the following rules of thumb have emerged:

| Component | Rule of thumb | Example: 30 hosts |
|---|---|---|
| Prometheus TSDB | approx. 1.5 bytes per sample, 1500 series per node | ~50 GB for 30 days |
| Loki chunks | approx. 10 % of raw log volume after compression | ~30 GB at 10 GB logs/day |
| Grafana DB | ~500 MB | negligible |
| Buffer and WAL | 20 % reserve | ~16 GB |
| Total recommendation | | 200 GB data disk |
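The Loki, buffer and total rows follow directly from the rules of thumb. A short worked calculation, taking the table's Prometheus estimate of about 50 GB as given:

$$
\begin{aligned}
\text{Loki} &= 10\,\text{GB/day} \times 30\,\text{days} \times 0.10 = 30\,\text{GB} \\
\text{Buffer} &= 0.20 \times (50 + 30 + 0.5)\,\text{GB} \approx 16\,\text{GB} \\
\text{Sum} &= (50 + 30 + 0.5 + 16)\,\text{GB} \approx 97\,\text{GB}
\end{aligned}
$$

The 200 GB recommendation therefore leaves roughly a factor of two as headroom for additional hosts and growing log volume.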

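Important: --storage.tsdb.retention.size in Prometheus should be about 60 % of the available disk size, so that you keep a buffer for the WAL, compaction and unexpected load spikes; on a 200 GB data disk that is the 120GB set in the compose file. Limit Loki analogously via retention_period in its config, and note that Loki only deletes old chunks when retention is also enabled in the compactor.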
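On the Loki side, a 30-day limit could be sketched as follows. This is a fragment of loki-config.yml rather than a complete configuration; the working directory is an assumption based on the /loki volume from the compose file, and the exact option set should be checked against the documentation of the Loki version in use.

```yaml
# Fragment of loki-config.yml, not a complete configuration
compactor:
  working_directory: /loki/compactor   # assumed path on the /loki data volume
  retention_enabled: true              # without this, retention_period is not enforced
  delete_request_store: filesystem     # Loki 3.x expects this when retention is enabled
limits_config:
  retention_period: 720h               # 30 days
```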

Grafana with provisioning — never click-config again

The biggest win only kicks in once you roll out datasources and dashboards via file provisioning. This makes the setup reproducible, versionable in Git and disaster-recovery ready. Under ./grafana/provisioning/datasources/datasources.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100

Dashboards are placed as JSON files under ./grafana/provisioning/dashboards/. For a quick start, the ready-made dashboards from grafana.com with IDs 1860 (Node Exporter Full), 10242 (SNMP Interface Detail) and 14055 (Loki Logs) work well. However, you should adapt them to your label schema: in our experience, generic dashboards cover only about 70 % of what you need out of the box.
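Grafana only loads JSON files from that directory if a dashboard provider points at it. A minimal sketch of such a provider file, where the file name, the provider name and the 30-second interval are assumptions for the example:

```yaml
# ./grafana/provisioning/dashboards/provider.yml (hypothetical file name)
apiVersion: 1
providers:
  - name: default                    # hypothetical provider name
    type: file
    disableDeletion: false
    allowUiUpdates: false            # keep Git as the single source of truth
    updateIntervalSeconds: 30        # how often Grafana rescans the directory
    options:
      # Container path of the bind mount from the compose file
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: true
```

With allowUiUpdates set to false, changes made in the browser cannot be saved over the provisioned version, which enforces the "never click-config again" approach.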

For alerting we use Grafana Unified Alerting with contact points to email and Microsoft Teams. The keys are sensible hysteresis thresholds and for: 10m clauses, otherwise the stack will flood your inbox.

Backup strategy

The entire stack lives under two directories: /opt/observability/ (config, in Git) and /var/lib/observability/ (data). For backup a nightly restic job on the data disk is enough, plus a VM snapshot via Proxmox Backup. Recommendation: keep the repository of compose and config files in an internal Git so that a bare-metal rebuild is done in under 30 minutes. Anyone who already has a backup workflow for their core infrastructure can simply slot in the monitoring VM.

Conclusion

A self-hosted monitoring stack with Grafana, Prometheus and Loki is no longer a hobby project in 2026 but a production-ready alternative to commercial SaaS offerings. With roughly 4 vCPU, 8 GB RAM and 200 GB of storage you cover a typical SMB setup with 20 to 50 hosts including 30 days of retention. The levers are a clean compose layout, file-based service discovery, provisioning all Grafana content and a disciplined backup strategy. Anyone who additionally feeds in container logs from a Kubernetes cluster or TrueNAS audits gets a complete observability platform for a fraction of the running cost of a SaaS subscription.

DATAZONE supports you with planning, building and operating your monitoring stack — from initial VM sizing through defining sensible alert rules to dashboard tuning for your specific infrastructure. Talk to us, we bring experience from dozens of Linux, Proxmox and TrueNAS environments. Get in touch.
