TrueNAS SMART Monitoring in Grafana: Disk Health on a Dashboard

Hard drives often announce their failure in advance through rising reallocated-sector counts, pending sectors and temperatures. TrueNAS collects all this SMART data reliably, but the web UI only shows snapshots. To detect an impending disk failure early, you need time series, dashboards and alerts.

That is exactly what Prometheus and Grafana deliver. In this article we show how to export SMART data from TrueNAS, scrape it with Prometheus, visualize it in Grafana and configure alerts on pre-failure attributes — so you can plan disk replacements instead of reacting to emergencies.

Why SMART Monitoring on a Dashboard?

The TrueNAS web UI shows SMART values as a per-disk table — a snapshot without history. Three problems arise:

  • No trends visible: Whether Reallocated_Sector_Ct has been slowly rising for three weeks is invisible. You only see the current value.
  • No pool-wide overview: With 24 disks per pool, manually inspecting each drive is unrealistic.
  • No alerts on changes: TrueNAS only warns when a SMART test fails — not when pre-fail values are getting worse.

A Grafana dashboard solves all three: temperature curves over 30 days, reallocated-sector trends at pool level, power-on hours per disk, and alerts the moment a critical attribute crosses a threshold. You see not only that a disk has problems right now, but also when the trend reversed.

Architecture: From TrueNAS to a Grafana Panel

The data path consists of four components:

+--------------+       +--------------+       +------------+       +---------+
|  TrueNAS     | --->  |  Exporter    |  <--- | Prometheus | --->  | Grafana |
|  smartctl    |       | (netdata     |       | Scrape +   |       | Panels  |
|  /dev/sda    |       |  or          |       | TSDB       |       | Alerts  |
|              |       |  textfile)   |       |            |       |         |
+--------------+       +--------------+       +------------+       +---------+

Two exporter variants have proven themselves in practice:

  • Netdata as collector: Netdata runs as an app on TrueNAS SCALE, collects SMART data plus dozens of system metrics and exposes them in Prometheus format at /api/v1/allmetrics?format=prometheus. Low setup effort, many out-of-the-box metrics.
  • Textfile exporter: A cron job invokes smartctl, writes the values to a .prom file, and the node exporter reads it. Maximum control over the exported fields, ideal for dedicated SMART dashboards.

For small and medium-sized business environments we recommend Netdata because the setup overhead is minimal. In larger setups with dozens of disks and tailored alerts, the textfile approach is often the better fit.

Variant A: Netdata on TrueNAS SCALE 25.10

On TrueNAS SCALE you install Netdata from the apps catalog. The Prometheus endpoint is then immediately reachable:

curl -s 'http://truenas.lan:19999/api/v1/allmetrics?format=prometheus' | grep -i smart

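The output contains metrics similar to the following (the exact metric and label names depend on the Netdata version and the SMART collector in use):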

smart_log_attribute_value{device="sda",attribute="reallocated_sector_ct"} 0
smart_log_attribute_value{device="sda",attribute="current_pending_sector"} 0
smart_log_attribute_raw{device="sda",attribute="temperature_celsius"} 38
smart_log_attribute_raw{device="sda",attribute="power_on_hours"} 18432
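
To check programmatically what the endpoint returns, the text format can be parsed line by line. The following is a minimal sketch, assuming the simple `name{labels} value` line shape shown above; `parse_samples` and the embedded sample text are illustrative names, not part of Netdata or Prometheus.

```python
import re

# One sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Two lines copied from the example output above.
SAMPLE_OUTPUT = """\
smart_log_attribute_value{device="sda",attribute="reallocated_sector_ct"} 0
smart_log_attribute_raw{device="sda",attribute="temperature_celsius"} 38
"""

for name, labels, value in parse_samples(SAMPLE_OUTPUT):
    print(name, labels["device"], labels["attribute"], value)
```

For anything beyond a quick check, a maintained parser is the safer choice, since this sketch ignores escaped label values and optional timestamps.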

In the Prometheus configuration you add a job:

scrape_configs:
  - job_name: 'truenas-netdata'
    metrics_path: /api/v1/allmetrics
    params:
      format: ['prometheus']
    scrape_interval: 60s
    static_configs:
      - targets: ['truenas.lan:19999']
        labels:
          host: 'truenas-prod'

After a Prometheus reload, the metrics appear in the browser at http://prometheus.lan:9090/graph.
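
Besides the browser, the same data can be read through the Prometheus HTTP API at `/api/v1/query`. The sketch below only builds the query URL and decodes a response. The sample response is hand-written in the instant-vector shape, and the function names are illustrative.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(response_body):
    """Map each returned series to a (labels, float value) tuple."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [(item["metric"], float(item["value"][1]))
            for item in payload["data"]["result"]]


# Hand-written sample response, not real measurement data.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"device": "sda", "host": "truenas-prod"},
         "value": [1700000000, "38"]},
    ]},
})

url = build_query_url(
    "http://prometheus.lan:9090",
    'smart_log_attribute_raw{attribute="temperature_celsius"}',
)
print(url)
print(extract_values(SAMPLE_RESPONSE))
```

In a live setup, the body would come from `urllib.request.urlopen(url).read()` instead of the sample string.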

Variant B: The smartmon Textfile Exporter

If you already run node exporter, you can use the smartmon.sh collector from the prometheus-community textfile collector scripts. On TrueNAS SCALE (Debian-based), you install the script once and create a cron job:

# install /usr/local/sbin/smartmon.sh (simplified)
# cron does not support backslash line continuation, so the entry must stay on one line
cat > /etc/cron.d/smartmon <<'EOF'
*/5 * * * * root /usr/local/sbin/smartmon.sh > /var/lib/node_exporter/textfile/smartmon.prom.$$ && mv /var/lib/node_exporter/textfile/smartmon.prom.$$ /var/lib/node_exporter/textfile/smartmon.prom
EOF
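
The cron entry writes to a temporary file and renames it afterwards, so node exporter never reads a half-written file. If the collector were written in Python instead, the same pattern could look roughly like this; `write_prom_atomically` is a hypothetical helper, and the demo uses a throwaway directory rather than the real textfile directory.

```python
import os
import tempfile


def write_prom_atomically(directory, filename, content):
    """Write content to directory/filename so readers never see a partial file."""
    # The temp file must live in the same directory (same filesystem),
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=filename + ".", suffix=".tmp")
    target = os.path.join(directory, filename)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        # mkstemp creates the file with mode 0600; node exporter may run as another user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        path = write_prom_atomically(demo_dir, "smartmon.prom", "smartmon_demo 1\n")
        with open(path) as handle:
            print(handle.read(), end="")
```

The temporary name deliberately does not end in `.prom`, because the textfile collector only reads files with that suffix.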

Node exporter is started with --collector.textfile.directory=/var/lib/node_exporter/textfile and then delivers metrics like:

smartmon_attr_raw_value{disk="/dev/sda",attribute_name="Reallocated_Sector_Ct"} 0
smartmon_attr_raw_value{disk="/dev/sda",attribute_name="Current_Pending_Sector"} 0
smartmon_attr_raw_value{disk="/dev/sda",attribute_name="Temperature_Celsius"} 38
smartmon_device_smart_healthy{disk="/dev/sda",model="WDC WD80EFAX"} 1

The advantage: you control which attributes get exported and can add labels like model, serial number or pool per disk.
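
As an illustration of that control, the sketch below renders selected attributes from a `smartctl --json` report into textfile lines and attaches model, serial number and pool as labels. The report is an abbreviated, hand-written sample, the `pool` value is supplied by the caller, and the allow-list and function names are assumptions for this example rather than part of smartmon.sh.

```python
# Abbreviated, hand-written sample in the shape of `smartctl --json -a /dev/sda` output.
SAMPLE_REPORT = {
    "device": {"name": "/dev/sda"},
    "model_name": "WDC WD80EFAX",
    "serial_number": "EXAMPLE123",
    "smart_status": {"passed": True},
    "ata_smart_attributes": {"table": [
        {"id": 5, "name": "Reallocated_Sector_Ct", "value": 200, "raw": {"value": 0}},
        {"id": 194, "name": "Temperature_Celsius", "value": 112, "raw": {"value": 38}},
        {"id": 9, "name": "Power_On_Hours", "value": 75, "raw": {"value": 18432}},
    ]},
}

EXPORTED = {"Reallocated_Sector_Ct", "Temperature_Celsius"}  # allow-list of attributes


def escape(value):
    """Escape a label value for the Prometheus text format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value):
    """Format one metric line: name{key="value",...} value"""
    body = ",".join(f'{key}="{escape(val)}"' for key, val in labels.items())
    return f"{name}{{{body}}} {value}"


def render(report, pool):
    """Render the health flag and the allow-listed attributes of one report."""
    base = {"disk": report["device"]["name"], "model": report["model_name"],
            "serial": report["serial_number"], "pool": pool}
    healthy = int(report["smart_status"]["passed"])
    lines = [format_sample("smartmon_device_smart_healthy", base, healthy)]
    for attr in report["ata_smart_attributes"]["table"]:
        if attr["name"] in EXPORTED:
            labels = {**base, "attribute_name": attr["name"]}
            lines.append(format_sample("smartmon_attr_raw_value", labels, attr["raw"]["value"]))
    return "\n".join(lines) + "\n"


print(render(SAMPLE_REPORT, pool="tank"), end="")
```

On some drives the raw temperature value packs minimum and maximum readings into one number; the top-level `temperature.current` field of the JSON report is usually the cleaner source in that case.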

The Most Important SMART Metrics for the Dashboard

Not every SMART value is relevant. Focus on the pre-fail attributes that field data links to actual failures (the Backblaze drive statistics serve as a reference), plus a few context values:

Attribute               ID   What it shows                        Grafana panel
----------------------  ---  -----------------------------------  ---------------------------------
Reallocated_Sector_Ct   5    Defective, replaced sectors          Stat + time series, highlight > 0
Current_Pending_Sector  197  Unstable sectors, not yet replaced   Stat, > 0 = alert
Offline_Uncorrectable   198  Uncorrectable sectors                Stat, > 0 = alert
UDMA_CRC_Error_Count    199  Cable or controller errors           Time series, watch slope
Temperature_Celsius     194  Current disk temperature             Heatmap, warn at > 45 C
Power_On_Hours          9    Operating hours                      Stat, lifecycle context
Wear_Leveling_Count     173  SSD wear                             Gauge, for NVMe/SSD pools

A good dashboard combines pool overview (count of disks with pending sectors > 0), per-disk detail panels (temperature curve, reallocation trend) and a top-N view (e.g. “5 hottest disks”).
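
The two aggregate views are simple reductions over the latest reading per disk. In Grafana they would typically be PromQL queries along the lines of `count(... > 0)` and `topk(5, ...)`. The sketch below shows the same logic in plain Python on made-up readings, so the field names and values are purely illustrative.

```python
# Hypothetical latest raw readings per disk, not real measurements.
READINGS = [
    {"disk": "/dev/sda", "pending": 0, "temp_c": 38},
    {"disk": "/dev/sdb", "pending": 8, "temp_c": 47},
    {"disk": "/dev/sdc", "pending": 0, "temp_c": 41},
    {"disk": "/dev/sdd", "pending": 0, "temp_c": 52},
]


def count_disks_with_pending(readings):
    """Pool overview: number of disks reporting at least one pending sector."""
    return sum(1 for reading in readings if reading["pending"] > 0)


def hottest(readings, n):
    """Top-N view: the n disks with the highest temperature, hottest first."""
    return sorted(readings, key=lambda reading: reading["temp_c"], reverse=True)[:n]


print(count_disks_with_pending(READINGS))
print([reading["disk"] for reading in hottest(READINGS, 2)])
```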

Alerting on Pre-Failure Attributes

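Prometheus alerting rules belong in a separate rules.yml that prometheus.yml references under rule_files. The three rules below use the smartmon metric names from Variant B and cover the most important cases; with Netdata, adapt the metric and label names accordingly: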

groups:
  - name: truenas-smart
    interval: 60s
    rules:
      - alert: SmartPendingSectorsDetected
        expr: smartmon_attr_raw_value{attribute_name="Current_Pending_Sector"} > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Pending sectors on {{ $labels.disk }} -- plan disk replacement"

      - alert: SmartReallocatedSectorsRising
        expr: increase(smartmon_attr_raw_value{attribute_name="Reallocated_Sector_Ct"}[24h]) > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Reallocated sectors rising on {{ $labels.disk }}"

      - alert: SmartDiskTemperatureHigh
        expr: smartmon_attr_raw_value{attribute_name="Temperature_Celsius"} > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk temperature {{ $value }} C on {{ $labels.disk }}"

The trend rule with increase(...[24h]) > 0 is key: it fires on any deterioration — even a reallocation from 0 to 1. That way you catch the start of degradation, not just the late stage.
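
The idea behind the trend rule can be sketched in a few lines: compare the first and last sample inside the window and flag any growth. This is a deliberate simplification, because the real `increase()` function also extrapolates to the window edges and compensates for counter resets. The function name and sample history are made up for illustration.

```python
def rose_within_window(samples, now, window_seconds):
    """Return True if the value grew between the first and last sample in the window.

    samples: list of (unix_timestamp, value) tuples sorted by timestamp.
    """
    in_window = [value for ts, value in samples if now - window_seconds <= ts <= now]
    if len(in_window) < 2:  # like increase(), a single sample yields no result
        return False
    return in_window[-1] - in_window[0] > 0


HOUR = 3600
DAY = 24 * HOUR
# Made-up history: the reallocated-sector count goes from 0 to 1 after 12 hours.
history = [(0, 0), (6 * HOUR, 0), (12 * HOUR, 1), (18 * HOUR, 1), (30 * HOUR, 1)]

print(rose_within_window(history, now=18 * HOUR, window_seconds=DAY))
print(rose_within_window(history, now=40 * HOUR, window_seconds=DAY))
```

The second call returns False because the step from 0 to 1 has left the 24-hour window. The warning therefore clears by itself after a day without further reallocations, while the absolute value remains visible on the dashboard.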

Real-World Workflow: From Alert to Disk Replacement

A typical lifecycle of an alert in a managed environment:

  1. Day 0: Grafana shows the first pending sectors on /dev/sdf. Prometheus fires an alert to Alertmanager.
  2. Day 0: Alert lands in our ticketing system via webhook, status: “disk under observation”.
  3. Days 1-3: Compare with pool status (zpool status), check whether ZFS already reports CKSUM errors, trigger a long SMART test.
  4. Days 3-5: Replacement disk is ordered, resilver window is scheduled with the customer.
  5. Days 5-7: Disk is replaced during operation, the pool resilvers automatically — with zero downtime.

Without monitoring, the failing disk would probably only have been noticed at the next zpool scrub, possibly together with a failure of the second disk in the mirror, which would have meant data loss.
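
Step 2 of the workflow, the webhook into a ticketing system, essentially maps the Alertmanager webhook payload to ticket fields. The following sketch assumes the standard payload shape with an `alerts` list. The ticket fields, the priority mapping and the sample payload are hypothetical and would need to match the actual ticketing API.

```python
import json

# Hand-written sample in the shape of an Alertmanager webhook payload.
SAMPLE_PAYLOAD = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "SmartPendingSectorsDetected",
                   "disk": "/dev/sdf", "severity": "critical"},
        "annotations": {"summary": "Pending sectors on /dev/sdf -- plan disk replacement"},
        "startsAt": "2025-01-01T08:00:00Z",
    }],
})


def tickets_from_webhook(body):
    """Turn each firing alert into a ticket dict (ticket fields are hypothetical)."""
    payload = json.loads(body)
    tickets = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts would update or close a ticket instead
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary")
        tickets.append({
            "title": summary or labels.get("alertname", "alert"),
            "disk": labels.get("disk"),
            "priority": "high" if labels.get("severity") == "critical" else "normal",
            "status": "disk under observation",
            "opened_at": alert.get("startsAt"),
        })
    return tickets


print(tickets_from_webhook(SAMPLE_PAYLOAD))
```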

Conclusion

SMART values are the most honest signal a hard drive emits, but only those who measure them continuously can use them. With Netdata or the smartmon exporter, Prometheus as the time-series database and Grafana as the dashboard, you build a disk-health platform that can surface deteriorating pre-failure attributes, often weeks before a failure.

The setup effort is manageable, the benefit measurable: fewer emergency call-outs, planned disk swaps, higher availability of your ZFS storage.


DATAZONE supports you in building a complete monitoring stack for your TrueNAS environment — from SMART exporters through Prometheus rules to Grafana dashboards and Alertmanager routing. We also operate the solution continuously as part of our Linux and storage managed services. Contact us for an initial consultation.
