Engineering

Prometheus Tail Monitor for Fluentd: Complete Setup Guide

Amjad Syed - Founder & CEO

Monitoring your log pipeline is as important as monitoring your applications. When Fluentd stops tailing logs or falls behind, you lose visibility into your systems. The prometheus_tail_monitor plugin solves this by exposing Fluentd’s tail input metrics to Prometheus.

After implementing log pipelines for dozens of Kubernetes clusters, we have learned that unmonitored logging infrastructure is a reliability risk. This guide covers everything you need to set up prometheus_tail_monitor in production.

What is prometheus_tail_monitor?

The fluent-plugin-prometheus gem includes several monitoring plugins for Fluentd. The prometheus_tail_monitor input plugin specifically tracks metrics from Fluentd’s in_tail plugin, which reads log files.

Key metrics exposed:

Metric                               Description
fluentd_tail_file_position           Current read position in bytes
fluentd_tail_file_inode              Inode of the file being tailed
fluentd_tail_file_closed             Whether the file handle is closed
fluentd_tail_file_rotation_count     Number of file rotations detected

These metrics tell you whether Fluentd is keeping up with log production or falling behind.
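
Scraped from the /metrics endpoint, they look roughly like this; the label set shown is illustrative and varies by plugin version:

fluentd_tail_file_position{plugin_id="container_logs",type="tail",path="/var/log/containers/app.log"} 1.048576e+06
fluentd_tail_file_inode{plugin_id="container_logs",type="tail",path="/var/log/containers/app.log"} 5242911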

Why Monitor Your Tail Inputs?

Without monitoring, you discover logging problems when you need the logs most, usually during an incident.

Common issues prometheus_tail_monitor helps detect:

  • Log lag - Fluentd falling behind on high-volume logs
  • Missing files - Log files that disappeared or were never created
  • Rotation issues - Problems detecting file rotations
  • Permission errors - Fluentd unable to read files
  • Stuck positions - Fluentd stopped reading without error

We see these issues regularly when auditing observability setups. Proactive monitoring prevents the “where are my logs?” panic during incidents. This fits into Layer 10 of our production Kubernetes monitoring framework.

Installation

Install the Plugin

Add the Prometheus plugin to your Fluentd installation:

# For td-agent
td-agent-gem install fluent-plugin-prometheus

# For standalone Fluentd
fluent-gem install fluent-plugin-prometheus

For Docker-based deployments, add to your Dockerfile:

FROM fluent/fluentd:v1.16-debian

USER root
RUN gem install fluent-plugin-prometheus
USER fluent

For Kubernetes deployments using the Fluentd Helm chart, add to your values:

# values.yaml
plugins:
  - fluent-plugin-prometheus

Verify Installation

Check that the plugin is loaded:

fluent-gem list | grep prometheus
# fluent-plugin-prometheus (2.1.0)

Configuration

Basic Setup

Add three components to your Fluentd configuration:

  1. Prometheus metrics endpoint - Exposes metrics via HTTP
  2. Prometheus monitor agent - Tracks Fluentd internals
  3. Prometheus tail monitor - Tracks tail input metrics

# fluent.conf

# Expose Prometheus metrics on port 24231
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

# Monitor Fluentd internal metrics
<source>
  @type prometheus_monitor
  <labels>
    host ${hostname}
  </labels>
</source>

# Monitor tail input plugins
<source>
  @type prometheus_tail_monitor
  <labels>
    host ${hostname}
  </labels>
</source>

# Your existing tail inputs
<source>
  @type tail
  @id application_logs
  path /var/log/application/*.log
  pos_file /var/log/fluentd/application.pos
  tag application
  <parse>
    @type json
  </parse>
</source>

<source>
  @type tail
  @id nginx_access_logs
  path /var/log/nginx/access.log
  pos_file /var/log/fluentd/nginx-access.pos
  tag nginx.access
  <parse>
    @type nginx
  </parse>
</source>

The @id directive is important. It becomes the plugin_id label on your metrics, allowing you to identify which tail input has problems.
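
With @id set, you can scope queries to a single input. For example, a PromQL query for the read rate of one of the tail inputs defined above:

# Bytes per second read by the nginx access log input
rate(fluentd_tail_file_position{plugin_id="nginx_access_logs"}[5m])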

Kubernetes DaemonSet Configuration

For Kubernetes deployments, configure Fluentd as a DaemonSet with Prometheus monitoring:

# fluentd-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "24231"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-prometheus
          ports:
            - containerPort: 24231
              name: prometheus
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluentd/etc
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluentd-config

The annotations let Prometheus service discovery find and scrape these pods automatically.

ConfigMap for Kubernetes

# fluentd-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    # Prometheus metrics endpoint
    <source>
      @type prometheus
      bind 0.0.0.0
      port 24231
    </source>

    <source>
      @type prometheus_monitor
    </source>

    <source>
      @type prometheus_tail_monitor
    </source>

    # Tail container logs
    <source>
      @type tail
      @id container_logs
      path /var/log/containers/*.log
      pos_file /var/log/fluentd/containers.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type cri
      </parse>
    </source>

    # Add Kubernetes metadata
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>

    # Forward to your destination
    <match **>
      @type forward
      <server>
        host log-aggregator.logging.svc
        port 24224
      </server>
    </match>
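
After applying the DaemonSet and ConfigMap, it is worth spot-checking the endpoint before wiring up Prometheus. A quick check using a port-forward to one of the pods:

kubectl -n logging port-forward ds/fluentd 24231:24231
# In another terminal
curl -s http://localhost:24231/metrics | grep fluentd_tail_file_position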

Prometheus Scrape Configuration

Configure Prometheus to scrape Fluentd metrics.

Static Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'fluentd'
    static_configs:
      - targets:
          - 'fluentd-1:24231'
          - 'fluentd-2:24231'
          - 'fluentd-3:24231'

Kubernetes Service Discovery

For Kubernetes, use pod discovery with annotations:

# prometheus.yml
scrape_configs:
  - job_name: 'fluentd'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the prometheus.io/port annotation to set the scrape port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Filter to fluentd pods only
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: fluentd
      # Add pod name as label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

This configuration automatically discovers Fluentd pods and scrapes their metrics. See our complete Prometheus Kubernetes guide for more service discovery patterns.
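
Before reloading Prometheus, validating the configuration catches relabeling typos early:

promtool check config prometheus.yml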

Understanding the Metrics

fluentd_tail_file_position

The current byte position in the file being read.

# Current position for each tail input
fluentd_tail_file_position{plugin_id="container_logs"}

A position that stops increasing while the file grows indicates Fluentd is stuck.

Calculating Log Lag

Detecting lag directly means comparing the file position to the file's current size. Neither Prometheus nor node_exporter exposes per-file sizes out of the box, so a byte-level lag metric needs a custom exporter (see the sketch after the next query).

For a simpler approach, alert on position not changing:

# Rate of position change (should be > 0 for active logs)
rate(fluentd_tail_file_position[5m]) == 0
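
If you do need byte-level lag, one option is a small script for the node_exporter textfile collector. This is a minimal sketch; the log_file_size_bytes metric name, file glob, and output directory are assumptions to adapt:

#!/bin/sh
# Publish the size of each tailed log file for the textfile collector
out=/var/lib/node_exporter/textfile_collector/log_file_size.prom
{
  for f in /var/log/containers/*.log; do
    [ -f "$f" ] && printf 'log_file_size_bytes{path="%s"} %s\n' "$f" "$(stat -c %s "$f")"
  done
} > "${out}.tmp" && mv "${out}.tmp" "$out"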

fluentd_tail_file_rotation_count

Tracks file rotations detected by Fluentd.

# Rotation events per hour
increase(fluentd_tail_file_rotation_count[1h])

Excessive rotations may indicate log rotation misconfiguration.
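
A sensible alert threshold here depends entirely on your logrotate policy; as a rough sketch, flagging more than roughly one rotation every five minutes:

# More rotations than expected in the last hour (tune the threshold)
increase(fluentd_tail_file_rotation_count[1h]) > 12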

Buffer Metrics

The prometheus_monitor plugin exposes buffer metrics:

# Buffer queue length
fluentd_output_status_buffer_queue_length

# Retry count (indicates delivery issues)
fluentd_output_status_retry_count

# Buffer total bytes
fluentd_output_status_buffer_total_bytes

High buffer queue length means Fluentd cannot forward logs fast enough.

Alerting Rules

Create alerts for common failure scenarios:

# fluentd-alerts.yaml
groups:
  - name: fluentd
    rules:
      # Fluentd not reading logs
      - alert: FluentdTailNotReading
        expr: rate(fluentd_tail_file_position[10m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Fluentd tail input {{ $labels.plugin_id }} stopped reading"
          description: "File position has not changed in 15 minutes. Check if logs are being written or if Fluentd is stuck."

      # Fluentd buffer backing up
      - alert: FluentdBufferQueueHigh
        expr: fluentd_output_status_buffer_queue_length > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Fluentd buffer queue is backing up"
          description: "Buffer queue length is {{ $value }}. Check downstream connectivity."

      # Fluentd retrying frequently
      - alert: FluentdHighRetryRate
        expr: rate(fluentd_output_status_retry_count[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Fluentd experiencing high retry rate"
          description: "Output plugin {{ $labels.plugin_id }} is retrying frequently. Check destination availability."

      # Fluentd pod down
      - alert: FluentdDown
        expr: up{job="fluentd"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Fluentd instance {{ $labels.pod }} is down"
          description: "Prometheus cannot scrape Fluentd metrics. Check pod status."

These alerts integrate with Alertmanager for routing to Slack, PagerDuty, or email.
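
A minimal Alertmanager route for these alerts might look like the following sketch; the Slack webhook URL, channel, and receiver names are placeholders:

# alertmanager.yml
route:
  receiver: default
  routes:
    - matchers:
        - 'alertname =~ "Fluentd.*"'
      receiver: logging-team
receivers:
  - name: default
  - name: logging-team
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        channel: "#logging-alerts"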

Grafana Dashboard

Create a dashboard to visualize Fluentd health.

Key Panels

1. Tail Input Position Rate

rate(fluentd_tail_file_position[5m])

Shows bytes/second being read from each log file.

2. Buffer Queue Length

fluentd_output_status_buffer_queue_length

Should stay low. Spikes indicate backpressure.

3. Retry Count

increase(fluentd_output_status_retry_count[1h])

Non-zero indicates delivery problems.

4. Fluentd Memory Usage

process_resident_memory_bytes{job="fluentd"}

Monitor for memory leaks.
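
For slow leaks rather than sudden spikes, a projection query helps; the 6-hour window and 24-hour horizon below are arbitrary starting points:

# Projected resident memory 24 hours out, based on the last 6 hours
predict_linear(process_resident_memory_bytes{job="fluentd"}[6h], 86400)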

Dashboard JSON

Import this dashboard into Grafana:

{
  "title": "Fluentd Tail Monitor",
  "panels": [
    {
      "title": "Log Read Rate by Input",
      "type": "timeseries",
      "targets": [
        {
          "expr": "rate(fluentd_tail_file_position[5m])",
          "legendFormat": "{{ plugin_id }}"
        }
      ]
    },
    {
      "title": "Buffer Queue Length",
      "type": "timeseries",
      "targets": [
        {
          "expr": "fluentd_output_status_buffer_queue_length",
          "legendFormat": "{{ plugin_id }}"
        }
      ]
    },
    {
      "title": "Output Retries",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(increase(fluentd_output_status_retry_count[1h]))"
        }
      ]
    }
  ]
}

The Grafana community has pre-built Fluentd dashboards you can import and customize.

Production Best Practices

Use Meaningful Plugin IDs

Always set @id on tail inputs:

<source>
  @type tail
  @id nginx_access_logs  # This becomes a metric label
  path /var/log/nginx/access.log
  # ...
</source>

Without an explicit @id, Fluentd generates an opaque identifier for the plugin, and metrics become much harder to attribute to a specific input.

Monitor Position File Health

Position files track where Fluentd left off. If a pos_file is corrupted or unwritable, Fluentd may re-read or skip logs. The plugin does not expose a position-file timestamp directly, so watch for a position that never changes even though the source file is active:

# Alert when a tail input's position has not changed in an hour
- alert: FluentdPositionStale
  expr: changes(fluentd_tail_file_position[1h]) == 0
  for: 5m
  labels:
    severity: warning

Also make sure the pos_file directory exists and is writable by the Fluentd user.

Set Resource Limits

Fluentd can consume significant memory with large buffers:

# Kubernetes resource limits
resources:
  limits:
    memory: 512Mi
    cpu: 500m
  requests:
    memory: 256Mi
    cpu: 100m

Monitor actual usage and adjust based on your log volume.
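
To compare these requests and limits against reality, a quick query assuming cAdvisor metrics are scraped (container and namespace names match the DaemonSet above):

# Peak working-set memory per Fluentd pod
max by (pod) (container_memory_working_set_bytes{namespace="logging", container="fluentd"})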

Buffer Tuning

Configure buffers to handle spikes without running out of memory:

<match **>
  @type forward
  <buffer>
    @type file
    path /var/log/fluentd/buffer
    chunk_limit_size 8MB
    total_limit_size 2GB
    flush_interval 5s
    retry_max_interval 30s
    retry_forever true
  </buffer>
</match>

File-based buffers survive restarts. Memory buffers are faster but lost on crash.
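
It also pays to alert before the buffer reaches total_limit_size; with the 2GB limit above, roughly 80% is a reasonable starting point (adjust to your configured limit):

# File buffer approaching its configured limit
fluentd_output_status_buffer_total_bytes > 1.6e+09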

Separate Metrics Port from Log Port

Keep the Prometheus metrics port (24231) separate from log ingestion ports (24224):

# Metrics - expose to Prometheus
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
</source>

# Log ingestion - internal only
<source>
  @type forward
  bind 0.0.0.0
  port 24224
</source>

This allows different network policies for observability vs. data ingestion.
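
In Kubernetes this split maps naturally onto a NetworkPolicy. A sketch, assuming your Prometheus runs in a namespace named monitoring:

# networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fluentd-ingress
  namespace: logging
spec:
  podSelector:
    matchLabels:
      app: fluentd
  policyTypes:
    - Ingress
  ingress:
    # Prometheus may scrape the metrics port
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 24231
    # Any in-cluster workload may forward logs
    - from:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 24224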

Troubleshooting

Metrics Not Appearing

  1. Check plugin installation:

    fluent-gem list | grep prometheus
  2. Verify configuration syntax:

    fluentd --dry-run -c /etc/fluentd/fluent.conf
  3. Check metrics endpoint:

    curl http://localhost:24231/metrics | grep fluentd_tail
  4. Check Fluentd logs:

    kubectl logs -n logging -l app=fluentd

Position Not Increasing

If fluentd_tail_file_position is static:

  1. Check if logs are being written:

    tail -f /var/log/application.log
  2. Check file permissions:

    ls -la /var/log/application.log
  3. Verify pos_file is writable:

    ls -la /var/log/fluentd/
  4. Check for parsing errors in Fluentd logs:

    grep -i error /var/log/fluentd/fluentd.log

High Buffer Queue

If fluentd_output_status_buffer_queue_length is growing:

  1. Check downstream connectivity:

    nc -zv log-aggregator.logging.svc 24224
  2. Increase flush workers:

    <buffer>
      flush_thread_count 4
    </buffer>
  3. Check for rate limiting at destination

Integration with Log Management

The metrics from prometheus_tail_monitor complement your log management solution. Use them to:

  • Verify logs are flowing before investigating missing data
  • Correlate log gaps with infrastructure issues
  • Plan capacity based on log volume trends
  • Alert before log storage fills up

For comprehensive observability, combine with OpenTelemetry for distributed tracing to correlate logs, metrics, and traces.

Summary

Monitoring your log pipeline with prometheus_tail_monitor provides visibility into a critical piece of infrastructure. The setup is straightforward:

  1. Install fluent-plugin-prometheus
  2. Add prometheus, prometheus_monitor, and prometheus_tail_monitor sources
  3. Configure Prometheus to scrape Fluentd
  4. Create alerts for stuck positions and buffer issues
  5. Build dashboards for operational visibility

This ensures you know about logging problems before you need the logs.


Need Help With Log Monitoring?

We implement production logging and monitoring pipelines for organizations running Kubernetes. Our Prometheus consulting services include Fluentd integration, alerting, and dashboard development.

Book a free 30-minute consultation to discuss your observability needs.
