Monitoring your log pipeline is as important as monitoring your applications. When Fluentd stops tailing logs or falls behind, you lose visibility into your systems. The prometheus_tail_monitor plugin solves this by exposing Fluentd’s tail input metrics to Prometheus.
After implementing log pipelines for dozens of Kubernetes clusters, we have learned that unmonitored logging infrastructure is a reliability risk. This guide covers everything you need to set up prometheus_tail_monitor in production.
What is prometheus_tail_monitor?
The fluent-plugin-prometheus gem includes several monitoring plugins for Fluentd. The prometheus_tail_monitor input plugin specifically tracks metrics from Fluentd’s in_tail plugin, which reads log files.
Key metrics exposed:
| Metric | Description |
|---|---|
| fluentd_tail_file_position | Current read position in bytes |
| fluentd_tail_file_inode | Inode of the file being tailed |
| fluentd_tail_file_closed | Whether the file handle is closed |
| fluentd_tail_file_rotation_count | Number of file rotations detected |
These metrics tell you whether Fluentd is keeping up with log production or falling behind.
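On the /metrics endpoint these appear as plain Prometheus gauges. The sample below is illustrative; the exact label set depends on the plugin version and on any <labels> you configure (the host label here comes from the configuration shown later in this guide):
fluentd_tail_file_position{plugin_id="application_logs",host="node-1"} 1048576
fluentd_tail_file_inode{plugin_id="application_logs",host="node-1"} 9175041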
Why Monitor Your Tail Inputs?
Without monitoring, you discover logging problems when you need the logs most, usually during an incident.
Common issues prometheus_tail_monitor helps detect:
- Log lag - Fluentd falling behind on high-volume logs
- Missing files - Log files that disappeared or were never created
- Rotation issues - Problems detecting file rotations
- Permission errors - Fluentd unable to read files
- Stuck positions - Fluentd stopped reading without error
We see these issues regularly when auditing observability setups. Proactive monitoring prevents the “where are my logs?” panic during incidents. This fits into Layer 10 of our production Kubernetes monitoring framework.
Installation
Install the Plugin
Add the Prometheus plugin to your Fluentd installation:
# For td-agent
td-agent-gem install fluent-plugin-prometheus
# For standalone Fluentd
fluent-gem install fluent-plugin-prometheus
For Docker-based deployments, add to your Dockerfile:
FROM fluent/fluentd:v1.16-debian
USER root
RUN gem install fluent-plugin-prometheus
USER fluent
For Kubernetes deployments using the Fluentd Helm chart, add to your values:
# values.yaml
plugins:
- fluent-plugin-prometheus
Verify Installation
Check that the plugin is loaded:
fluent-gem list | grep prometheus
# fluent-plugin-prometheus (2.1.0)
Configuration
Basic Setup
Add three components to your Fluentd configuration:
- Prometheus metrics endpoint - Exposes metrics via HTTP
- Prometheus monitor agent - Tracks Fluentd internals
- Prometheus tail monitor - Tracks tail input metrics
# fluent.conf
# Expose Prometheus metrics on port 24231
<source>
@type prometheus
bind 0.0.0.0
port 24231
metrics_path /metrics
</source>
# Monitor Fluentd internal metrics
<source>
@type prometheus_monitor
<labels>
host ${hostname}
</labels>
</source>
# Monitor tail input plugins
<source>
@type prometheus_tail_monitor
<labels>
host ${hostname}
</labels>
</source>
# Your existing tail inputs
<source>
@type tail
@id application_logs
path /var/log/application/*.log
pos_file /var/log/fluentd/application.pos
tag application
<parse>
@type json
</parse>
</source>
<source>
@type tail
@id nginx_access_logs
path /var/log/nginx/access.log
pos_file /var/log/fluentd/nginx-access.pos
tag nginx.access
<parse>
@type nginx
</parse>
</source>
The @id directive matters here: it becomes the plugin_id label on your metrics, so you can tell at a glance which tail input has problems.
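For example, once the inputs above are running, you can watch a single input by its ID (a query sketch; substitute your own @id value):
# Bytes per second read by the nginx access log input
rate(fluentd_tail_file_position{plugin_id="nginx_access_logs"}[5m])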
Kubernetes DaemonSet Configuration
For Kubernetes deployments, configure Fluentd as a DaemonSet with Prometheus monitoring:
# fluentd-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd
namespace: logging
spec:
selector:
matchLabels:
app: fluentd
template:
metadata:
labels:
app: fluentd
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "24231"
prometheus.io/path: "/metrics"
spec:
containers:
- name: fluentd
image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-prometheus
ports:
- containerPort: 24231
name: prometheus
volumeMounts:
- name: varlog
mountPath: /var/log
- name: config
mountPath: /fluentd/etc
volumes:
- name: varlog
hostPath:
path: /var/log
- name: config
configMap:
name: fluentd-config
The annotations enable automatic discovery by Prometheus service discovery.
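Before wiring up Prometheus, you can spot-check that a pod actually serves metrics. A quick sketch, assuming the app: fluentd label from the manifest above:
# Pick one Fluentd pod and check its metrics endpoint locally
POD=$(kubectl -n logging get pods -l app=fluentd -o jsonpath='{.items[0].metadata.name}')
kubectl -n logging port-forward "$POD" 24231:24231 &
curl -s http://localhost:24231/metrics | grep fluentd_tail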
ConfigMap for Kubernetes
# fluentd-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
namespace: logging
data:
fluent.conf: |
# Prometheus metrics endpoint
<source>
@type prometheus
bind 0.0.0.0
port 24231
</source>
<source>
@type prometheus_monitor
</source>
<source>
@type prometheus_tail_monitor
</source>
# Tail container logs
<source>
@type tail
@id container_logs
path /var/log/containers/*.log
pos_file /var/log/fluentd/containers.pos
tag kubernetes.*
read_from_head true
<parse>
@type cri
</parse>
</source>
# Add Kubernetes metadata
<filter kubernetes.**>
@type kubernetes_metadata
</filter>
# Forward to your destination
<match **>
@type forward
<server>
host log-aggregator.logging.svc
port 24224
</server>
</match>
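After editing the ConfigMap, the simplest way to make the pods pick up the new configuration is to restart the DaemonSet (resource names match the manifests above):
kubectl apply -f fluentd-configmap.yaml
kubectl -n logging rollout restart daemonset/fluentd
kubectl -n logging rollout status daemonset/fluentd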
Prometheus Scrape Configuration
Configure Prometheus to scrape Fluentd metrics.
Static Configuration
# prometheus.yml
scrape_configs:
- job_name: 'fluentd'
static_configs:
- targets:
- 'fluentd-1:24231'
- 'fluentd-2:24231'
- 'fluentd-3:24231'
Kubernetes Service Discovery
For Kubernetes, use pod discovery with annotations:
# prometheus.yml
scrape_configs:
- job_name: 'fluentd'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with prometheus.io/scrape=true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Build the scrape address from the pod IP and the prometheus.io/port annotation
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  action: replace
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $1:$2
  target_label: __address__
# Filter to fluentd pods only
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: fluentd
# Add pod name as label
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: pod
This configuration automatically discovers Fluentd pods and scrapes their metrics. See our complete Prometheus Kubernetes guide for more service discovery patterns.
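To confirm discovery is working, you can query the Prometheus targets API; the Prometheus address below is an assumption for your environment:
# List discovered Fluentd targets and their scrape health
curl -s http://prometheus.monitoring.svc:9090/api/v1/targets \
  | jq '.data.activeTargets[] | select(.labels.job == "fluentd") | {pod: .labels.pod, health: .health}'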
Understanding the Metrics
fluentd_tail_file_position
The current byte position in the file being read.
# Current position for each tail input
fluentd_tail_file_position{plugin_id="container_logs"}
A position that stops increasing while the file grows indicates Fluentd is stuck.
Calculating Log Lag
Comparing the file position to the file size would give lag in bytes, but node_exporter does not expose per-file sizes by default (node_filesystem_files_total counts inodes, not bytes). You would need to publish a file-size gauge yourself, for example through the node_exporter textfile collector:
# Log lag in bytes; log_file_size_bytes is a hypothetical gauge published via
# the textfile collector, and the label matching between the two series will
# need adjusting to your setup
log_file_size_bytes - fluentd_tail_file_position
For a simpler approach, alert on position not changing:
# Rate of position change (should be > 0 for active logs)
rate(fluentd_tail_file_position[5m]) == 0
fluentd_tail_file_rotation_count
Tracks file rotations detected by Fluentd.
# Rotation events per hour
increase(fluentd_tail_file_rotation_count[1h])
Excessive rotations may indicate log rotation misconfiguration.
Buffer Metrics
The prometheus_monitor plugin exposes buffer metrics:
# Buffer queue length
fluentd_output_status_buffer_queue_length
# Retry count (indicates delivery issues)
fluentd_output_status_retry_count
# Buffer total bytes
fluentd_output_status_buffer_total_bytes
High buffer queue length means Fluentd cannot forward logs fast enough.
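To distinguish sustained backpressure from a momentary spike, you can combine the queue length with the growth of buffered bytes; a sketch using the metrics above, with illustrative thresholds:
# Queue is non-trivial AND buffered bytes keep growing over 15 minutes
fluentd_output_status_buffer_queue_length > 20
and on(plugin_id)
deriv(fluentd_output_status_buffer_total_bytes[15m]) > 0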
Alerting Rules
Create alerts for common failure scenarios:
# fluentd-alerts.yaml
groups:
- name: fluentd
rules:
# Fluentd not reading logs
- alert: FluentdTailNotReading
expr: rate(fluentd_tail_file_position[10m]) == 0
for: 15m
labels:
severity: warning
annotations:
summary: "Fluentd tail input {{ $labels.plugin_id }} stopped reading"
description: "File position has not changed in 15 minutes. Check if logs are being written or if Fluentd is stuck."
# Fluentd buffer backing up
- alert: FluentdBufferQueueHigh
expr: fluentd_output_status_buffer_queue_length > 100
for: 10m
labels:
severity: warning
annotations:
summary: "Fluentd buffer queue is backing up"
description: "Buffer queue length is {{ $value }}. Check downstream connectivity."
# Fluentd retrying frequently
- alert: FluentdHighRetryRate
expr: rate(fluentd_output_status_retry_count[5m]) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "Fluentd experiencing high retry rate"
description: "Output plugin {{ $labels.plugin_id }} is retrying frequently. Check destination availability."
# Fluentd pod down
- alert: FluentdDown
expr: up{job="fluentd"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Fluentd instance {{ $labels.pod }} is down"
description: "Prometheus cannot scrape Fluentd metrics. Check pod status."
These alerts integrate with Alertmanager for routing to Slack, PagerDuty, or email.
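A minimal Alertmanager route for these alerts might look like the sketch below; the receiver names and Slack channel are placeholders, and slack_api_url is assumed to be set in the global section:
# alertmanager.yml (sketch)
route:
  receiver: default
  routes:
    - matchers:
        - alertname =~ "Fluentd.*"
      receiver: logging-oncall
receivers:
  - name: default
  - name: logging-oncall
    slack_configs:
      - channel: '#logging-alerts'
        send_resolved: true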
Grafana Dashboard
Create a dashboard to visualize Fluentd health.
Key Panels
1. Tail Input Position Rate
rate(fluentd_tail_file_position[5m])
Shows bytes/second being read from each log file.
2. Buffer Queue Length
fluentd_output_status_buffer_queue_length
Should stay low. Spikes indicate backpressure.
3. Retry Count
increase(fluentd_output_status_retry_count[1h])
Non-zero indicates delivery problems.
4. Fluentd Memory Usage
process_resident_memory_bytes{job="fluentd"}
Monitor for memory leaks.
Dashboard JSON
Import this dashboard into Grafana:
{
"title": "Fluentd Tail Monitor",
"panels": [
{
"title": "Log Read Rate by Input",
"type": "timeseries",
"targets": [
{
"expr": "rate(fluentd_tail_file_position[5m])",
"legendFormat": "{{ plugin_id }}"
}
]
},
{
"title": "Buffer Queue Length",
"type": "timeseries",
"targets": [
{
"expr": "fluentd_output_status_buffer_queue_length",
"legendFormat": "{{ plugin_id }}"
}
]
},
{
"title": "Output Retries",
"type": "stat",
"targets": [
{
"expr": "sum(increase(fluentd_output_status_retry_count[1h]))"
}
]
}
]
}
The Grafana community has pre-built Fluentd dashboards you can import and customize.
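If you manage Grafana declaratively, the JSON above can also be loaded through a file provisioning provider instead of manual import; paths here are illustrative:
# /etc/grafana/provisioning/dashboards/fluentd.yaml
apiVersion: 1
providers:
  - name: fluentd
    folder: Logging
    type: file
    options:
      path: /var/lib/grafana/dashboards/fluentd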
Production Best Practices
Use Meaningful Plugin IDs
Always set @id on tail inputs:
<source>
@type tail
@id nginx_access_logs # This becomes a metric label
path /var/log/nginx/access.log
# ...
</source>
Without an explicit @id, Fluentd falls back to an auto-generated plugin_id that changes on every restart, which makes the metrics much harder to correlate over time.
Monitor Position File Health
Position files track where Fluentd left off. If corrupted, Fluentd may re-read or skip logs.
# Alert on position file issues
- alert: FluentdPositionFileStale
expr: time() - fluentd_tail_file_position_last_update > 3600
for: 5m
labels:
severity: warning
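Position files are plain text, so a quick manual check is also possible. Each line holds the watched path, the current byte offset in hex, and the inode in hex; the path below matches the container_logs input from the ConfigMap above:
# Columns: watched path, byte offset (hex), inode (hex)
cat /var/log/fluentd/containers.pos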
Set Resource Limits
Fluentd can consume significant memory with large buffers:
# Kubernetes resource limits
resources:
limits:
memory: 512Mi
cpu: 500m
requests:
memory: 256Mi
cpu: 100m
Monitor actual usage and adjust based on your log volume.
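As a simple starting point, graph resident memory as a fraction of the 512Mi limit above; the divisor is hard-coded for illustration and should track whatever limit you actually set:
# Fraction of the 512Mi memory limit in use per Fluentd instance
process_resident_memory_bytes{job="fluentd"} / (512 * 1024 * 1024)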
Buffer Tuning
Configure buffers to handle spikes without running out of memory:
<match **>
@type forward
<buffer>
@type file
path /var/log/fluentd/buffer
chunk_limit_size 8MB
total_limit_size 2GB
flush_interval 5s
retry_max_interval 30s
retry_forever true
</buffer>
</match>
File-based buffers survive restarts. Memory buffers are faster but lost on crash.
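With the 2GB total_limit_size above, a quick way to see how full the file buffer is (the divisor mirrors that setting; adjust it if you change the limit):
# Fraction of the configured buffer limit currently in use
fluentd_output_status_buffer_total_bytes / (2 * 1024 * 1024 * 1024)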
Separate Metrics Port from Log Port
Keep the Prometheus metrics port (24231) separate from log ingestion ports (24224):
# Metrics - expose to Prometheus
<source>
@type prometheus
bind 0.0.0.0
port 24231
</source>
# Log ingestion - internal only
<source>
@type forward
bind 0.0.0.0
port 24224
</source>
This allows different network policies for observability vs. data ingestion.
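For example, a NetworkPolicy can restrict the metrics port to your monitoring namespace while leaving 24224 governed by a separate rule; the namespace and pod labels below are assumptions for illustration:
# Allow only the monitoring namespace to reach the metrics port
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fluentd-metrics
  namespace: logging
spec:
  podSelector:
    matchLabels:
      app: fluentd
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 24231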
Troubleshooting
Metrics Not Appearing
- Check plugin installation: fluent-gem list | grep prometheus
- Verify configuration syntax: fluentd --dry-run -c /etc/fluentd/fluent.conf
- Check the metrics endpoint: curl http://localhost:24231/metrics | grep fluentd_tail
- Check the Fluentd logs: kubectl logs -n logging -l app=fluentd
Position Not Increasing
If fluentd_tail_file_position is static:
- Check if logs are being written: tail -f /var/log/application.log
- Check file permissions: ls -la /var/log/application.log
- Verify pos_file is writable: ls -la /var/log/fluentd/
- Check for parsing errors in the Fluentd logs: grep -i error /var/log/fluentd/fluentd.log
High Buffer Queue
If fluentd_output_status_buffer_queue_length is growing:
- Check downstream connectivity: nc -zv log-aggregator.logging.svc 24224
- Increase flush workers: <buffer> flush_thread_count 4 </buffer>
- Check for rate limiting at the destination
Integration with Log Management
The metrics from prometheus_tail_monitor complement your log management solution. Use them to:
- Verify logs are flowing before investigating missing data
- Correlate log gaps with infrastructure issues
- Capacity plan based on log volume trends
- Alert before log storage fills up
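For the last point, a node_exporter expression can warn before the log volume fills; the mountpoint label is an assumption, so adjust it to your node layout:
# Less than 10% free space on the volume that holds /var/log
node_filesystem_avail_bytes{mountpoint="/var/log"}
  / node_filesystem_size_bytes{mountpoint="/var/log"} < 0.10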
For comprehensive observability, combine with OpenTelemetry for distributed tracing to correlate logs, metrics, and traces.
Summary
Monitoring your log pipeline with prometheus_tail_monitor provides visibility into a critical piece of infrastructure. The setup is straightforward:
- Install fluent-plugin-prometheus
- Add prometheus, prometheus_monitor, and prometheus_tail_monitor sources
- Configure Prometheus to scrape Fluentd
- Create alerts for stuck positions and buffer issues
- Build dashboards for operational visibility
This ensures you know about logging problems before you need the logs.
Need Help With Log Monitoring?
We implement production logging and monitoring pipelines for organizations running Kubernetes. Our Prometheus consulting services include Fluentd integration, alerting, and dashboard development.
Book a free 30-minute consultation to discuss your observability needs.