Introduction
Prometheus has become the de facto standard for monitoring Kubernetes clusters and cloud-native applications. While getting Prometheus running is straightforward, setting it up for production with proper security, authentication, persistence, and alerting requires careful planning.
In this comprehensive guide, I’ll walk you through deploying a production-ready Prometheus stack on Kubernetes with:
- Secure authentication using HashiCorp Vault and Traefik middleware
- High availability configuration with StatefulSets
- Persistent storage for long-term metrics retention
- Comprehensive alerting rules for infrastructure and application monitoring
- Network policies for zero-trust security
- Resource management to prevent resource exhaustion
- Best practices for production deployments
By the end of this guide, you’ll have a robust monitoring solution that’s secure, scalable, and ready for production workloads.
Why Production-Ready Prometheus Matters
Many teams start with a basic Prometheus deployment, only to encounter issues in production:
- Data loss from lack of persistent storage
- Security vulnerabilities from exposed, unauthenticated endpoints
- Resource exhaustion from missing resource limits
- Alert fatigue from poorly configured alerting rules
- Network security gaps without proper network policies
This guide addresses all these concerns with battle-tested configurations.
Architecture Overview
Our Prometheus deployment consists of several components:
```text
┌─────────────────────────────────────────────────────────────┐
│                      Internet / Users                       │
└───────────────────────────┬─────────────────────────────────┘
                            │ HTTPS (TLS)
                            ▼
                ┌───────────────────────┐
                │    Traefik Ingress    │
                │    + Basic Auth       │
                │    + TLS Termination  │
                └───────────┬───────────┘
                            │
                            ▼
            ┌───────────────────────────────┐
            │       Prometheus Server       │
            │   - Metrics Collection        │
            │   - Query Engine              │
            │   - TSDB Storage              │
            │   - Alert Evaluation          │
            └───────┬──────────┬────────────┘
                    │          │
        ┌───────────┘          └────────────┐
        │                                   │
        ▼                                   ▼
┌───────────────┐                   ┌──────────────┐
│ Alertmanager  │                   │  Exporters   │
│ - Alert       │                   │              │
│   Routing     │                   │ • Node       │
│ - Dedup       │                   │ • Kube State │
│ - Silencing   │                   │ • Pushgateway│
└───────────────┘                   └──────────────┘
        │
        ▼
┌────────────────┐
│ Alert Channels │
│ (Slack, Email) │
└────────────────┘
```
Components
- Prometheus Server: Core time-series database and query engine
- Alertmanager: Alert deduplication, grouping, and routing
- Node Exporter: Hardware and OS metrics (DaemonSet on every node)
- Kube State Metrics: Kubernetes API object metrics
- Pushgateway: Metrics collection for short-lived jobs
- Traefik Middleware: Authentication layer
- External Secrets: Secure credential management via Vault
Prerequisites
Before we begin, ensure you have:
Infrastructure Requirements
- Kubernetes cluster (v1.20+)
- kubectl configured and authenticated
- Helm v3 or Helmfile installed
- Storage class for persistent volumes (50Gi for Prometheus, 2Gi for Alertmanager)
Required Add-ons
- Traefik ingress controller
- cert-manager with Let’s Encrypt issuer configured
- HashiCorp Vault deployed and configured
- External Secrets Operator installed
```bash
# Install required CLI tools
brew install kubernetes-cli helm helmfile apache2-utils vault

# Verify versions
kubectl version --client
helm version
helmfile version
```
Step 1: Deploy Prometheus with Helmfile
We’ll use Helmfile for declarative, version-controlled deployments.
1.1 Create Helmfile Configuration
Create helmfile.yaml:
```yaml
helmDefaults:
  createNamespace: true
  timeout: 300
  wait: false

repositories:
  - name: prometheus-community
    url: https://prometheus-community.github.io/helm-charts

releases:
  - name: prometheus
    namespace: prometheus
    chart: prometheus-community/prometheus
    version: "28.6.0"
    values:
      - ./values.yml
```
1.2 Create Comprehensive Values Configuration
Create values.yml with production-ready settings:
```yaml
# Prometheus Server Configuration
server:
  baseURL: "https://prometheus.example.com"
  replicaCount: 1

  statefulSet:
    enabled: true  # Use StatefulSet for stable network identity

  # Resource Management - Critical for stability
  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
    requests:
      cpu: 500m
      memory: 1Gi

  # Persistent Storage - CRITICAL for data retention
  persistentVolume:
    enabled: true
    size: 50Gi
    # storageClass: ""  # Use default storage class

  # Data Retention
  retention: "30d"       # Keep metrics for 30 days
  retentionSize: "45GB"  # Max storage size

  # Security Context - Run as non-root
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    runAsGroup: 65534
    fsGroup: 65534

  # High Availability
  podDisruptionBudget:
    enabled: true
    minAvailable: 1

  # Anti-affinity for spreading pods across nodes
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - prometheus-server
            topologyKey: kubernetes.io/hostname

  # Ingress Configuration
  ingress:
    enabled: true
    ingressClassName: traefik
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod-issuer
      traefik.ingress.kubernetes.io/router.middlewares: prometheus-prometheus-auth@kubernetescrd
    hosts:
      - prometheus.example.com
    tls:
      - secretName: prometheus-tls
        hosts:
          - prometheus.example.com

  # Global Prometheus Configuration
  global:
    scrape_interval: 30s
    scrape_timeout: 10s
    evaluation_interval: 30s
    external_labels:
      cluster: "production"
      environment: "production"

# Enable ServiceMonitor support
serviceMonitor:
  enabled: true

# Alertmanager Configuration
alertmanager:
  enabled: true
  replicaCount: 2  # HA setup
  resources:
    limits:
      cpu: 200m
      memory: 256Mi
    requests:
      cpu: 50m
      memory: 128Mi
  persistentVolume:
    enabled: true
    size: 2Gi
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    runAsGroup: 65534
    fsGroup: 65534
  podDisruptionBudget:
    enabled: true
    minAvailable: 1

# Node Exporter - Metrics from every node
nodeExporter:
  enabled: true
  resources:
    limits:
      cpu: 200m
      memory: 50Mi
    requests:
      cpu: 50m
      memory: 30Mi
  # Host networking for accurate metrics
  hostNetwork: true
  hostPID: true

# Kube State Metrics - Kubernetes object metrics
kubeStateMetrics:
  enabled: true
  resources:
    limits:
      cpu: 200m
      memory: 256Mi
    requests:
      cpu: 50m
      memory: 128Mi

# Pushgateway - For batch jobs
pushgateway:
  enabled: true
  resources:
    limits:
      cpu: 200m
      memory: 256Mi
    requests:
      cpu: 50m
      memory: 128Mi
  persistentVolume:
    enabled: true
    size: 2Gi
```
1.3 Deploy Prometheus
```bash
# Navigate to your prometheus directory
cd k8s/releases/prometheus

# Preview what will be deployed
helmfile template

# Deploy Prometheus
helmfile sync

# Watch the deployment
kubectl get pods -n prometheus -w
```
You should see all components starting up:
```text
NAME                                            READY   STATUS    RESTARTS   AGE
prometheus-server-0                             1/1     Running   0          2m
prometheus-alertmanager-0                       1/1     Running   0          2m
prometheus-alertmanager-1                       1/1     Running   0          2m
prometheus-node-exporter-xxxxx                  1/1     Running   0          2m
prometheus-kube-state-metrics-xxxxxxxxx-xxxxx   1/1     Running   0          2m
prometheus-pushgateway-xxxxxxxxx-xxxxx          1/1     Running   0          2m
```
Step 2: Configure Alerting Rules
Alert rules are the backbone of proactive monitoring. Let’s configure comprehensive alerts.
2.1 Node and Infrastructure Alerts
Add to your values.yml under serverFiles:
```yaml
serverFiles:
  alerting_rules.yml:
    groups:
      - name: node_alerts
        rules:
          # Critical: Node completely down
          - alert: NodeDown
            expr: up{job="prometheus-node-exporter"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.instance }} is down"
              description: "Node {{ $labels.instance }} has been down for more than 5 minutes"

          # Warning: High CPU usage
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is above 80% for 10 minutes (current: {{ $value | humanize }}%)"

          # Warning: High memory usage
          - alert: HighMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is above 85% (current: {{ $value | humanize }}%)"

          # Warning: Low disk space
          - alert: DiskSpaceLow
            expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Disk space is below 15% (current: {{ $value | humanize }}%)"

          # Critical: Very low disk space
          - alert: DiskSpaceCritical
            expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Critical disk space on {{ $labels.instance }}"
              description: "Disk space is below 10% (current: {{ $value | humanize }}%)"
```
2.2 Kubernetes Alerts
```yaml
      - name: kubernetes_alerts
        rules:
          # Critical: Pod crash looping
          - alert: PodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
              description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"

          # Warning: Pod not ready
          - alert: PodNotReady
            expr: kube_pod_status_phase{phase!~"Running|Succeeded"} == 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
              description: "Pod has been in {{ $labels.phase }} state for more than 15 minutes"

          # Warning: Deployment replicas mismatch
          - alert: DeploymentReplicasMismatch
            expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has mismatched replicas"
              description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched its desired replica count for 15 minutes"

          # Warning: PersistentVolume space low
          - alert: PersistentVolumeSpaceLow
            expr: (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 15
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} space low"
              description: "PVC has less than 15% space available (current: {{ $value | humanize }}%)"
```
2.3 Prometheus Self-Monitoring Alerts
```yaml
      - name: prometheus_alerts
        rules:
          # Critical: Config reload failed
          - alert: PrometheusConfigReloadFailed
            expr: prometheus_config_last_reload_successful == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Prometheus config reload failed"
              description: "Prometheus {{ $labels.instance }} config reload has failed"

          # Warning: TSDB compactions failing
          - alert: PrometheusTSDBCompactionsFailed
            expr: rate(prometheus_tsdb_compactions_failed_total[5m]) > 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Prometheus TSDB compactions failing"

          # Critical: Not connected to Alertmanager
          - alert: PrometheusNotConnectedToAlertmanager
            expr: prometheus_notifications_alertmanagers_discovered < 1
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Prometheus not connected to Alertmanager"
```
2.4 Recording Rules
Recording rules pre-compute frequently used queries:
```yaml
  recording_rules.yml:
    groups:
      - name: node_recording_rules
        interval: 30s
        rules:
          - record: instance:node_cpu_utilization:rate5m
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
          - record: instance:node_memory_utilization:ratio
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
          - record: instance:node_disk_utilization:ratio
            expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes
```
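A recorded series can then back a cheaper alert expression. For example, the CPU alert from section 2.1 could be rewritten against the pre-computed series (threshold and timing here are illustrative, not prescriptive):

```yaml
          - alert: HighCPUUsageRecorded
            expr: instance:node_cpu_utilization:rate5m > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
```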
Step 3: Secure Prometheus with Vault-Backed Authentication
One of the biggest security mistakes is exposing Prometheus without authentication. Let’s fix that with a production-grade solution.
3.1 The Security Architecture
We’ll implement defense-in-depth with three layers:
- HashiCorp Vault: Secure credential storage
- External Secrets Operator: Sync credentials to Kubernetes
- Traefik Middleware: Enforce authentication
```text
┌──────────────┐
│    Vault     │  ← Credentials stored securely
└──────┬───────┘
       │
       │ External Secrets Operator syncs
       ▼
┌──────────────┐
│  K8s Secret  │  ← Auto-synced every 5 minutes
└──────┬───────┘
       │
       │ Referenced by Traefik Middleware
       ▼
┌──────────────┐
│  Middleware  │  ← Enforces Basic Auth
└──────┬───────┘
       │
       │ Applied to Ingress
       ▼
┌──────────────┐
│  Prometheus  │  ← Protected!
└──────────────┘
```
3.2 Generate Authentication Credentials
```bash
# Generate htpasswd hash
htpasswd -nb admin YourSecurePassword123!
# Output: admin:$apr1$xyz123abc$defgh456...
# Copy the full output, you'll need it next
```
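If you don’t already have a password policy, a quick way to produce a strong random password before hashing it is Python’s `secrets` module (a sketch; any generator of comparable entropy works just as well):

```python
# Generate a 24-character random password for the basic-auth user;
# feed the result into htpasswd as shown above.
import secrets
import string

alphabet = string.ascii_letters + string.digits + "!@#%^*-_"
password = "".join(secrets.choice(alphabet) for _ in range(24))
print(password)
```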
3.3 Store Credentials in Vault
```bash
# Connect to Vault
kubectl exec -it vault-0 -n vault -- sh

# Login to Vault
vault login
# Enter your root token

# Store the credentials
vault kv put kv/infra/prometheus-basic-auth \
  users='admin:$apr1$xyz123abc$defgh456...'

# Verify
vault kv get kv/infra/prometheus-basic-auth

# Exit
exit
```
3.4 Create ExternalSecret Resource
Create external-secret.yml:
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: prometheus-basic-auth
  namespace: prometheus
spec:
  refreshInterval: 5m  # Sync from Vault every 5 minutes
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: prometheus-basic-auth
    creationPolicy: Owner
  data:
    - secretKey: users
      remoteRef:
        key: infra/prometheus-basic-auth
        property: users
```
Apply it:
```bash
kubectl apply -f external-secret.yml

# Verify the secret is synced
kubectl get externalsecret prometheus-basic-auth -n prometheus
# Should show: STATUS: SecretSynced

kubectl get secret prometheus-basic-auth -n prometheus
# Should exist
```
3.5 Create Traefik Middleware
Create auth-middleware.yml:
```yaml
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: prometheus-auth
  namespace: prometheus
spec:
  basicAuth:
    secret: prometheus-basic-auth
    removeHeader: true  # Don't pass auth header to backend
```
Apply it:
```bash
kubectl apply -f auth-middleware.yml

# Verify
kubectl get middleware -n prometheus
```
3.6 Verify Authentication
```bash
# Test without credentials - should return 401
curl -I https://prometheus.example.com
# HTTP/2 401

# Test with credentials - should return 200
curl -I -u admin:YourSecurePassword123! https://prometheus.example.com
# HTTP/2 200
```
Success! Your Prometheus is now protected with Vault-backed authentication.
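Under the hood, curl’s `-u` flag simply sends an `Authorization: Basic` header containing the base64-encoded `user:password` pair, which Traefik checks against the htpasswd hash in the secret. A minimal sketch of what goes over the wire:

```python
# Build the Basic Auth header exactly as curl -u does.
import base64

creds = base64.b64encode(b"admin:YourSecurePassword123!").decode()
header = f"Authorization: Basic {creds}"
print(header.startswith("Authorization: Basic "))  # True
```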
3.7 Rotating Credentials
To rotate credentials:
```bash
# Generate new password hash
htpasswd -nb admin NewSecurePassword456!

# Update in Vault
kubectl exec -it vault-0 -n vault -- sh
vault login
vault kv put kv/infra/prometheus-basic-auth \
  users='admin:$apr1$new-hash...'
exit

# ExternalSecret will auto-sync within 5 minutes
# Or force immediate sync:
kubectl delete secret prometheus-basic-auth -n prometheus
# External Secrets Operator recreates it immediately
```
Step 4: Configure Alertmanager
Alertmanager handles deduplication, grouping, and routing of alerts.
Add to values.yml under alertmanagerFiles:
```yaml
alertmanagerFiles:
  alertmanager.yml:
    global:
      resolve_timeout: 5m

    route:
      group_by: ["alertname", "cluster", "service"]
      group_wait: 10s       # Wait before sending first notification
      group_interval: 10s   # Wait before sending batch of new alerts
      repeat_interval: 12h  # Re-send after this time
      receiver: "default"
      routes:
        # Critical alerts to immediate notification
        - match:
            severity: critical
          receiver: "critical"
          continue: true
        # Warnings to different channel
        - match:
            severity: warning
          receiver: "warning"

    receivers:
      - name: "default"
        # Default receiver configuration
      - name: "critical"
        # Example: Slack for critical alerts
        # slack_configs:
        #   - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        #     channel: '#critical-alerts'
        #     title: 'Critical Alert'
        #     text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}'
      - name: "warning"
        # Example: Slack for warnings
        # slack_configs:
        #   - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        #     channel: '#warnings'
        #     title: 'Warning Alert'

    # Inhibit rules: Suppress lower severity alerts when higher ones fire
    inhibit_rules:
      - source_match:
          severity: "critical"
        target_match:
          severity: "warning"
        equal: ["alertname", "instance"]
```
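The three timing knobs interact in a way that often trips people up, so here is a small sketch of when notifications fire for a single alert group under the values above (a simplified model, with times as plain seconds):

```python
# Simplified model of Alertmanager notification timing for one group.
GROUP_WAIT = 10              # delay before the first notification of a new group
GROUP_INTERVAL = 10          # delay before a batch containing *new* alerts
REPEAT_INTERVAL = 12 * 3600  # delay before re-sending an unchanged group

def first_notification(group_created_at):
    # First notification waits group_wait to collect related alerts.
    return group_created_at + GROUP_WAIT

def batch_with_new_alerts(last_sent_at):
    # New alerts joining an existing group wait group_interval.
    return last_sent_at + GROUP_INTERVAL

def repeat_unchanged(last_sent_at):
    # An unchanged firing group is re-sent after repeat_interval.
    return last_sent_at + REPEAT_INTERVAL

print(first_notification(0))      # 10
print(batch_with_new_alerts(10))  # 20
print(repeat_unchanged(10))       # 43210
```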
To send alerts to Slack:
- Create a Slack webhook URL
- Store it in Vault:
```bash
kubectl exec -it vault-0 -n vault -- sh
vault login
vault kv put kv/infra/alertmanager-slack \
  webhook_url='https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
exit
```
- Create ExternalSecret:
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: alertmanager-slack
  namespace: prometheus
spec:
  refreshInterval: 5m
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: alertmanager-slack
  data:
    - secretKey: webhook_url
      remoteRef:
        key: infra/alertmanager-slack
        property: webhook_url
```
- Reference in Alertmanager config:
```yaml
receivers:
  - name: "critical"
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/alertmanager-slack/webhook_url
        channel: "#critical-alerts"
```
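Note that `api_url_file` only works if the synced secret is actually mounted into the Alertmanager container. A hedged sketch, assuming your chart version exposes an `alertmanager.extraSecretMounts` value (the exact key can differ between chart versions, so check your chart’s values reference):

```yaml
alertmanager:
  extraSecretMounts:
    - name: alertmanager-slack        # volume name (illustrative)
      secretName: alertmanager-slack  # the secret created by ExternalSecret
      mountPath: /etc/alertmanager/secrets/alertmanager-slack
      readOnly: true
```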
Step 5: Implement Network Policies for Zero-Trust Security
Network policies ensure pods can only communicate with authorized services.
5.1 Prometheus Server Network Policy
Create network-policy.yml:
```yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-server-network-policy
  namespace: prometheus
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow from Grafana
    - from:
        - namespaceSelector:
            matchLabels:
              name: grafana
      ports:
        - protocol: TCP
          port: 9090
    # Allow from Traefik ingress
    - from:
        - namespaceSelector:
            matchLabels:
              name: traefik
      ports:
        - protocol: TCP
          port: 9090
    # Allow internal communication
    - from:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 9090
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
    # Allow Kubernetes API
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443
    # Allow scraping exporters
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus-node-exporter
      ports:
        - protocol: TCP
          port: 9100
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: kube-state-metrics
      ports:
        - protocol: TCP
          port: 8080
    # Allow Alertmanager
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus-alertmanager
      ports:
        - protocol: TCP
          port: 9093
    # Allow scraping pods across namespaces
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 9090
```
5.2 Apply Network Policies
```bash
kubectl apply -f network-policy.yml

# Verify
kubectl get networkpolicy -n prometheus
```
Important: Ensure your namespace has the correct labels:
```bash
kubectl label namespace grafana name=grafana
kubectl label namespace traefik name=traefik
kubectl label namespace kube-system name=kube-system
```
Step 6: Verification and Testing
6.1 Verify All Components
```bash
# Check pods
kubectl get pods -n prometheus

# Check PVCs
kubectl get pvc -n prometheus

# Check ingress
kubectl get ingress -n prometheus

# Check secrets
kubectl get secrets -n prometheus

# Check middleware
kubectl get middleware -n prometheus
```
6.2 Access Prometheus UI
Navigate to https://prometheus.example.com and login with your credentials.
6.3 Verify Metrics Collection
Run these queries in the Prometheus UI:
```promql
# Check all targets are up
up

# Node CPU usage
instance:node_cpu_utilization:rate5m

# Pod count by namespace
count by(namespace) (kube_pod_info)

# Memory usage
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```
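The same queries can be run against Prometheus’s HTTP API instead of the UI. A sketch of building the query URL (the hostname is the example domain used throughout this guide; send the request with your basic-auth credentials):

```python
# Build a /api/v1/query URL for the Prometheus HTTP API.
from urllib.parse import urlencode

def prom_query_url(base, promql):
    # urlencode handles PromQL characters that need escaping.
    return f"{base}/api/v1/query?" + urlencode({"query": promql})

url = prom_query_url("https://prometheus.example.com", "up")
print(url)  # https://prometheus.example.com/api/v1/query?query=up
```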
6.4 Test Alerting
Create a test alert:
```bash
# Trigger a test alert by creating a crashlooping pod
kubectl run crashpod --image=busybox -- /bin/sh -c "exit 1"

# Check alerts in Prometheus UI
# Navigate to: Alerts tab

# Check Alertmanager
kubectl port-forward -n prometheus svc/prometheus-alertmanager 9093:9093
# Open: http://localhost:9093
```
Troubleshooting Guide
Issue 1: Authentication Not Working
Symptoms: Getting 401 even with correct credentials
Solutions:
```bash
# Check middleware exists
kubectl get middleware prometheus-auth -n prometheus

# Check secret content
kubectl get secret prometheus-basic-auth -n prometheus -o jsonpath='{.data.users}' | base64 -d

# Check ExternalSecret status
kubectl describe externalsecret prometheus-basic-auth -n prometheus

# Verify Vault has credentials
kubectl exec -it vault-0 -n vault -- vault kv get kv/infra/prometheus-basic-auth

# Check ingress annotations
kubectl get ingress -n prometheus -o yaml | grep middleware
```
Issue 2: High Memory Usage
Symptoms: Prometheus pod getting OOMKilled
Solutions:
```bash
# Check current usage
kubectl top pods -n prometheus

# Check cardinality (high cardinality = high memory)
kubectl port-forward -n prometheus svc/prometheus-server 9090:80
curl http://localhost:9090/api/v1/status/tsdb | jq .

# Solutions:
# 1. Increase memory limits in values.yml
# 2. Reduce retention period
# 3. Drop high-cardinality metrics with relabel configs
```
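To find the offenders quickly, the `/api/v1/status/tsdb` response includes a `seriesCountByMetricName` list you can sort. A sketch against a canned response (the numbers are made up; feed it the real JSON from the curl call above):

```python
# Rank metrics by series count from a /api/v1/status/tsdb response.
tsdb_status = {
    "data": {
        "seriesCountByMetricName": [
            {"name": "go_gc_duration_seconds", "value": 12000},
            {"name": "up", "value": 300},
        ]
    }
}
ranked = sorted(tsdb_status["data"]["seriesCountByMetricName"],
                key=lambda m: m["value"], reverse=True)
print(ranked[0]["name"])  # go_gc_duration_seconds
```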
Issue 3: Targets Not Scraped
Symptoms: Targets showing as down in Prometheus UI
Solutions:
```bash
# Check service discovery
kubectl port-forward -n prometheus svc/prometheus-server 9090:80
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health!="up")'

# Check RBAC permissions
kubectl get clusterrole prometheus-server -o yaml

# Check network policies
kubectl get networkpolicy -n prometheus -o yaml

# Test connectivity
kubectl exec -it prometheus-server-0 -n prometheus -- wget -O- http://prometheus-node-exporter:9100/metrics
```
Issue 4: PVC Not Binding
Symptoms: Pod stuck in Pending, PVC not bound
Solutions:
```bash
# Check PVC status
kubectl get pvc -n prometheus
kubectl describe pvc <pvc-name> -n prometheus

# Check storage class
kubectl get storageclass

# Verify storage provisioner is running
kubectl get pods -n kube-system | grep provisioner

# Manual fix: Delete PVC and let it recreate
kubectl delete pvc -n prometheus <pvc-name>
```
Issue 5: Alertmanager Not Receiving Alerts
Symptoms: Alerts firing in Prometheus but not in Alertmanager
Solutions:
```bash
# Check Prometheus alerting config
kubectl port-forward -n prometheus svc/prometheus-server 9090:80
curl http://localhost:9090/api/v1/alertmanagers

# Check Alertmanager status
kubectl port-forward -n prometheus svc/prometheus-alertmanager 9093:9093
curl http://localhost:9093/api/v1/status

# Check alert rules
curl http://localhost:9090/api/v1/rules

# Check Alertmanager logs
kubectl logs -n prometheus prometheus-alertmanager-0 -f
```
Issue 6: External Secrets Not Syncing
Symptoms: Secret not created by ExternalSecret
Solutions:
```bash
# Check ExternalSecret status
kubectl describe externalsecret prometheus-basic-auth -n prometheus

# Check ClusterSecretStore
kubectl get clustersecretstore vault-backend
kubectl describe clustersecretstore vault-backend

# Check External Secrets Operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets

# Test Vault connectivity
EXTERNAL_SECRETS_POD=$(kubectl get pod -n external-secrets -l app.kubernetes.io/name=external-secrets -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n external-secrets $EXTERNAL_SECRETS_POD -- wget -O- http://vault.vault:8200/v1/sys/health
```
Security Best Practices
1. Credential Management
✅ DO:
- Store all credentials in Vault
- Use External Secrets Operator for K8s secret sync
- Rotate credentials regularly (quarterly minimum)
- Use strong, randomly generated passwords
❌ DON’T:
- Commit credentials to git
- Use default passwords
- Share credentials across environments
- Store credentials in plain text ConfigMaps
2. Network Security
✅ DO:
- Implement network policies for all components
- Use namespace isolation
- Restrict egress to necessary destinations
- Label namespaces for policy enforcement
❌ DON’T:
- Allow unrestricted pod-to-pod communication
- Expose Alertmanager publicly without auth
- Allow broad egress to internet
3. Resource Management
✅ DO:
- Set resource requests and limits
- Use PodDisruptionBudgets
- Monitor resource usage
- Scale based on metrics
❌ DON’T:
- Run without resource limits
- Ignore OOMKilled pods
- Over-provision resources
4. Data Retention
✅ DO:
- Use persistent volumes
- Configure appropriate retention
- Implement backup strategy
- Monitor storage usage
❌ DON’T:
- Rely on ephemeral storage
- Set unlimited retention
- Ignore disk space alerts
5. Alert Configuration
✅ DO:
- Use severity labels
- Implement inhibit rules
- Test alerts regularly
- Document runbooks
❌ DON’T:
- Alert on everything
- Ignore alert fatigue
- Skip testing alerts
- Forget to document remediation
Performance Tuning
Tune Query Performance
Limit concurrent and long-running queries via extra server arguments:

```yaml
server:
  extraArgs:
    query.max-concurrency: 20
    query.timeout: 2m
    query.lookback-delta: 5m
```

Tune TSDB Block Duration
Fix the minimum and maximum block duration to disable local compaction (commonly required when an external system ships and compacts blocks):

```yaml
server:
  extraArgs:
    storage.tsdb.min-block-duration: 2h
    storage.tsdb.max-block-duration: 2h
```
Reduce Cardinality
Use metric relabeling to drop metrics you don’t need, e.g. Go runtime internals:

```yaml
server:
  extraScrapeConfigs: |
    - job_name: "kubernetes-pods"
      metric_relabel_configs:
        # Drop Go runtime metrics by name
        - source_labels: [__name__]
          regex: "go_.*"
          action: drop
```
Integration with Grafana
Prometheus integrates seamlessly with Grafana for visualization.
Add Prometheus as Datasource
In Grafana’s datasource configuration:
```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.prometheus:80
    isDefault: true
    jsonData:
      timeInterval: 30s
```
Recommended Dashboards
Import these community dashboards:
- 1860 - Node Exporter Full (comprehensive node metrics)
- 7249 - Kubernetes Cluster Monitoring
- 6417 - Kubernetes Pod Monitoring
- 315 - Kubernetes Cluster (Prometheus)
Monitoring Your Monitoring
Don’t forget to monitor Prometheus itself!
Key Metrics to Watch
```promql
# Prometheus memory usage
process_resident_memory_bytes

# Prometheus CPU usage
rate(process_cpu_seconds_total[5m])

# Time series count (cardinality)
prometheus_tsdb_head_series

# Query duration
histogram_quantile(0.99, rate(prometheus_http_request_duration_seconds_bucket[5m]))

# Sample ingestion rate
rate(prometheus_tsdb_head_samples_appended_total[5m])
```
Consider a separate monitoring cluster to monitor your production Prometheus:
```text
┌─────────────────┐
│  Prod Cluster   │
│  - Prometheus   │───┐
│  - Applications │   │
└─────────────────┘   │
                      │ Federation or
                      │ Remote Write
                      ▼
              ┌──────────────┐
              │ Meta Monitor │
              │ - Prometheus │
              │ - Grafana    │
              └──────────────┘
```
Backup and Disaster Recovery
Manual Backup
```bash
# Create snapshot (requires the server to run with --web.enable-admin-api)
PROMETHEUS_POD=$(kubectl get pod -n prometheus -l app.kubernetes.io/name=prometheus-server -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n prometheus $PROMETHEUS_POD -- \
  curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot

# Copy snapshot
kubectl cp prometheus/$PROMETHEUS_POD:/data/snapshots/<timestamp> ./prometheus-backup
```
Automated Backup with Velero
```bash
# Install Velero
velero install --provider aws --bucket prometheus-backups

# Create backup schedule
velero schedule create prometheus-daily \
  --schedule="0 2 * * *" \
  --include-namespaces prometheus \
  --ttl 720h

# Restore from backup
velero restore create --from-backup prometheus-daily-<timestamp>
```
Configuration Backup
Always backup your configuration:
```bash
# Backup all configuration
kubectl get -n prometheus \
  configmap,secret,externalsecret,middleware,ingress,pvc \
  -o yaml > prometheus-config-backup.yaml

# Store in git (excluding secrets!)
git add helmfile.yaml values.yml external-secret.yml auth-middleware.yml network-policy.yml
git commit -m "Backup Prometheus configuration"
```
Scaling Considerations
Vertical Scaling
For larger clusters, increase resources:
```yaml
server:
  resources:
    limits:
      cpu: 4000m
      memory: 8Gi
    requests:
      cpu: 1000m
      memory: 4Gi
  persistentVolume:
    size: 100Gi
  retention: "90d"
  retentionSize: "95GB"
```
Horizontal Scaling with Federation
For multi-cluster deployments, use federation:
```yaml
# On central Prometheus
server:
  extraScrapeConfigs: |
    - job_name: 'federate-clusters'
      scrape_interval: 30s
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
          - '{job=~".+"}'
      static_configs:
        - targets:
            - 'prometheus-cluster1.example.com'
            - 'prometheus-cluster2.example.com'
```
Remote Storage
For long-term retention, use remote storage:
```yaml
server:
  remoteWrite:
    - url: "http://cortex.monitoring:9009/api/prom/push"
      queue_config:
        capacity: 10000
        max_shards: 200
        min_shards: 1
        max_samples_per_send: 500
        batch_send_deadline: 5s
```
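A quick sanity check on those queue settings: even if every one of the 200 shards only flushed its 500-sample batches at the 5-second deadline, the queue would still move a substantial volume:

```python
# Back-of-envelope throughput for the remote-write queue settings above,
# assuming each shard flushes a full batch only at the send deadline.
max_shards = 200
max_samples_per_send = 500
batch_send_deadline_s = 5

samples_per_sec = max_shards * max_samples_per_send / batch_send_deadline_s
print(samples_per_sec)  # 20000.0
```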
Cost Optimization
Reduce Storage Costs
- Tune retention:
```yaml
retention: "15d"       # Instead of 30d
retentionSize: "20GB"  # Instead of 45GB
```
- Drop unnecessary metrics:
```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "(go_.*|process_.*)"
    action: drop
```
- Use remote storage with compression
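To sanity-check a retention choice against your disk, the usual sizing rule of thumb is `needed_disk ≈ retention_time × ingested_samples_per_second × bytes_per_sample`, with Prometheus samples typically compressing to 1-2 bytes. A sketch with an assumed ingestion rate:

```python
# Rough disk estimate: retention_seconds * samples/s * bytes per sample.
retention_days = 15
samples_per_sec = 10_000  # assumed rate; read yours from the ingestion query above
bytes_per_sample = 2      # conservative upper end of typical compression

needed_gib = retention_days * 86400 * samples_per_sec * bytes_per_sample / 2**30
print(round(needed_gib, 1))  # 24.1
```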
Reduce Compute Costs
- Use recording rules for expensive queries
- Adjust scrape intervals:

```yaml
global:
  scrape_interval: 60s # Instead of 30s for non-critical metrics
```
- Right-size resources based on actual usage
Conclusion
You now have a production-ready Prometheus deployment with:
✅ Secure authentication via Vault and Traefik
✅ High availability with StatefulSets
✅ Persistent storage for data retention
✅ Comprehensive alerting rules
✅ Network policies for zero-trust security
✅ Resource limits to prevent exhaustion
✅ Integration with Grafana
✅ Backup and disaster recovery strategies
Next Steps
- Configure Alertmanager receivers (Slack, PagerDuty, email)
- Create Grafana dashboards for your applications
- Set up ServiceMonitors for application metrics
- Implement federation for multi-cluster monitoring
- Configure remote storage for long-term retention
- Test disaster recovery procedures
Key Takeaways
- Security is not optional: Always implement authentication and network policies
- Persistence matters: Use persistent volumes to prevent data loss
- Monitor your monitoring: Set up alerts for Prometheus itself
- Start simple, iterate: Begin with basic alerts and refine based on experience
- Document everything: Runbooks save time during incidents
- Test regularly: Verify backups and disaster recovery procedures
About the Author
This guide is based on real-world production deployments at scale. For questions or suggestions, feel free to reach out!
Tags: #kubernetes #prometheus #monitoring #observability #devops #security #vault #traefik #helm #production
Last Updated: January 30, 2026