Introduction
Prometheus has become the de facto standard for monitoring Kubernetes clusters and cloud-native applications. While getting Prometheus running is straightforward, setting it up for production with proper security, authentication, persistence, and alerting requires careful planning.
In this comprehensive guide, I’ll walk you through deploying a production-ready Prometheus stack on Kubernetes with:
- Secure authentication using HashiCorp Vault and Traefik middleware
- High availability configuration with StatefulSets
- Persistent storage for long-term metrics retention
- Comprehensive alerting rules for infrastructure and application monitoring
- Network policies for zero-trust security
- Resource management to prevent resource exhaustion
- Best practices for production deployments
By the end of this guide, you’ll have a robust monitoring solution that’s secure, scalable, and ready for production workloads.
Why Production-Ready Prometheus Matters
Many teams start with a basic Prometheus deployment, only to encounter issues in production:
- Data loss from lack of persistent storage
- Security vulnerabilities from exposed, unauthenticated endpoints
- Resource exhaustion from missing resource limits
- Alert fatigue from poorly configured alerting rules
- Network security gaps without proper network policies
This guide addresses all these concerns with battle-tested configurations.
Architecture Overview
Our Prometheus deployment consists of several components:
```text
┌─────────────────────────────────────────────────────────────┐
│                      Internet / Users                       │
└───────────────────────────┬─────────────────────────────────┘
                            │ HTTPS (TLS)
                            ▼
                ┌───────────────────────┐
                │    Traefik Ingress    │
                │    + Basic Auth       │
                │    + TLS Termination  │
                └───────────┬───────────┘
                            │
                            ▼
            ┌───────────────────────────────┐
            │       Prometheus Server       │
            │   - Metrics Collection        │
            │   - Query Engine              │
            │   - TSDB Storage              │
            │   - Alert Evaluation          │
            └───────┬──────────┬────────────┘
                    │          │
        ┌───────────┘          └────────────┐
        │                                   │
        ▼                                   ▼
┌───────────────┐                   ┌──────────────┐
│ Alertmanager  │                   │  Exporters   │
│ - Alert       │                   │              │
│   Routing     │                   │ • Node       │
│ - Dedup       │                   │ • Kube State │
│ - Silencing   │                   │ • Pushgateway│
└───────────────┘                   └──────────────┘
        │
        ▼
┌────────────────┐
│ Alert Channels │
│ (Slack, Email) │
└────────────────┘
```
Components
- Prometheus Server: Core time-series database and query engine
- Alertmanager: Alert deduplication, grouping, and routing
- Node Exporter: Hardware and OS metrics (DaemonSet on every node)
- Kube State Metrics: Kubernetes API object metrics
- Pushgateway: Metrics collection for short-lived jobs
- Traefik Middleware: Authentication layer
- External Secrets: Secure credential management via Vault
Prerequisites
Before we begin, ensure you have:
Infrastructure Requirements
- Kubernetes cluster (v1.20+)
- kubectl configured and authenticated
- Helm v3 or Helmfile installed
- Storage class for persistent volumes (50Gi for Prometheus, 2Gi for Alertmanager)
Required Add-ons
- Traefik ingress controller
- cert-manager with Let’s Encrypt issuer configured
- HashiCorp Vault deployed and configured
- External Secrets Operator installed
```bash
# Install required CLI tools
brew install kubernetes-cli helm helmfile apache2-utils vault

# Verify versions
kubectl version --client
helm version
helmfile version
```
Step 1: Deploy Prometheus with Helmfile
We’ll use Helmfile for declarative, version-controlled deployments.
1.1 Create Helmfile Configuration
Create helmfile.yaml:
```yaml
helmDefaults:
  createNamespace: true
  timeout: 300
  wait: false

repositories:
  - name: prometheus-community
    url: https://prometheus-community.github.io/helm-charts

releases:
  - name: prometheus
    namespace: prometheus
    chart: prometheus-community/prometheus
    version: "28.6.0"
    values:
      - ./values.yml
```
1.2 Create Comprehensive Values Configuration
Create values.yml with production-ready settings:
```yaml
# Prometheus Server Configuration
server:
  baseURL: "https://prometheus.example.com"
  replicaCount: 1

  statefulSet:
    enabled: true  # Use StatefulSet for stable network identity

  # Resource Management - Critical for stability
  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
    requests:
      cpu: 500m
      memory: 1Gi

  # Persistent Storage - CRITICAL for data retention
  persistentVolume:
    enabled: true
    size: 50Gi
    # storageClass: ""  # Use default storage class

  # Data Retention
  retention: "30d"       # Keep metrics for 30 days
  retentionSize: "45GB"  # Max storage size

  # Security Context - Run as non-root
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    runAsGroup: 65534
    fsGroup: 65534

  # High Availability
  podDisruptionBudget:
    enabled: true
    minAvailable: 1

  # Anti-affinity for spreading pods across nodes
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - prometheus-server
            topologyKey: kubernetes.io/hostname

  # Ingress Configuration
  ingress:
    enabled: true
    ingressClassName: traefik
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod-issuer
      traefik.ingress.kubernetes.io/router.middlewares: prometheus-prometheus-auth@kubernetescrd
    hosts:
      - prometheus.example.com
    tls:
      - secretName: prometheus-tls
        hosts:
          - prometheus.example.com

  # Global Prometheus Configuration
  global:
    scrape_interval: 30s
    scrape_timeout: 10s
    evaluation_interval: 30s
    external_labels:
      cluster: "production"
      environment: "production"

# Enable ServiceMonitor support
serviceMonitor:
  enabled: true

# Alertmanager Configuration
alertmanager:
  enabled: true
  replicaCount: 2  # HA setup
  resources:
    limits:
      cpu: 200m
      memory: 256Mi
    requests:
      cpu: 50m
      memory: 128Mi
  persistentVolume:
    enabled: true
    size: 2Gi
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    runAsGroup: 65534
    fsGroup: 65534
  podDisruptionBudget:
    enabled: true
    minAvailable: 1

# Node Exporter - Metrics from every node
nodeExporter:
  enabled: true
  resources:
    limits:
      cpu: 200m
      memory: 50Mi
    requests:
      cpu: 50m
      memory: 30Mi
  # Host networking for accurate metrics
  hostNetwork: true
  hostPID: true

# Kube State Metrics - Kubernetes object metrics
kubeStateMetrics:
  enabled: true
  resources:
    limits:
      cpu: 200m
      memory: 256Mi
    requests:
      cpu: 50m
      memory: 128Mi

# Pushgateway - For batch jobs
pushgateway:
  enabled: true
  resources:
    limits:
      cpu: 200m
      memory: 256Mi
    requests:
      cpu: 50m
      memory: 128Mi
  persistentVolume:
    enabled: true
    size: 2Gi
```
1.3 Deploy Prometheus
```bash
# Navigate to your prometheus directory
cd k8s/releases/prometheus

# Preview what will be deployed
helmfile template

# Deploy Prometheus
helmfile sync

# Watch the deployment
kubectl get pods -n prometheus -w
```
You should see all components starting up:
```text
NAME                                            READY   STATUS    RESTARTS   AGE
prometheus-server-0                             1/1     Running   0          2m
prometheus-alertmanager-0                       1/1     Running   0          2m
prometheus-alertmanager-1                       1/1     Running   0          2m
prometheus-node-exporter-xxxxx                  1/1     Running   0          2m
prometheus-kube-state-metrics-xxxxxxxxx-xxxxx   1/1     Running   0          2m
prometheus-pushgateway-xxxxxxxxx-xxxxx          1/1     Running   0          2m
```
Step 2: Configure Alerting Rules
Alert rules are the backbone of proactive monitoring. Let’s configure comprehensive alerts.
2.1 Node and Infrastructure Alerts
Add to your values.yml under serverFiles:
```yaml
serverFiles:
  alerting_rules.yml:
    groups:
      - name: node_alerts
        rules:
          # Critical: Node completely down
          - alert: NodeDown
            expr: up{job="prometheus-node-exporter"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.instance }} is down"
              description: "Node {{ $labels.instance }} has been down for more than 5 minutes"

          # Warning: High CPU usage
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is above 80% for 10 minutes (current: {{ $value | humanize }}%)"

          # Warning: High memory usage
          - alert: HighMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is above 85% (current: {{ $value | humanize }}%)"

          # Warning: Low disk space
          - alert: DiskSpaceLow
            expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Disk space is below 15% (current: {{ $value | humanize }}%)"

          # Critical: Very low disk space
          - alert: DiskSpaceCritical
            expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Critical disk space on {{ $labels.instance }}"
              description: "Disk space is below 10% (current: {{ $value | humanize }}%)"
```
2.2 Kubernetes Alerts
```yaml
      - name: kubernetes_alerts
        rules:
          # Critical: Pod crash looping
          - alert: PodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
              description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"

          # Warning: Pod not ready
          - alert: PodNotReady
            expr: kube_pod_status_phase{phase!~"Running|Succeeded"} == 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
              description: "Pod has been in {{ $labels.phase }} state for more than 15 minutes"

          # Warning: Deployment replicas mismatch
          - alert: DeploymentReplicasMismatch
            expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has mismatched replicas"
              description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched its desired replica count for 15 minutes"

          # Warning: PersistentVolume space low
          - alert: PersistentVolumeSpaceLow
            expr: (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 15
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} space low"
              description: "PVC has less than 15% space available (current: {{ $value | humanize }}%)"
```
2.3 Prometheus Self-Monitoring Alerts
```yaml
      - name: prometheus_alerts
        rules:
          # Critical: Config reload failed
          - alert: PrometheusConfigReloadFailed
            expr: prometheus_config_last_reload_successful == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Prometheus config reload failed"
              description: "Prometheus {{ $labels.instance }} config reload has failed"

          # Warning: TSDB compactions failing
          - alert: PrometheusTSDBCompactionsFailed
            expr: rate(prometheus_tsdb_compactions_failed_total[5m]) > 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Prometheus TSDB compactions failing"

          # Critical: Not connected to Alertmanager
          - alert: PrometheusNotConnectedToAlertmanager
            expr: prometheus_notifications_alertmanagers_discovered < 1
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Prometheus not connected to Alertmanager"
```
2.4 Recording Rules
Recording rules pre-compute frequently used queries:
```yaml
  recording_rules.yml:
    groups:
      - name: node_recording_rules
        interval: 30s
        rules:
          - record: instance:node_cpu_utilization:rate5m
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
          - record: instance:node_memory_utilization:ratio
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
          - record: instance:node_disk_utilization:ratio
            expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes
```
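A recorded series can then back a cheaper alert expression. For example, the CPU alert from section 2.1 could be rewritten against the pre-computed series (threshold and timing here are illustrative, not prescriptive):

```yaml
          - alert: HighCPUUsageRecorded
            expr: instance:node_cpu_utilization:rate5m > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
```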
Step 3: Secure Prometheus with Vault-Backed Authentication
One of the biggest security mistakes is exposing Prometheus without authentication. Let’s fix that with a production-grade solution.
3.1 The Security Architecture
We’ll implement defense-in-depth with three layers:
- HashiCorp Vault: Secure credential storage
- External Secrets Operator: Sync credentials to Kubernetes
- Traefik Middleware: Enforce authentication
```text
┌──────────────┐
│    Vault     │  ← Credentials stored securely
└──────┬───────┘
       │
       │ External Secrets Operator syncs
       ▼
┌──────────────┐
│  K8s Secret  │  ← Auto-synced every 5 minutes
└──────┬───────┘
       │
       │ Referenced by Traefik Middleware
       ▼
┌──────────────┐
│  Middleware  │  ← Enforces Basic Auth
└──────┬───────┘
       │
       │ Applied to Ingress
       ▼
┌──────────────┐
│  Prometheus  │  ← Protected!
└──────────────┘
```
3.2 Generate Authentication Credentials
```bash
# Generate htpasswd hash
htpasswd -nb admin YourSecurePassword123!
# Output: admin:$apr1$xyz123abc$defgh456...
# Copy the full output, you'll need it next
```
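If you don’t already have a password policy, a quick way to produce a strong random password before hashing it is Python’s `secrets` module (a sketch; any generator of comparable entropy works just as well):

```python
# Generate a 24-character random password for the basic-auth user;
# feed the result into htpasswd as shown above.
import secrets
import string

alphabet = string.ascii_letters + string.digits + "!@#%^*-_"
password = "".join(secrets.choice(alphabet) for _ in range(24))
print(password)
```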
3.3 Store Credentials in Vault
```bash
# Connect to Vault
kubectl exec -it vault-0 -n vault -- sh

# Login to Vault
vault login
# Enter your root token

# Store the credentials
vault kv put kv/infra/prometheus-basic-auth \
  users='admin:$apr1$xyz123abc$defgh456...'

# Verify
vault kv get kv/infra/prometheus-basic-auth

# Exit
exit
```
3.4 Create ExternalSecret Resource
Create external-secret.yml:
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: prometheus-basic-auth
  namespace: prometheus
spec:
  refreshInterval: 5m  # Sync from Vault every 5 minutes
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: prometheus-basic-auth
    creationPolicy: Owner
  data:
    - secretKey: users
      remoteRef:
        key: infra/prometheus-basic-auth
        property: users
```
Apply it:
```bash
kubectl apply -f external-secret.yml

# Verify the secret is synced
kubectl get externalsecret prometheus-basic-auth -n prometheus
# Should show: STATUS: SecretSynced

kubectl get secret prometheus-basic-auth -n prometheus
# Should exist
```
3.5 Create Traefik Middleware
Create auth-middleware.yml:
```yaml
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: prometheus-auth
  namespace: prometheus
spec:
  basicAuth:
    secret: prometheus-basic-auth
    removeHeader: true  # Don't pass auth header to backend
```
Apply it:
```bash
kubectl apply -f auth-middleware.yml

# Verify
kubectl get middleware -n prometheus
```
3.6 Verify Authentication
```bash
# Test without credentials - should return 401
curl -I https://prometheus.example.com
# HTTP/2 401

# Test with credentials - should return 200
curl -I -u admin:YourSecurePassword123! https://prometheus.example.com
# HTTP/2 200
```
Success! Your Prometheus is now protected with Vault-backed authentication.
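Under the hood, curl’s `-u` flag simply sends an `Authorization: Basic` header containing the base64-encoded `user:password` pair, which Traefik checks against the htpasswd hash in the secret. A minimal sketch of what goes over the wire:

```python
# Build the Basic Auth header exactly as curl -u does.
import base64

creds = base64.b64encode(b"admin:YourSecurePassword123!").decode()
header = f"Authorization: Basic {creds}"
print(header.startswith("Authorization: Basic "))  # True
```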
3.7 Rotating Credentials
To rotate credentials:
```bash
# Generate new password hash
htpasswd -nb admin NewSecurePassword456!

# Update in Vault
kubectl exec -it vault-0 -n vault -- sh
vault login
vault kv put kv/infra/prometheus-basic-auth \
  users='admin:$apr1$new-hash...'
exit

# ExternalSecret will auto-sync within 5 minutes
# Or force immediate sync:
kubectl delete secret prometheus-basic-auth -n prometheus
# External Secrets Operator recreates it immediately
```
Step 4: Configure Alertmanager
Alertmanager handles deduplication, grouping, and routing of alerts.
Add to values.yml under alertmanagerFiles:
```yaml
alertmanagerFiles:
  alertmanager.yml:
    global:
      resolve_timeout: 5m

    route:
      group_by: ["alertname", "cluster", "service"]
      group_wait: 10s       # Wait before sending first notification
      group_interval: 10s   # Wait before sending batch of new alerts
      repeat_interval: 12h  # Re-send after this time
      receiver: "default"
      routes:
        # Critical alerts to immediate notification
        - match:
            severity: critical
          receiver: "critical"
          continue: true
        # Warnings to different channel
        - match:
            severity: warning
          receiver: "warning"

    receivers:
      - name: "default"
        # Default receiver configuration
      - name: "critical"
        # Example: Slack for critical alerts
        # slack_configs:
        #   - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        #     channel: '#critical-alerts'
        #     title: 'Critical Alert'
        #     text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}'
      - name: "warning"
        # Example: Slack for warnings
        # slack_configs:
        #   - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        #     channel: '#warnings'
        #     title: 'Warning Alert'

    # Inhibit rules: Suppress lower severity alerts when higher ones fire
    inhibit_rules:
      - source_match:
          severity: "critical"
        target_match:
          severity: "warning"
        equal: ["alertname", "instance"]
```
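The three timing knobs interact in a way that often trips people up, so here is a small sketch of when notifications fire for a single alert group under the values above (a simplified model, with times as plain seconds):

```python
# Simplified model of Alertmanager notification timing for one group.
GROUP_WAIT = 10              # delay before the first notification of a new group
GROUP_INTERVAL = 10          # delay before a batch containing *new* alerts
REPEAT_INTERVAL = 12 * 3600  # delay before re-sending an unchanged group

def first_notification(group_created_at):
    # First notification waits group_wait to collect related alerts.
    return group_created_at + GROUP_WAIT

def batch_with_new_alerts(last_sent_at):
    # New alerts joining an existing group wait group_interval.
    return last_sent_at + GROUP_INTERVAL

def repeat_unchanged(last_sent_at):
    # An unchanged firing group is re-sent after repeat_interval.
    return last_sent_at + REPEAT_INTERVAL

print(first_notification(0))      # 10
print(batch_with_new_alerts(10))  # 20
print(repeat_unchanged(10))       # 43210
```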
To send alerts to Slack:
- Create a Slack webhook URL
- Store it in Vault:
```bash
kubectl exec -it vault-0 -n vault -- sh
vault login
vault kv put kv/infra/alertmanager-slack \
  webhook_url='https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
exit
```
- Create ExternalSecret:
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: alertmanager-slack
  namespace: prometheus
spec:
  refreshInterval: 5m
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: alertmanager-slack
  data:
    - secretKey: webhook_url
      remoteRef:
        key: infra/alertmanager-slack
        property: webhook_url
```
- Reference in Alertmanager config:
```yaml
receivers:
  - name: "critical"
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/alertmanager-slack/webhook_url
        channel: "#critical-alerts"
```
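Note that `api_url_file` only works if the synced secret is actually mounted into the Alertmanager container. A hedged sketch, assuming your chart version exposes an `alertmanager.extraSecretMounts` value (the exact key can differ between chart versions, so check your chart’s values reference):

```yaml
alertmanager:
  extraSecretMounts:
    - name: alertmanager-slack        # volume name (illustrative)
      secretName: alertmanager-slack  # the secret created by ExternalSecret
      mountPath: /etc/alertmanager/secrets/alertmanager-slack
      readOnly: true
```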
Step 5: Implement Network Policies for Zero-Trust Security
Network policies ensure pods can only communicate with authorized services.
5.1 Prometheus Server Network Policy
Create network-policy.yml:
```yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-server-network-policy
  namespace: prometheus
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow from Grafana
    - from:
        - namespaceSelector:
            matchLabels:
              name: grafana
      ports:
        - protocol: TCP
          port: 9090
    # Allow from Traefik ingress
    - from:
        - namespaceSelector:
            matchLabels:
              name: traefik
      ports:
        - protocol: TCP
          port: 9090
    # Allow internal communication
    - from:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 9090
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
    # Allow Kubernetes API
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443
    # Allow scraping exporters
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus-node-exporter
      ports:
        - protocol: TCP
          port: 9100
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: kube-state-metrics
      ports:
        - protocol: TCP
          port: 8080
    # Allow Alertmanager
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus-alertmanager
      ports:
        - protocol: TCP
          port: 9093
    # Allow scraping pods across namespaces
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 9090
```
5.2 Apply Network Policies
```bash
kubectl apply -f network-policy.yml

# Verify
kubectl get networkpolicy -n prometheus
```
Important: Ensure your namespace has the correct labels:
```bash
kubectl label namespace grafana name=grafana
kubectl label namespace traefik name=traefik
kubectl label namespace kube-system name=kube-system
```
Step 6: Verification and Testing
6.1 Verify All Components
```bash
# Check pods
kubectl get pods -n prometheus

# Check PVCs
kubectl get pvc -n prometheus

# Check ingress
kubectl get ingress -n prometheus

# Check secrets
kubectl get secrets -n prometheus

# Check middleware
kubectl get middleware -n prometheus
```
6.2 Access Prometheus UI
Navigate to https://prometheus.example.com and login with your credentials.
6.3 Verify Metrics Collection
Run these queries in the Prometheus UI:
```promql
# Check all targets are up
up

# Node CPU usage
instance:node_cpu_utilization:rate5m

# Pod count by namespace
count by(namespace) (kube_pod_info)

# Memory usage
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```
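The same queries can be run against Prometheus’s HTTP API instead of the UI. A sketch of building the query URL (the hostname is the example domain used throughout this guide; send the request with your basic-auth credentials):

```python
# Build a /api/v1/query URL for the Prometheus HTTP API.
from urllib.parse import urlencode

def prom_query_url(base, promql):
    # urlencode handles PromQL characters that need escaping.
    return f"{base}/api/v1/query?" + urlencode({"query": promql})

url = prom_query_url("https://prometheus.example.com", "up")
print(url)  # https://prometheus.example.com/api/v1/query?query=up
```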
6.4 Test Alerting
Create a test alert:
```bash
# Trigger a test alert by creating a crashlooping pod
kubectl run crashpod --image=busybox -- /bin/sh -c "exit 1"

# Check alerts in Prometheus UI
# Navigate to: Alerts tab

# Check Alertmanager
kubectl port-forward -n prometheus svc/prometheus-alertmanager 9093:9093
# Open: http://localhost:9093
```
Troubleshooting Guide
Issue 1: Authentication Not Working
Symptoms: Getting 401 even with correct credentials
Solutions:
```bash
# Check middleware exists
kubectl get middleware prometheus-auth -n prometheus

# Check secret content
kubectl get secret prometheus-basic-auth -n prometheus -o jsonpath='{.data.users}' | base64 -d

# Check ExternalSecret status
kubectl describe externalsecret prometheus-basic-auth -n prometheus

# Verify Vault has credentials
kubectl exec -it vault-0 -n vault -- vault kv get kv/infra/prometheus-basic-auth

# Check ingress annotations
kubectl get ingress -n prometheus -o yaml | grep middleware
```
Issue 2: High Memory Usage
Symptoms: Prometheus pod getting OOMKilled
Solutions:
```bash
# Check current usage
kubectl top pods -n prometheus

# Check cardinality (high cardinality = high memory)
kubectl port-forward -n prometheus svc/prometheus-server 9090:80
curl http://localhost:9090/api/v1/status/tsdb | jq .

# Solutions:
# 1. Increase memory limits in values.yml
# 2. Reduce retention period
# 3. Drop high-cardinality metrics with relabel configs
```
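To find the offenders quickly, the `/api/v1/status/tsdb` response includes a `seriesCountByMetricName` list you can sort. A sketch against a canned response (the numbers are made up; feed it the real JSON from the curl call above):

```python
# Rank metrics by series count from a /api/v1/status/tsdb response.
tsdb_status = {
    "data": {
        "seriesCountByMetricName": [
            {"name": "go_gc_duration_seconds", "value": 12000},
            {"name": "up", "value": 300},
        ]
    }
}
ranked = sorted(tsdb_status["data"]["seriesCountByMetricName"],
                key=lambda m: m["value"], reverse=True)
print(ranked[0]["name"])  # go_gc_duration_seconds
```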
Issue 3: Targets Not Scraped
Symptoms: Targets showing as down in Prometheus UI
Solutions:
```bash
# Check service discovery
kubectl port-forward -n prometheus svc/prometheus-server 9090:80
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health!="up")'

# Check RBAC permissions
kubectl get clusterrole prometheus-server -o yaml

# Check network policies
kubectl get networkpolicy -n prometheus -o yaml

# Test connectivity
kubectl exec -it prometheus-server-0 -n prometheus -- wget -O- http://prometheus-node-exporter:9100/metrics
```
Issue 4: PVC Not Binding
Symptoms: Pod stuck in Pending, PVC not bound
Solutions:
```bash
# Check PVC status
kubectl get pvc -n prometheus
kubectl describe pvc <pvc-name> -n prometheus

# Check storage class
kubectl get storageclass

# Verify storage provisioner is running
kubectl get pods -n kube-system | grep provisioner

# Manual fix: Delete PVC and let it recreate
kubectl delete pvc -n prometheus <pvc-name>
```
Issue 5: Alertmanager Not Receiving Alerts
Symptoms: Alerts firing in Prometheus but not in Alertmanager
Solutions:
```bash
# Check Prometheus alerting config
kubectl port-forward -n prometheus svc/prometheus-server 9090:80
curl http://localhost:9090/api/v1/alertmanagers

# Check Alertmanager status
kubectl port-forward -n prometheus svc/prometheus-alertmanager 9093:9093
curl http://localhost:9093/api/v1/status

# Check alert rules
curl http://localhost:9090/api/v1/rules

# Check Alertmanager logs
kubectl logs -n prometheus prometheus-alertmanager-0 -f
```
Issue 6: External Secrets Not Syncing
Symptoms: Secret not created by ExternalSecret
Solutions:
```bash
# Check ExternalSecret status
kubectl describe externalsecret prometheus-basic-auth -n prometheus

# Check ClusterSecretStore
kubectl get clustersecretstore vault-backend
kubectl describe clustersecretstore vault-backend

# Check External Secrets Operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets

# Test Vault connectivity
EXTERNAL_SECRETS_POD=$(kubectl get pod -n external-secrets -l app.kubernetes.io/name=external-secrets -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n external-secrets $EXTERNAL_SECRETS_POD -- wget -O- http://vault.vault:8200/v1/sys/health
```
Security Best Practices
1. Credential Management
✅ DO:
- Store all credentials in Vault
- Use External Secrets Operator for K8s secret sync
- Rotate credentials regularly (quarterly minimum)
- Use strong, randomly generated passwords
❌ DON’T:
- Commit credentials to git
- Use default passwords
- Share credentials across environments
- Store credentials in plain text ConfigMaps
2. Network Security
✅ DO:
- Implement network policies for all components
- Use namespace isolation
- Restrict egress to necessary destinations
- Label namespaces for policy enforcement
❌ DON’T:
- Allow unrestricted pod-to-pod communication
- Expose Alertmanager publicly without auth
- Allow broad egress to internet
3. Resource Management
✅ DO:
- Set resource requests and limits
- Use PodDisruptionBudgets
- Monitor resource usage
- Scale based on metrics
❌ DON’T:
- Run without resource limits
- Ignore OOMKilled pods
- Over-provision resources
4. Data Retention
✅ DO:
- Use persistent volumes
- Configure appropriate retention
- Implement backup strategy
- Monitor storage usage
❌ DON’T:
- Rely on ephemeral storage
- Set unlimited retention
- Ignore disk space alerts
5. Alert Configuration
✅ DO:
- Use severity labels
- Implement inhibit rules
- Test alerts regularly
- Document runbooks
❌ DON’T:
- Alert on everything
- Ignore alert fatigue
- Skip testing alerts
- Forget to document remediation
Performance Tuning
Tune Query Performance
Limit concurrent and long-running queries via extra server arguments:

```yaml
server:
  extraArgs:
    query.max-concurrency: 20
    query.timeout: 2m
    query.lookback-delta: 5m
```

Tune TSDB Block Duration
Fix the minimum and maximum block duration to disable local compaction (commonly required when an external system ships and compacts blocks):

```yaml
server:
  extraArgs:
    storage.tsdb.min-block-duration: 2h
    storage.tsdb.max-block-duration: 2h
```
Reduce Cardinality
Use metric relabeling to drop metrics you don’t need, e.g. Go runtime internals:

```yaml
server:
  extraScrapeConfigs: |
    - job_name: "kubernetes-pods"
      metric_relabel_configs:
        # Drop Go runtime metrics by name
        - source_labels: [__name__]
          regex: "go_.*"
          action: drop
```
Integration with Grafana
Prometheus integrates seamlessly with Grafana for visualization.
Add Prometheus as Datasource
In Grafana’s datasource configuration:
```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.prometheus:80
    isDefault: true
    jsonData:
      timeInterval: 30s
```
Recommended Dashboards
Import these community dashboards:
- 1860 - Node Exporter Full (comprehensive node metrics)
- 7249 - Kubernetes Cluster Monitoring
- 6417 - Kubernetes Pod Monitoring
- 315 - Kubernetes Cluster (Prometheus)
Monitoring Your Monitoring
Don’t forget to monitor Prometheus itself!
Key Metrics to Watch
```promql
# Prometheus memory usage
process_resident_memory_bytes

# Prometheus CPU usage
rate(process_cpu_seconds_total[5m])

# Time series count (cardinality)
prometheus_tsdb_head_series

# Query duration
histogram_quantile(0.99, rate(prometheus_http_request_duration_seconds_bucket[5m]))

# Sample ingestion rate
rate(prometheus_tsdb_head_samples_appended_total[5m])
```
Consider a separate monitoring cluster to monitor your production Prometheus:
```text
┌─────────────────┐
│  Prod Cluster   │
│  - Prometheus   │───┐
│  - Applications │   │
└─────────────────┘   │
                      │ Federation or
                      │ Remote Write
                      ▼
              ┌──────────────┐
              │ Meta Monitor │
              │ - Prometheus │
              │ - Grafana    │
              └──────────────┘
```
Backup and Disaster Recovery
Manual Backup
```bash
# Create snapshot (requires the server to run with --web.enable-admin-api)
PROMETHEUS_POD=$(kubectl get pod -n prometheus -l app.kubernetes.io/name=prometheus-server -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n prometheus $PROMETHEUS_POD -- \
  curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot

# Copy snapshot
kubectl cp prometheus/$PROMETHEUS_POD:/data/snapshots/<timestamp> ./prometheus-backup
```
Automated Backup with Velero
```bash
# Install Velero
velero install --provider aws --bucket prometheus-backups

# Create backup schedule
velero schedule create prometheus-daily \
  --schedule="0 2 * * *" \
  --include-namespaces prometheus \
  --ttl 720h

# Restore from backup
velero restore create --from-backup prometheus-daily-<timestamp>
```
Configuration Backup
Always backup your configuration:
```bash
# Backup all configuration
kubectl get -n prometheus \
  configmap,secret,externalsecret,middleware,ingress,pvc \
  -o yaml > prometheus-config-backup.yaml

# Store in git (excluding secrets!)
git add helmfile.yaml values.yml external-secret.yml auth-middleware.yml network-policy.yml
git commit -m "Backup Prometheus configuration"
```
Scaling Considerations
Vertical Scaling
For larger clusters, increase resources:
```yaml
server:
  resources:
    limits:
      cpu: 4000m
      memory: 8Gi
    requests:
      cpu: 1000m
      memory: 4Gi
  persistentVolume:
    size: 100Gi
  retention: "90d"
  retentionSize: "95GB"
```
Horizontal Scaling with Federation
For multi-cluster deployments, use federation:
```yaml
# On central Prometheus
server:
  extraScrapeConfigs: |
    - job_name: 'federate-clusters'
      scrape_interval: 30s
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
          - '{job=~".+"}'
      static_configs:
        - targets:
            - 'prometheus-cluster1.example.com'
            - 'prometheus-cluster2.example.com'
```
Remote Storage
For long-term retention, use remote storage:
```yaml
server:
  remoteWrite:
    - url: "http://cortex.monitoring:9009/api/prom/push"
      queue_config:
        capacity: 10000
        max_shards: 200
        min_shards: 1
        max_samples_per_send: 500
        batch_send_deadline: 5s
```
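A quick sanity check on those queue settings: even if every one of the 200 shards only flushed its 500-sample batches at the 5-second deadline, the queue would still move a substantial volume:

```python
# Back-of-envelope throughput for the remote-write queue settings above,
# assuming each shard flushes a full batch only at the send deadline.
max_shards = 200
max_samples_per_send = 500
batch_send_deadline_s = 5

samples_per_sec = max_shards * max_samples_per_send / batch_send_deadline_s
print(samples_per_sec)  # 20000.0
```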
Cost Optimization
Reduce Storage Costs
- Tune retention:
```yaml
retention: "15d"       # Instead of 30d
retentionSize: "20GB"  # Instead of 45GB
```
- Drop unnecessary metrics:
```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "(go_.*|process_.*)"
    action: drop
```
- Use remote storage with compression
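To sanity-check a retention choice against your disk, the usual sizing rule of thumb is `needed_disk ≈ retention_time × ingested_samples_per_second × bytes_per_sample`, with Prometheus samples typically compressing to 1-2 bytes. A sketch with an assumed ingestion rate:

```python
# Rough disk estimate: retention_seconds * samples/s * bytes per sample.
retention_days = 15
samples_per_sec = 10_000  # assumed rate; read yours from the ingestion query above
bytes_per_sample = 2      # conservative upper end of typical compression

needed_gib = retention_days * 86400 * samples_per_sec * bytes_per_sample / 2**30
print(round(needed_gib, 1))  # 24.1
```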
Reduce Compute Costs
- Use recording rules for expensive queries
- Adjust scrape intervals:

```yaml
global:
  scrape_interval: 60s # Instead of 30s for non-critical metrics
```
- Right-size resources based on actual usage
Conclusion
You now have a production-ready Prometheus deployment with:
✅ Secure authentication via Vault and Traefik
✅ High availability with StatefulSets
✅ Persistent storage for data retention
✅ Comprehensive alerting rules
✅ Network policies for zero-trust security
✅ Resource limits to prevent exhaustion
✅ Integration with Grafana
✅ Backup and disaster recovery strategies
Next Steps
- Configure Alertmanager receivers (Slack, PagerDuty, email)
- Create Grafana dashboards for your applications
- Set up ServiceMonitors for application metrics
- Implement federation for multi-cluster monitoring
- Configure remote storage for long-term retention
- Test disaster recovery procedures
Key Takeaways
- Security is not optional: Always implement authentication and network policies
- Persistence matters: Use persistent volumes to prevent data loss
- Monitor your monitoring: Set up alerts for Prometheus itself
- Start simple, iterate: Begin with basic alerts and refine based on experience
- Document everything: Runbooks save time during incidents
- Test regularly: Verify backups and disaster recovery procedures
About the Author
This guide is based on real-world production deployments at scale. For questions or suggestions, feel free to reach out!
Tags: #kubernetes #prometheus #monitoring #observability #devops #security #vault #traefik #helm #production
Last Updated: January 30, 2026