Production-Ready Prometheus on Kubernetes: Complete Guide with Vault Authentication and High Availability

Comprehensive guide to deploying Prometheus on Kubernetes with HashiCorp Vault authentication, Traefik middleware, network policies, alerting rules, and production-grade security configurations

Introduction

Prometheus has become the de facto standard for monitoring Kubernetes clusters and cloud-native applications. While getting Prometheus running is straightforward, setting it up for production with proper security, authentication, persistence, and alerting requires careful planning.

In this comprehensive guide, I’ll walk you through deploying a production-ready Prometheus stack on Kubernetes with:

  • Secure authentication using HashiCorp Vault and Traefik middleware
  • High availability configuration with StatefulSets
  • Persistent storage for long-term metrics retention
  • Comprehensive alerting rules for infrastructure and application monitoring
  • Network policies for zero-trust security
  • Resource management to prevent resource exhaustion
  • Best practices for production deployments

By the end of this guide, you’ll have a robust monitoring solution that’s secure, scalable, and ready for production workloads.

Why Production-Ready Prometheus Matters

Many teams start with a basic Prometheus deployment, only to encounter issues in production:

  • Data loss from lack of persistent storage
  • Security vulnerabilities from exposed, unauthenticated endpoints
  • Resource exhaustion from missing resource limits
  • Alert fatigue from poorly configured alerting rules
  • Network security gaps without proper network policies

This guide addresses all these concerns with battle-tested configurations.

Architecture Overview

Our Prometheus deployment consists of several components:

┌─────────────────────────────────────────────────────────────┐
│                     Internet / Users                         │
└───────────────────────────┬─────────────────────────────────┘
                            │ HTTPS (TLS)
                ┌───────────────────────┐
                │  Traefik Ingress      │
                │  + Basic Auth         │
                │  + TLS Termination    │
                └───────────┬───────────┘
            ┌───────────────────────────────┐
            │   Prometheus Server           │
            │   - Metrics Collection        │
            │   - Query Engine              │
│   - TSDB Storage              │
            │   - Alert Evaluation          │
            └───────┬──────────┬────────────┘
                    │          │
        ┌───────────┘          └────────────┐
        │                                   │
        ▼                                   ▼
┌───────────────┐                  ┌──────────────┐
│ Alertmanager  │                  │   Exporters  │
│ - Alert       │                  │              │
│   Routing     │                  │ • Node       │
│ - Dedup       │                  │ • Kube State │
│ - Silencing   │                  │ • Pushgateway│
└───────┬───────┘                  └──────────────┘
        │
        ▼
┌────────────────┐
│ Alert Channels │
│ (Slack, Email) │
└────────────────┘

Components

  1. Prometheus Server: Core time-series database and query engine
  2. Alertmanager: Alert deduplication, grouping, and routing
  3. Node Exporter: Hardware and OS metrics (DaemonSet on every node)
  4. Kube State Metrics: Kubernetes API object metrics
  5. Pushgateway: Metrics collection for short-lived jobs
  6. Traefik Middleware: Authentication layer
  7. External Secrets: Secure credential management via Vault
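
Each component maps to its own block in the Helm chart values, so anything you do not need can be switched off. A minimal sketch using the same keys as the full values file later in this guide:

alertmanager:
  enabled: true
nodeExporter:
  enabled: true
kubeStateMetrics:
  enabled: true
pushgateway:
  enabled: true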

Prerequisites

Before we begin, ensure you have:

Infrastructure Requirements

  • Kubernetes cluster (v1.20+)
  • kubectl configured and authenticated
  • Helm v3 or Helmfile installed
  • Storage class for persistent volumes (50Gi for Prometheus, 2Gi for Alertmanager)

Required Add-ons

  • Traefik ingress controller
  • cert-manager with Let’s Encrypt issuer configured
  • HashiCorp Vault deployed and configured
  • External Secrets Operator installed

Tools

# Install required CLI tools
brew install kubernetes-cli helm helmfile apache2-utils vault

# Verify versions
kubectl version --client
helm version
helmfile version

Step 1: Deploy Prometheus with Helmfile

We’ll use Helmfile for declarative, version-controlled deployments.

1.1 Create Helmfile Configuration

Create helmfile.yaml:

helmDefaults:
  createNamespace: true
  timeout: 300
  wait: false

repositories:
  - name: prometheus-community
    url: https://prometheus-community.github.io/helm-charts

releases:
  - name: prometheus
    namespace: prometheus
    chart: prometheus-community/prometheus
    version: "28.6.0"
    values:
      - ./values.yml

1.2 Create Comprehensive Values Configuration

Create values.yml with production-ready settings:

# Prometheus Server Configuration
server:
  baseUrl: "https://prometheus.example.com"

  replicaCount: 1
  statefulSet:
    enabled: true # Use StatefulSet for stable network identity

  # Resource Management - Critical for stability
  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
    requests:
      cpu: 500m
      memory: 1Gi

  # Persistent Storage - CRITICAL for data retention
  persistentVolume:
    enabled: true
    size: 50Gi
    # storageClass: ""  # Use default storage class

  # Data Retention
  retention: "30d" # Keep metrics for 30 days
  retentionSize: "45GB" # Max storage size

  # Security Context - Run as non-root
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    runAsGroup: 65534
    fsGroup: 65534

  # High Availability
  podDisruptionBudget:
    enabled: true
    minAvailable: 1

  # Anti-affinity for spreading pods across nodes
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - prometheus-server
            topologyKey: kubernetes.io/hostname

  # Ingress Configuration
  ingress:
    enabled: true
    ingressClassName: traefik
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod-issuer
      traefik.ingress.kubernetes.io/router.middlewares: prometheus-prometheus-auth@kubernetescrd
    hosts:
      - prometheus.example.com
    tls:
      - secretName: prometheus-tls
        hosts:
          - prometheus.example.com

  # Global Prometheus Configuration
  global:
    scrape_interval: 30s
    scrape_timeout: 10s
    evaluation_interval: 30s
    external_labels:
      cluster: "production"
      environment: "production"

  # Enable ServiceMonitor support
  serviceMonitor:
    enabled: true

# Alertmanager Configuration
alertmanager:
  enabled: true
  replicaCount: 2 # HA setup

  resources:
    limits:
      cpu: 200m
      memory: 256Mi
    requests:
      cpu: 50m
      memory: 128Mi

  persistentVolume:
    enabled: true
    size: 2Gi

  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    runAsGroup: 65534
    fsGroup: 65534

  podDisruptionBudget:
    enabled: true
    minAvailable: 1

# Node Exporter - Metrics from every node
nodeExporter:
  enabled: true

  resources:
    limits:
      cpu: 200m
      memory: 50Mi
    requests:
      cpu: 50m
      memory: 30Mi

  # Host networking for accurate metrics
  hostNetwork: true
  hostPID: true

# Kube State Metrics - Kubernetes object metrics
kubeStateMetrics:
  enabled: true

  resources:
    limits:
      cpu: 200m
      memory: 256Mi
    requests:
      cpu: 50m
      memory: 128Mi

# Pushgateway - For batch jobs
pushgateway:
  enabled: true

  resources:
    limits:
      cpu: 200m
      memory: 256Mi
    requests:
      cpu: 50m
      memory: 128Mi

  persistentVolume:
    enabled: true
    size: 2Gi

1.3 Deploy Prometheus

# Navigate to your prometheus directory
cd k8s/releases/prometheus

# Preview what will be deployed
helmfile template

# Deploy Prometheus
helmfile sync

# Watch the deployment
kubectl get pods -n prometheus -w

You should see all components starting up:

NAME                                             READY   STATUS    RESTARTS   AGE
prometheus-server-0                              1/1     Running   0          2m
prometheus-alertmanager-0                        1/1     Running   0          2m
prometheus-alertmanager-1                        1/1     Running   0          2m
prometheus-node-exporter-xxxxx                   1/1     Running   0          2m
prometheus-kube-state-metrics-xxxxxxxxx-xxxxx    1/1     Running   0          2m
prometheus-pushgateway-xxxxxxxxx-xxxxx           1/1     Running   0          2m

Step 2: Configure Production-Grade Alert Rules

Alert rules are the backbone of proactive monitoring. Let’s configure comprehensive alerts.

2.1 Node and Infrastructure Alerts

Add to your values.yml under serverFiles:

serverFiles:
  alerting_rules.yml:
    groups:
      - name: node_alerts
        rules:
          # Critical: Node completely down
          - alert: NodeDown
            expr: up{job="prometheus-node-exporter"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.instance }} is down"
              description: "Node {{ $labels.instance }} has been down for more than 5 minutes"

          # Warning: High CPU usage
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is above 80% for 10 minutes (current: {{ $value | humanize }}%)"

          # Warning: High memory usage
          - alert: HighMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is above 85% (current: {{ $value | humanize }}%)"

          # Warning: Low disk space
          - alert: DiskSpaceLow
            expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Disk space is below 15% (current: {{ $value | humanize }}%)"

          # Critical: Very low disk space
          - alert: DiskSpaceCritical
            expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Critical disk space on {{ $labels.instance }}"
              description: "Disk space is below 10% (current: {{ $value | humanize }}%)"

2.2 Kubernetes Alerts

- name: kubernetes_alerts
  rules:
    # Critical: Pod crash looping
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"

    # Warning: Pod not ready
    - alert: PodNotReady
      expr: kube_pod_status_phase{phase!~"Running|Succeeded"} == 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
        description: "Pod has been in {{ $labels.phase }} state for more than 15 minutes"

    # Warning: Deployment replicas mismatch
    - alert: DeploymentReplicasMismatch
      expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has mismatched replicas"
        description: "Available replicas have not matched the desired count for more than 15 minutes"

    # Warning: PersistentVolume space low
    - alert: PersistentVolumeSpaceLow
      expr: (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 15
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} space low"
        description: "PVC has less than 15% space available (current: {{ $value | humanize }}%)"

2.3 Prometheus Self-Monitoring Alerts

- name: prometheus_alerts
  rules:
    # Critical: Config reload failed
    - alert: PrometheusConfigReloadFailed
      expr: prometheus_config_last_reload_successful == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Prometheus config reload failed"
        description: "Prometheus {{ $labels.instance }} config reload has failed"

    # Warning: TSDB compactions failing
    - alert: PrometheusTSDBCompactionsFailed
      expr: rate(prometheus_tsdb_compactions_failed_total[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Prometheus TSDB compactions failing"

    # Critical: Not connected to Alertmanager
    - alert: PrometheusNotConnectedToAlertmanager
      expr: prometheus_notifications_alertmanagers_discovered < 1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Prometheus not connected to Alertmanager"

2.4 Recording Rules for Performance

Recording rules pre-compute frequently used queries:

recording_rules.yml:
  groups:
    - name: node_recording_rules
      interval: 30s
      rules:
        - record: instance:node_cpu_utilization:rate5m
          expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

        - record: instance:node_memory_utilization:ratio
          expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes

        - record: instance:node_disk_utilization:ratio
          expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes
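
Once a recording rule exists, alerts can reference the pre-computed series instead of re-evaluating the raw expression on every cycle. For example, the HighCPUUsage alert from section 2.1 could be rewritten as follows (a sketch; keep only one variant to avoid duplicate alerts):

    - alert: HighCPUUsage
      expr: instance:node_cpu_utilization:rate5m > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on {{ $labels.instance }}"
        description: "CPU usage is above 80% for 10 minutes (current: {{ $value | humanize }}%)"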

Step 3: Secure Prometheus with Vault-Backed Authentication

One of the biggest security mistakes is exposing Prometheus without authentication. Let’s fix that with a production-grade solution.

3.1 The Security Architecture

We’ll implement defense-in-depth with three layers:

  1. HashiCorp Vault: Secure credential storage
  2. External Secrets Operator: Sync credentials to Kubernetes
  3. Traefik Middleware: Enforce authentication
┌──────────────┐
│  Vault       │  ← Credentials stored securely
└──────┬───────┘
       │ External Secrets Operator syncs
┌──────────────┐
│  K8s Secret  │  ← Auto-synced every 5 minutes
└──────┬───────┘
       │ Referenced by Traefik Middleware
┌──────────────┐
│  Middleware  │  ← Enforces Basic Auth
└──────┬───────┘
       │ Applied to Ingress
┌──────────────┐
│  Prometheus  │  ← Protected!
└──────────────┘
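
The ExternalSecret created in step 3.4 references a ClusterSecretStore named vault-backend. If one does not exist yet, a minimal sketch using Vault's Kubernetes auth method looks roughly like this (the server address, KV mount, role, and service account are assumptions; adjust them to your Vault setup):

apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "http://vault.vault:8200" # In-cluster Vault address (assumed)
      path: "kv" # KV mount that holds the kv/infra/* secrets
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes" # Vault Kubernetes auth mount (assumed)
          role: "external-secrets" # Vault role bound to the operator's ServiceAccount (assumed)
          serviceAccountRef:
            name: external-secrets
            namespace: external-secrets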

3.2 Generate Authentication Credentials

# Generate htpasswd hash
htpasswd -nb admin 'YourSecurePassword123!'
# Output: admin:$apr1$xyz123abc$defgh456...

# Copy the full output, you'll need it next

3.3 Store Credentials in Vault

# Connect to Vault
kubectl exec -it vault-0 -n vault -- sh

# Login to Vault
vault login
# Enter your root token

# Store the credentials
vault kv put kv/infra/prometheus-basic-auth \
    users='admin:$apr1$xyz123abc$defgh456...'

# Verify
vault kv get kv/infra/prometheus-basic-auth

# Exit
exit

3.4 Create ExternalSecret Resource

Create external-secret.yml:

apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: prometheus-basic-auth
  namespace: prometheus
spec:
  refreshInterval: 5m # Sync from Vault every 5 minutes
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: prometheus-basic-auth
    creationPolicy: Owner
  data:
    - secretKey: users
      remoteRef:
        key: infra/prometheus-basic-auth
        property: users

Apply it:

kubectl apply -f external-secret.yml

# Verify the secret is synced
kubectl get externalsecret prometheus-basic-auth -n prometheus
# Should show: STATUS: SecretSynced

kubectl get secret prometheus-basic-auth -n prometheus
# Should exist

3.5 Create Traefik Middleware

Create auth-middleware.yml:

---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: prometheus-auth
  namespace: prometheus
spec:
  basicAuth:
    secret: prometheus-basic-auth
    removeHeader: true # Don't pass auth header to backend

Apply it:

kubectl apply -f auth-middleware.yml

# Verify
kubectl get middleware -n prometheus

3.6 Verify Authentication

# Test without credentials - should return 401
curl -I https://prometheus.example.com
# HTTP/2 401

# Test with credentials - should return 200
curl -I -u 'admin:YourSecurePassword123!' https://prometheus.example.com
# HTTP/2 200

Success! Your Prometheus is now protected with Vault-backed authentication.

3.7 Rotating Credentials

To rotate credentials:

# Generate new password hash
htpasswd -nb admin 'NewSecurePassword456!'

# Update in Vault
kubectl exec -it vault-0 -n vault -- sh
vault login
vault kv put kv/infra/prometheus-basic-auth \
    users='admin:$apr1$new-hash...'
exit

# ExternalSecret will auto-sync within 5 minutes
# Or force immediate sync:
kubectl delete secret prometheus-basic-auth -n prometheus
# External Secrets Operator recreates it immediately

Step 4: Configure Alertmanager for Alert Routing

Alertmanager handles deduplication, grouping, and routing of alerts.

4.1 Configure Alert Routing

Add to values.yml under alertmanagerFiles:

alertmanagerFiles:
  alertmanager.yml:
    global:
      resolve_timeout: 5m

    route:
      group_by: ["alertname", "cluster", "service"]
      group_wait: 10s # Wait before sending first notification
      group_interval: 10s # Wait before sending batch of new alerts
      repeat_interval: 12h # Re-send after this time
      receiver: "default"
      routes:
        # Critical alerts to immediate notification
        - match:
            severity: critical
          receiver: "critical"
          continue: true
        # Warnings to different channel
        - match:
            severity: warning
          receiver: "warning"

    receivers:
      - name: "default"
        # Default receiver configuration

      - name: "critical"
        # Example: Slack for critical alerts
        # slack_configs:
        #   - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        #     channel: '#critical-alerts'
        #     title: 'Critical Alert'
        #     text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}'

      - name: "warning"
        # Example: Slack for warnings
        # slack_configs:
        #   - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        #     channel: '#warnings'
        #     title: 'Warning Alert'

    # Inhibit rules: Suppress lower severity alerts when higher ones fire
    inhibit_rules:
      - source_match:
          severity: "critical"
        target_match:
          severity: "warning"
        equal: ["alertname", "instance"]

4.2 Configure Slack Integration (Example)

To send alerts to Slack:

  1. Create a Slack webhook URL
  2. Store it in Vault:
kubectl exec -it vault-0 -n vault -- sh
vault login
vault kv put kv/infra/alertmanager-slack \
    webhook_url='https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
exit
  3. Create ExternalSecret:
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: alertmanager-slack
  namespace: prometheus
spec:
  refreshInterval: 5m
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: alertmanager-slack
  data:
    - secretKey: webhook_url
      remoteRef:
        key: infra/alertmanager-slack
        property: webhook_url
  4. Reference in Alertmanager config:
receivers:
  - name: "critical"
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/alertmanager-slack/webhook_url
        channel: "#critical-alerts"
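
For api_url_file to resolve, the alertmanager-slack secret must be mounted into the Alertmanager pod at that path. With this chart that is typically done through extraSecretMounts in values.yml; a sketch (verify the exact key against your chart version):

alertmanager:
  extraSecretMounts:
    - name: alertmanager-slack
      mountPath: /etc/alertmanager/secrets/alertmanager-slack
      secretName: alertmanager-slack
      readOnly: true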

Step 5: Implement Network Policies for Zero-Trust Security

Network policies ensure pods can only communicate with authorized services.

5.1 Prometheus Server Network Policy

Create network-policy.yml:

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-server-network-policy
  namespace: prometheus
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus-server
  policyTypes:
    - Ingress
    - Egress

  ingress:
    # Allow from Grafana
    - from:
        - namespaceSelector:
            matchLabels:
              name: grafana
      ports:
        - protocol: TCP
          port: 9090

    # Allow from Traefik ingress
    - from:
        - namespaceSelector:
            matchLabels:
              name: traefik
      ports:
        - protocol: TCP
          port: 9090

    # Allow internal communication
    - from:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 9090

  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53

    # Allow Kubernetes API
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443

    # Allow scraping exporters
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus-node-exporter
      ports:
        - protocol: TCP
          port: 9100

    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: kube-state-metrics
      ports:
        - protocol: TCP
          port: 8080

    # Allow Alertmanager
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus-alertmanager
      ports:
        - protocol: TCP
          port: 9093

    # Allow scraping pods across namespaces
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 9090

5.2 Apply Network Policies

kubectl apply -f network-policy.yml

# Verify
kubectl get networkpolicy -n prometheus

Important: Ensure your namespaces have the correct labels:

kubectl label namespace grafana name=grafana
kubectl label namespace traefik name=traefik
kubectl label namespace kube-system name=kube-system
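
The same zero-trust approach should extend to the other components. As an example, a policy that only lets the Prometheus server and other Alertmanager replicas (for cluster gossip) reach Alertmanager might look like this (a sketch; the pod labels are assumed to match the chart defaults used above):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-alertmanager-network-policy
  namespace: prometheus
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus-alertmanager
  policyTypes:
    - Ingress
  ingress:
    # Alert delivery from the Prometheus server
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus-server
      ports:
        - protocol: TCP
          port: 9093
    # Gossip between Alertmanager replicas
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus-alertmanager
      ports:
        - protocol: TCP
          port: 9094
        - protocol: UDP
          port: 9094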

Step 6: Verification and Testing

6.1 Verify All Components

# Check pods
kubectl get pods -n prometheus

# Check PVCs
kubectl get pvc -n prometheus

# Check ingress
kubectl get ingress -n prometheus

# Check secrets
kubectl get secrets -n prometheus

# Check middleware
kubectl get middleware -n prometheus

6.2 Access Prometheus UI

Navigate to https://prometheus.example.com and login with your credentials.

6.3 Verify Metrics Collection

Run these queries in the Prometheus UI:

# Check all targets are up
up

# Node CPU usage
instance:node_cpu_utilization:rate5m

# Pod count by namespace
count by(namespace) (kube_pod_info)

# Memory usage
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

6.4 Test Alerting

Create a test alert:

# Trigger a test alert by creating a crashlooping pod
kubectl run crashpod --image=busybox -- /bin/sh -c "exit 1"

# Check alerts in Prometheus UI
# Navigate to: Alerts tab

# Check Alertmanager
kubectl port-forward -n prometheus svc/prometheus-alertmanager 9093:9093
# Open: http://localhost:9093
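
A crash-looping pod exercises the kubernetes_alerts group, but the full Prometheus -> Alertmanager -> receiver path can also be verified with a rule that always fires. A temporary sketch to add under serverFiles (remove it once the notification arrives):

      - name: test_alerts
        rules:
          - alert: AlwaysFiring
            expr: vector(1) # Constant expression, starts firing immediately
            labels:
              severity: warning
            annotations:
              summary: "Test alert to verify alert routing end to end"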

Troubleshooting Guide

Issue 1: Authentication Not Working

Symptoms: Getting 401 even with correct credentials

Solutions:

# Check middleware exists
kubectl get middleware prometheus-auth -n prometheus

# Check secret content
kubectl get secret prometheus-basic-auth -n prometheus -o jsonpath='{.data.users}' | base64 -d

# Check ExternalSecret status
kubectl describe externalsecret prometheus-basic-auth -n prometheus

# Verify Vault has credentials
kubectl exec -it vault-0 -n vault -- vault kv get kv/infra/prometheus-basic-auth

# Check ingress annotations
kubectl get ingress -n prometheus -o yaml | grep middleware

Issue 2: High Memory Usage

Symptoms: Prometheus pod getting OOMKilled

Solutions:

# Check current usage
kubectl top pods -n prometheus

# Check cardinality (high cardinality = high memory)
kubectl port-forward -n prometheus svc/prometheus-server 9090:80
curl http://localhost:9090/api/v1/status/tsdb | jq .

# Solutions:
# 1. Increase memory limits in values.yml
# 2. Reduce retention period
# 3. Drop high-cardinality metrics with relabel configs

Issue 3: Targets Not Scraped

Symptoms: Targets showing as down in Prometheus UI

Solutions:

# Check service discovery
kubectl port-forward -n prometheus svc/prometheus-server 9090:80
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health!="up")'

# Check RBAC permissions
kubectl get clusterrole prometheus-server -o yaml

# Check network policies
kubectl get networkpolicy -n prometheus -o yaml

# Test connectivity
kubectl exec -it prometheus-server-0 -n prometheus -- wget -O- http://prometheus-node-exporter:9100/metrics

Issue 4: PVC Not Binding

Symptoms: Pod stuck in Pending, PVC not bound

Solutions:

# Check PVC status
kubectl get pvc -n prometheus
kubectl describe pvc <pvc-name> -n prometheus

# Check storage class
kubectl get storageclass

# Verify storage provisioner is running
kubectl get pods -n kube-system | grep provisioner

# Manual fix: Delete PVC and let it recreate
kubectl delete pvc -n prometheus <pvc-name>

Issue 5: Alertmanager Not Receiving Alerts

Symptoms: Alerts firing in Prometheus but not in Alertmanager

Solutions:

# Check Prometheus alerting config
kubectl port-forward -n prometheus svc/prometheus-server 9090:80
curl http://localhost:9090/api/v1/alertmanagers

# Check Alertmanager status
kubectl port-forward -n prometheus svc/prometheus-alertmanager 9093:9093
curl http://localhost:9093/api/v2/status

# Check alert rules
curl http://localhost:9090/api/v1/rules

# Check Alertmanager logs
kubectl logs -n prometheus prometheus-alertmanager-0 -f

Issue 6: External Secrets Not Syncing

Symptoms: Secret not created by ExternalSecret

Solutions:

# Check ExternalSecret status
kubectl describe externalsecret prometheus-basic-auth -n prometheus

# Check ClusterSecretStore
kubectl get clustersecretstore vault-backend
kubectl describe clustersecretstore vault-backend

# Check External Secrets Operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets

# Test Vault connectivity
EXTERNAL_SECRETS_POD=$(kubectl get pod -n external-secrets -l app.kubernetes.io/name=external-secrets -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n external-secrets $EXTERNAL_SECRETS_POD -- wget -O- http://vault.vault:8200/v1/sys/health

Security Best Practices

1. Credential Management

DO:

  • Store all credentials in Vault
  • Use External Secrets Operator for K8s secret sync
  • Rotate credentials regularly (quarterly minimum)
  • Use strong, randomly generated passwords

DON’T:

  • Commit credentials to git
  • Use default passwords
  • Share credentials across environments
  • Store credentials in plain text ConfigMaps

2. Network Security

DO:

  • Implement network policies for all components
  • Use namespace isolation
  • Restrict egress to necessary destinations
  • Label namespaces for policy enforcement

DON’T:

  • Allow unrestricted pod-to-pod communication
  • Expose Alertmanager publicly without auth
  • Allow broad egress to internet

3. Resource Management

DO:

  • Set resource requests and limits
  • Use PodDisruptionBudgets
  • Monitor resource usage
  • Scale based on metrics

DON’T:

  • Run without resource limits
  • Ignore OOMKilled pods
  • Over-provision resources

4. Data Retention

DO:

  • Use persistent volumes
  • Configure appropriate retention
  • Implement backup strategy
  • Monitor storage usage

DON’T:

  • Rely on ephemeral storage
  • Set unlimited retention
  • Ignore disk space alerts

5. Alert Configuration

DO:

  • Use severity labels
  • Implement inhibit rules
  • Test alerts regularly
  • Document runbooks

DON’T:

  • Alert on everything
  • Ignore alert fatigue
  • Skip testing alerts
  • Forget to document remediation

Performance Optimization

Query Performance

server:
  extraArgs:
    query.max-concurrency: 20
    query.timeout: 2m
    query.lookback-delta: 5m

Storage Performance

server:
  extraArgs:
    storage.tsdb.min-block-duration: 2h
    storage.tsdb.max-block-duration: 2h

Reduce Cardinality

Use metric relabeling to drop unnecessary metrics:

serverFiles:
  prometheus.yml:
    scrape_configs:
      - job_name: "kubernetes-pods"
        # ... existing kubernetes_sd_configs and relabel_configs for this job ...
        metric_relabel_configs:
          # Drop high-cardinality metrics by name after scraping
          - source_labels: [__name__]
            regex: "go_.*"
            action: drop

Integration with Grafana

Prometheus integrates seamlessly with Grafana for visualization.

Add Prometheus as Datasource

In Grafana’s datasource configuration:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.prometheus:80
    isDefault: true
    jsonData:
      timeInterval: 30s

Import these community dashboards:

  • 1860 - Node Exporter Full (comprehensive node metrics)
  • 7249 - Kubernetes Cluster Monitoring
  • 6417 - Kubernetes Pod Monitoring
  • 315 - Kubernetes Cluster (Prometheus)
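
If Grafana itself is managed declaratively, dashboards can also be provisioned from files instead of imported by hand. A minimal dashboard provider sketch (the folder name and path are assumptions):

apiVersion: 1

providers:
  - name: kubernetes-dashboards
    folder: Kubernetes
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards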

Monitoring Your Monitoring

Don’t forget to monitor Prometheus itself!

Key Metrics to Watch

# Prometheus memory usage
process_resident_memory_bytes

# Prometheus CPU usage
rate(process_cpu_seconds_total[5m])

# Time series count (cardinality)
prometheus_tsdb_head_series

# Query duration
histogram_quantile(0.99, rate(prometheus_http_request_duration_seconds_bucket[5m]))

# Sample ingestion rate
rate(prometheus_tsdb_head_samples_appended_total[5m])
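
These metrics are also worth alerting on. For example, a guard rail on series cardinality (the one-million threshold is an assumption; size it to your cluster):

- name: prometheus_capacity_alerts
  rules:
    - alert: PrometheusHighSeriesCardinality
      expr: prometheus_tsdb_head_series > 1000000
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "Prometheus {{ $labels.instance }} is tracking a very high number of series"
        description: "Active series: {{ $value | humanize }}. Review scrape targets and relabeling rules."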

Set Up Meta-Monitoring

Consider a separate monitoring cluster to monitor your production Prometheus:

┌─────────────────┐
│  Prod Cluster   │
│  - Prometheus   │───┐
│  - Applications │   │
└─────────────────┘   │
                      │ Federation or
                      │ Remote Write
               ┌──────────────┐
               │ Meta Monitor │
               │ - Prometheus │
               │ - Grafana    │
               └──────────────┘
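
With the remote-write option, the production Prometheus pushes samples to the meta-monitoring cluster. A sketch of the values.yml addition (the endpoint is an assumption, and the receiving Prometheus must have remote-write receiving enabled, e.g. via --web.enable-remote-write-receiver):

server:
  remoteWrite:
    - url: "https://meta-prometheus.example.com/api/v1/write" # Assumed receiver endpoint
      write_relabel_configs:
        # Forward only the pre-aggregated recording rules to keep traffic small
        - source_labels: [__name__]
          regex: "instance:.*"
          action: keep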

Backup and Disaster Recovery

Manual Backup

# Create snapshot
PROMETHEUS_POD=$(kubectl get pod -n prometheus -l app.kubernetes.io/name=prometheus-server -o jsonpath='{.items[0].metadata.name}')

kubectl exec -n prometheus $PROMETHEUS_POD -- \
  curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot

# Copy snapshot
kubectl cp prometheus/$PROMETHEUS_POD:/data/snapshots/<timestamp> ./prometheus-backup
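
The snapshot endpoint only works when the TSDB admin API is enabled, which this chart exposes through extra server flags. A sketch for values.yml (verify the key against your chart version; note the admin API also exposes deletion endpoints, so keep it behind authentication):

server:
  extraFlags:
    - web.enable-lifecycle # Chart default, keeps config reloads working
    - web.enable-admin-api # Required for /api/v1/admin/tsdb/snapshot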

Automated Backup with Velero

# Install Velero
velero install --provider aws --bucket prometheus-backups

# Create backup schedule
velero schedule create prometheus-daily \
  --schedule="0 2 * * *" \
  --include-namespaces prometheus \
  --ttl 720h

# Restore from backup
velero restore create --from-backup prometheus-daily-<timestamp>

Configuration Backup

Always back up your configuration:

# Backup all configuration
kubectl get -n prometheus \
  configmap,secret,externalsecret,middleware,ingress,pvc \
  -o yaml > prometheus-config-backup.yaml

# Store in git (excluding secrets!)
git add helmfile.yaml values.yml external-secret.yml auth-middleware.yml network-policy.yml
git commit -m "Backup Prometheus configuration"

Scaling Considerations

Vertical Scaling

For larger clusters, increase resources:

server:
  resources:
    limits:
      cpu: 4000m
      memory: 8Gi
    requests:
      cpu: 1000m
      memory: 4Gi

  persistentVolume:
    size: 100Gi

  retention: "90d"
  retentionSize: "95GB"

Horizontal Scaling with Federation

For multi-cluster deployments, use federation:

# On central Prometheus
server:
  extraScrapeConfigs: |
    - job_name: 'federate-clusters'
      scrape_interval: 30s
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
          - '{job=~".+"}'
      static_configs:
        - targets:
          - 'prometheus-cluster1.example.com'
          - 'prometheus-cluster2.example.com'

Remote Storage

For long-term retention, use remote storage:

server:
  remoteWrite:
    - url: "http://cortex.monitoring:9009/api/prom/push"
      queue_config:
        capacity: 10000
        max_shards: 200
        min_shards: 1
        max_samples_per_send: 500
        batch_send_deadline: 5s

Cost Optimization

Reduce Storage Costs

  1. Tune retention:
retention: "15d" # Instead of 30d
retentionSize: "20GB" # Instead of 45GB
  2. Drop unnecessary metrics:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "(go_.*|process_.*)"
    action: drop
  3. Use remote storage with compression

Reduce Compute Costs

  1. Use recording rules for expensive queries
  2. Adjust scrape intervals:
global:
  scrape_interval: 60s # Instead of 30s for non-critical metrics
  3. Right-size resources based on actual usage

Conclusion

You now have a production-ready Prometheus deployment with:

  ✅ Secure authentication via Vault and Traefik
  ✅ High availability with StatefulSets
  ✅ Persistent storage for data retention
  ✅ Comprehensive alerting rules
  ✅ Network policies for zero-trust security
  ✅ Resource limits to prevent exhaustion
  ✅ Integration with Grafana
  ✅ Backup and disaster recovery strategies

Next Steps

  1. Configure Alertmanager receivers (Slack, PagerDuty, email)
  2. Create Grafana dashboards for your applications
  3. Set up ServiceMonitors for application metrics
  4. Implement federation for multi-cluster monitoring
  5. Configure remote storage for long-term retention
  6. Test disaster recovery procedures

Key Takeaways

  • Security is not optional: Always implement authentication and network policies
  • Persistence matters: Use persistent volumes to prevent data loss
  • Monitor your monitoring: Set up alerts for Prometheus itself
  • Start simple, iterate: Begin with basic alerts and refine based on experience
  • Document everything: Runbooks save time during incidents
  • Test regularly: Verify backups and disaster recovery procedures

About the Author

This guide is based on real-world production deployments at scale. For questions or suggestions, feel free to reach out!


Tags: #kubernetes #prometheus #monitoring #observability #devops #security #vault #traefik #helm #production

Last Updated: January 30, 2026
