Observability is a critical component of running production workloads on Kubernetes. When running Amazon Elastic Kubernetes Service (EKS), you need a reliable way to collect, aggregate, and analyze logs from all your containers. In this comprehensive guide, I’ll walk you through setting up Fluent Bit to export your EKS container logs to Amazon CloudWatch Logs.
By the end of this tutorial, you’ll have a production-ready logging pipeline that automatically collects logs from all containers in your EKS cluster and routes them to organized CloudWatch Log Groups based on Kubernetes namespaces.
Table of Contents
- What is Fluent Bit?
- Why Use Fluent Bit with Amazon EKS?
- Architecture Overview
- Prerequisites
- Setting Up IAM Roles for Service Accounts (IRSA)
- Creating the Helm Values Configuration
- Deploying Fluent Bit with Helmfile
- Verifying the Deployment
- Understanding the Log Routing
- Troubleshooting Common Issues
- Best Practices and Recommendations
- Conclusion
What is Fluent Bit?
Fluent Bit is a lightweight, high-performance log processor and forwarder. It’s part of the Fluentd ecosystem but is designed specifically for containerized environments where resource efficiency is critical.
Key Features of Fluent Bit
- Lightweight: Written in C, with a minimal memory footprint (~450KB)
- High Performance: Can handle millions of records per second
- Pluggable Architecture: Supports multiple inputs, filters, and outputs
- Kubernetes Native: Built-in support for Kubernetes metadata enrichment
- Cloud Native: Native integration with AWS, Azure, GCP, and other cloud providers
Fluent Bit vs Fluentd
| Feature | Fluent Bit | Fluentd |
|---|---|---|
| Memory Footprint | ~450KB | ~40MB |
| Language | C | Ruby |
| Plugin Ecosystem | Growing | Extensive |
| Use Case | Edge/Container logging | Central aggregation |
| Performance | Higher throughput | Good throughput |
For container logging in Kubernetes, Fluent Bit is the preferred choice due to its efficiency and native Kubernetes support.
Why Use Fluent Bit with Amazon EKS?
Amazon EKS doesn’t provide built-in container log collection. By default, container logs are stored on individual nodes in /var/log/containers/ and are lost when nodes are terminated or replaced. This is problematic for several reasons:
- Ephemeral Nodes: EKS nodes can be replaced at any time (especially with Karpenter or Cluster Autoscaler)
- Distributed Logs: Logs are scattered across multiple nodes
- No Centralized Search: You can’t search across all container logs
- No Retention: Logs are lost when pods restart or nodes terminate
Benefits of CloudWatch Logs Integration
- Centralized Logging: All logs in one place
- Retention Policies: Configure how long to keep logs
- CloudWatch Logs Insights: Powerful query language for log analysis
- Alarms and Metrics: Create alarms based on log patterns
- Integration: Works with AWS services like Lambda, SNS, and EventBridge
Architecture Overview
Here’s how the logging architecture works:
```text
                          EKS Cluster
  Node 1                Node 2                Node 3
  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
  │ Container    │      │ Container    │      │ Container    │
  │ logs         │      │ logs         │      │ logs         │
  │      │       │      │      │       │      │      │       │
  │      ▼       │      │      ▼       │      │      ▼       │
  │  Fluent Bit  │      │  Fluent Bit  │      │  Fluent Bit  │
  │  DaemonSet   │      │  DaemonSet   │      │  DaemonSet   │
  └──────┬───────┘      └──────┬───────┘      └──────┬───────┘
         │                     │                     │
         └─────────────────────┼─────────────────────┘
                               │  IRSA authentication
                               ▼
                     Amazon CloudWatch Logs
      Log Group: /aws/eks/{cluster-name}/prod
      Log Group: /aws/eks/{cluster-name}/stage
      Log Group: /aws/eks/{cluster-name}/application
        └── Log Stream: pod-{name}.{namespace}.{container}
```
How It Works
- Container Logs: Kubernetes writes all container stdout/stderr to /var/log/containers/ (see the example after this list)
- Fluent Bit DaemonSet: A Fluent Bit pod runs on every node, reading log files
- Metadata Enrichment: Fluent Bit adds Kubernetes metadata (pod name, namespace, labels)
- Log Routing: Logs are routed to different CloudWatch Log Groups based on namespace
- IRSA Authentication: Fluent Bit authenticates to CloudWatch using IAM Roles for Service Accounts
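The namespace-based routing configured later in this post keys off these file names: the kubelet names each file (or symlink) in /var/log/containers/ as {pod}_{namespace}_{container}-{container-id}.log, so the namespace is recoverable from the path alone. A quick way to see this on a worker node (the pod names below are only illustrative):

```bash
# List container log files on a worker node (names are illustrative)
ls /var/log/containers/
# myapp-7d9f8b6c5d-abc12_prod_myapp-0123456789abcdef....log
# redis-0_stage_redis-abcdef0123456789....log
# coredns-5d78c9869d-xyz12_kube-system_coredns-fedcba9876543210....log
```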
Prerequisites
Before starting, ensure you have the following:
```bash
# AWS CLI v2
aws --version
# aws-cli/2.x.x
# kubectl
kubectl version --client
# Client Version: v1.28.x
# Helm v3
helm version
# version.BuildInfo{Version:"v3.x.x"}
# Helmfile (optional but recommended)
helmfile --version
# helmfile version v0.x.x
# eksctl (for IRSA setup)
eksctl version
# 0.x.x
```
AWS Requirements
- An existing EKS cluster with OIDC provider enabled
- AWS account permissions to create IAM roles and policies
- Access to create CloudWatch Log Groups
Verify EKS OIDC Provider
IRSA (IAM Roles for Service Accounts) requires an OIDC provider. Verify it’s configured:
```bash
# Get your cluster's OIDC provider
aws eks describe-cluster \
--name YOUR_CLUSTER_NAME \
--query "cluster.identity.oidc.issuer" \
--output text
# Should return something like:
# https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE
```
If no OIDC provider exists, create one:
```bash
eksctl utils associate-iam-oidc-provider \
--cluster YOUR_CLUSTER_NAME \
--approve
```
Setting Up IAM Roles for Service Accounts (IRSA)
IRSA allows Kubernetes pods to assume IAM roles without needing to store AWS credentials. This is the secure, recommended way to grant AWS permissions to pods.
Step 1: Create the IAM Policy
First, create an IAM policy that grants CloudWatch Logs permissions:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FluentBitCloudWatchLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": "*"
    }
  ]
}
```
Save this as iam-policy.json and create the policy:
```bash
aws iam create-policy \
--policy-name FluentBitCloudWatchLogsPolicy \
--policy-document file://iam-policy.json \
--description "Allows Fluent Bit to write logs to CloudWatch"
```
Note the policy ARN from the output (e.g., arn:aws:iam::123456789012:policy/FluentBitCloudWatchLogsPolicy).
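If you'd rather not copy the ARN by hand, you can look it up with a JMESPath query after the policy exists (a small convenience sketch):

```bash
# Fetch the ARN of the policy created above
POLICY_ARN=$(aws iam list-policies \
  --scope Local \
  --query "Policies[?PolicyName=='FluentBitCloudWatchLogsPolicy'].Arn" \
  --output text)

echo "${POLICY_ARN}"
# arn:aws:iam::123456789012:policy/FluentBitCloudWatchLogsPolicy
```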
Step 2: Create the IAM Role with OIDC Trust
Using Terraform (recommended for production):
```hcl
# terraform-iam.tf

# Data source to get the EKS cluster OIDC provider
data "aws_eks_cluster" "cluster" {
  name = var.cluster_name
}

# Current AWS account ID (used to build the policy ARN below)
data "aws_caller_identity" "current" {}

# Extract OIDC provider URL without https://
locals {
  oidc_provider = replace(data.aws_eks_cluster.cluster.identity[0].oidc[0].issuer, "https://", "")
}

# Data source for the OIDC provider ARN
data "aws_iam_openid_connect_provider" "cluster" {
  url = data.aws_eks_cluster.cluster.identity[0].oidc[0].issuer
}

# IAM Role for Fluent Bit
resource "aws_iam_role" "fluent_bit_cloudwatch" {
  name = "${var.cluster_name}-fluent-bit-cloudwatch-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Federated = data.aws_iam_openid_connect_provider.cluster.arn
        }
        Action = "sts:AssumeRoleWithWebIdentity"
        Condition = {
          StringEquals = {
            "${local.oidc_provider}:aud" = "sts.amazonaws.com"
            "${local.oidc_provider}:sub" = "system:serviceaccount:amazon-cloudwatch:aws-for-fluent-bit"
          }
        }
      }
    ]
  })

  tags = {
    Purpose = "Fluent Bit CloudWatch logging"
    Cluster = var.cluster_name
  }
}

# Attach the CloudWatch Logs policy
resource "aws_iam_role_policy_attachment" "fluent_bit_cloudwatch" {
  role       = aws_iam_role.fluent_bit_cloudwatch.name
  policy_arn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:policy/FluentBitCloudWatchLogsPolicy"
}

# Output the role ARN for use in Helm values
output "fluent_bit_role_arn" {
  description = "IAM role ARN for Fluent Bit service account"
  value       = aws_iam_role.fluent_bit_cloudwatch.arn
}
```
Alternatively, using eksctl:
```bash
eksctl create iamserviceaccount \
--cluster=YOUR_CLUSTER_NAME \
--namespace=amazon-cloudwatch \
--name=aws-for-fluent-bit \
--attach-policy-arn=arn:aws:iam::YOUR_ACCOUNT_ID:policy/FluentBitCloudWatchLogsPolicy \
--approve \
--override-existing-serviceaccounts
```
Understanding the Trust Relationship
The trust policy is critical for security. Let’s break it down:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE:aud": "sts.amazonaws.com",
          "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:amazon-cloudwatch:aws-for-fluent-bit"
        }
      }
    }
  ]
}
```
- Federated Principal: Only the specific EKS cluster’s OIDC provider can assume this role
- aud Condition: Ensures the token is intended for AWS STS
- sub Condition: Restricts to only the specific service account (aws-for-fluent-bit in the amazon-cloudwatch namespace)
This means even if another pod in the same cluster tries to use this role, it will be denied unless it’s running as the exact service account specified.
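Once Fluent Bit is deployed (later in this post), you can see this wiring from the pod's side: the EKS pod identity webhook injects the role ARN and a projected token path into every pod that uses the annotated service account. A quick check (the pod name is a placeholder; substitute one of your Fluent Bit pods):

```bash
# The IRSA webhook injects these variables into pods using the service account
kubectl exec -n amazon-cloudwatch aws-for-fluent-bit-abc12 -- env | \
  grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'
# AWS_ROLE_ARN=arn:aws:iam::123456789012:role/staging-us-west-2-fluent-bit-cloudwatch-role
# AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
```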
Creating the Helm Values Configuration
AWS provides an official Helm chart for Fluent Bit: aws-for-fluent-bit. We’ll configure it with comprehensive settings for production use.
Step 1: Create the Helmfile
Create helmfile.yaml:
```yaml
# helmfile.yaml
repositories:
  - name: aws-eks-charts
    url: https://aws.github.io/eks-charts

releases:
  - name: aws-for-fluent-bit
    namespace: amazon-cloudwatch
    createNamespace: true
    chart: aws-eks-charts/aws-for-fluent-bit
    version: "0.2.0"
    timeout: 300
    values:
      - values-{{ requiredEnv "CLUSTER_ENV" }}-{{ requiredEnv "CLUSTER_REGION" }}.yml
```
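Before applying anything, it can be useful to render the chart locally and confirm the environment-specific values file resolves as expected. helmfile template only renders manifests; helmfile diff (which requires the helm-diff plugin) shows what would change against the live release:

```bash
# Render manifests locally to confirm the values file is picked up
export CLUSTER_ENV="stage"
export CLUSTER_REGION="us-west-2"
helmfile template | less

# Show what would change against the currently installed release
# (requires the helm-diff plugin)
helmfile diff
```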
Step 2: Create the Values File
Create a values file for your environment. Here’s a comprehensive example for a staging cluster in us-west-2:
```yaml
# values-stage-us-west-2.yml

# Global settings
global:
  namespaceOverride: "amazon-cloudwatch"

# Container image configuration
image:
  repository: public.ecr.aws/aws-observability/aws-for-fluent-bit
  tag: "2.32.2"
  pullPolicy: IfNotPresent

# Service Account configuration with IRSA
serviceAccount:
  create: true
  name: aws-for-fluent-bit
  annotations:
    # Replace with your IAM role ARN
    eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/staging-us-west-2-fluent-bit-cloudwatch-role"

# Environment variables
env:
  - name: AWS_REGION
    value: "us-west-2"
  - name: CLUSTER_NAME
    value: "staging-us-west-2"

# Resource limits and requests
resources:
  limits:
    memory: 250Mi
  requests:
    cpu: 100m
    memory: 100Mi

# Tolerations - ensure Fluent Bit runs on ALL nodes
tolerations:
  - operator: Exists
    effect: NoSchedule

# Fluent Bit Configuration
config:
  # Service configuration
  service: |
    [SERVICE]
        Daemon          Off
        Flush           5
        Log_Level       info
        Parsers_File    /fluent-bit/etc/parsers.conf
        HTTP_Server     On
        HTTP_Listen     0.0.0.0
        HTTP_Port       2020
        Health_Check    On

  # Input configuration - read container logs
  inputs: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10

  # Filters for Kubernetes metadata enrichment
  filters: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              On
        Annotations         Off

    [FILTER]
        Name          nest
        Match         kube.*
        Operation     lift
        Nested_under  kubernetes
        Add_prefix    k8s_

    [FILTER]
        Name    modify
        Match   kube.*
        Add     cluster_name ${CLUSTER_NAME}

  # CloudWatch outputs - route by namespace
  outputs: |
    # Production namespace logs
    [OUTPUT]
        Name                 cloudwatch_logs
        Match_Regex          kube\.var\.log\.containers\..+_prod_.+
        region               ${AWS_REGION}
        log_group_name       /aws/eks/${CLUSTER_NAME}/prod
        log_stream_prefix    pod-
        log_stream_template  $kubernetes['pod_name'].$kubernetes['namespace_name'].$kubernetes['container_name']
        auto_create_group    true
        log_key              log
        log_format           json/emf
        Retry_Limit          2

    # Staging namespace logs
    [OUTPUT]
        Name                 cloudwatch_logs
        Match_Regex          kube\.var\.log\.containers\..+_stage_.+
        region               ${AWS_REGION}
        log_group_name       /aws/eks/${CLUSTER_NAME}/stage
        log_stream_prefix    pod-
        log_stream_template  $kubernetes['pod_name'].$kubernetes['namespace_name'].$kubernetes['container_name']
        auto_create_group    true
        log_key              log
        log_format           json/emf
        Retry_Limit          2

    # All other namespaces (default)
    [OUTPUT]
        Name                 cloudwatch_logs
        Match                kube.*
        region               ${AWS_REGION}
        log_group_name       /aws/eks/${CLUSTER_NAME}/application
        log_stream_prefix    pod-
        log_stream_template  $kubernetes['pod_name'].$kubernetes['namespace_name'].$kubernetes['container_name']
        auto_create_group    true
        log_key              log
        log_format           json/emf
        Retry_Limit          2

# Disable other outputs (we only want CloudWatch)
cloudWatchLogs:
  enabled: false  # We define custom outputs above
firehose:
  enabled: false
kinesis:
  enabled: false
elasticsearch:
  enabled: false

# DaemonSet specific settings
hostNetwork: false

# Volume mounts for log access
volumes:
  - name: varlog
    hostPath:
      path: /var/log
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers
  - name: fluent-bit-state
    hostPath:
      path: /var/fluent-bit/state

volumeMounts:
  - name: varlog
    mountPath: /var/log
    readOnly: true
  - name: varlibdockercontainers
    mountPath: /var/lib/docker/containers
    readOnly: true
  - name: fluent-bit-state
    mountPath: /var/fluent-bit/state
```
Understanding the Configuration
Let’s break down the key sections:
Service Configuration
```ini
[SERVICE]
    Daemon        Off     # Run in foreground (required for containers)
    Flush         5       # Flush logs every 5 seconds
    Log_Level     info    # Logging verbosity
    HTTP_Server   On      # Enable metrics endpoint
    HTTP_Port     2020    # Metrics port for Prometheus scraping
    Health_Check  On      # Enable health checks
```
Input Configuration
```ini
[INPUT]
    Name              tail                          # Use the tail input plugin
    Tag               kube.*                        # Tag all logs with kube prefix
    Path              /var/log/containers/*.log     # Read all container log files
    Parser            docker                        # Parse Docker JSON format
    DB                /var/fluent-bit/state/flb_container.db   # State DB for tracking position
    Mem_Buf_Limit     5MB                           # Memory buffer limit per file
    Skip_Long_Lines   On                            # Skip lines > 32KB
    Refresh_Interval  10                            # Check for new files every 10s
```
Kubernetes Filter
```ini
[FILTER]
    Name         kubernetes                            # Kubernetes metadata filter
    Match        kube.*                                # Apply to all kube-tagged logs
    Kube_URL     https://kubernetes.default.svc:443    # K8s API endpoint
    Merge_Log    On                                    # Parse and merge JSON logs
    Labels       On                                    # Include pod labels
    Annotations  Off                                   # Exclude annotations (can be verbose)
```
CloudWatch Output
```ini
[OUTPUT]
    Name                 cloudwatch_logs                        # CloudWatch Logs output plugin
    Match                kube.*                                 # Match all kube logs
    region               ${AWS_REGION}                          # AWS region (from env var)
    log_group_name       /aws/eks/${CLUSTER_NAME}/application
    log_stream_template  $kubernetes['pod_name']...             # Dynamic stream naming
    auto_create_group    true                                   # Create log group if missing
    Retry_Limit          2                                      # Retry failed writes twice
```
Deploying Fluent Bit with Helmfile
Create a Deployment Script
Create deploy.sh for easy deployment across environments:
```bash
#!/bin/bash
set -e

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Supported environments and regions
ENVIRONMENTS=("stage" "prod")
REGIONS=("us-west-2" "eu-west-1")

usage() {
  echo "Usage: $0 <environment> <region>"
  echo "       $0 all"
  echo "       $0 verify <environment> <region>"
  echo ""
  echo "Examples:"
  echo "  $0 stage us-west-2          # Deploy to staging US"
  echo "  $0 prod eu-west-1           # Deploy to prod EU"
  echo "  $0 all                      # Deploy to all clusters"
  echo "  $0 verify stage us-west-2   # Verify deployment"
  exit 1
}

deploy_cluster() {
  local env=$1
  local region=$2

  echo -e "${YELLOW}Deploying Fluent Bit to ${env}-${region}...${NC}"

  # Set kubectl context (adjust based on your context naming)
  kubectl config use-context "arn:aws:eks:${region}:${AWS_ACCOUNT_ID}:cluster/${env}-${region}"

  # Export environment variables for helmfile
  export CLUSTER_ENV="${env}"
  export CLUSTER_REGION="${region}"

  # Run helmfile
  helmfile apply

  echo -e "${GREEN}Successfully deployed to ${env}-${region}${NC}"
}

verify_deployment() {
  local env=$1
  local region=$2

  echo -e "${YELLOW}Verifying deployment in ${env}-${region}...${NC}"

  # Switch context
  kubectl config use-context "arn:aws:eks:${region}:${AWS_ACCOUNT_ID}:cluster/${env}-${region}"

  # Check DaemonSet status
  echo "DaemonSet Status:"
  kubectl get daemonset -n amazon-cloudwatch aws-for-fluent-bit

  # Check pods
  echo ""
  echo "Pod Status:"
  kubectl get pods -n amazon-cloudwatch -l app.kubernetes.io/name=aws-for-fluent-bit

  # Check logs from one pod
  echo ""
  echo "Recent logs from Fluent Bit:"
  kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=aws-for-fluent-bit --tail=20
}

# Main logic
case "${1}" in
  "all")
    for env in "${ENVIRONMENTS[@]}"; do
      for region in "${REGIONS[@]}"; do
        deploy_cluster "$env" "$region"
      done
    done
    ;;
  "verify")
    if [[ -z "$2" || -z "$3" ]]; then
      usage
    fi
    verify_deployment "$2" "$3"
    ;;
  *)
    if [[ -z "$1" || -z "$2" ]]; then
      usage
    fi
    deploy_cluster "$1" "$2"
    ;;
esac
```
Make it executable:
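```bash
chmod +x deploy.sh
```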
Deploy to Your Cluster
```bash
# Set your AWS account ID
export AWS_ACCOUNT_ID="123456789012"
# Deploy to a specific cluster
./deploy.sh stage us-west-2
# Or deploy to all clusters
./deploy.sh all
```
Manual Deployment with Helm
If you prefer not to use Helmfile:
```bash
# Add the Helm repository
helm repo add aws-eks-charts https://aws.github.io/eks-charts
helm repo update
# Create the namespace
kubectl create namespace amazon-cloudwatch
# Install the chart
helm install aws-for-fluent-bit aws-eks-charts/aws-for-fluent-bit \
--namespace amazon-cloudwatch \
--values values-stage-us-west-2.yml
```
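For later configuration changes, upgrade the release in place instead of reinstalling, and roll back if the new configuration misbehaves:

```bash
# Apply an updated values file to the existing release
helm upgrade aws-for-fluent-bit aws-eks-charts/aws-for-fluent-bit \
  --namespace amazon-cloudwatch \
  --values values-stage-us-west-2.yml

# Roll back to the previous revision if needed
helm rollback aws-for-fluent-bit --namespace amazon-cloudwatch
```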
Verifying the Deployment
After deployment, verify everything is working correctly.
Check DaemonSet Status
```bash
kubectl get daemonset -n amazon-cloudwatch
# Expected output:
# NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
# aws-for-fluent-bit 3 3 3 3 3 <none> 5m
```
The DESIRED count should match your number of nodes, and all should be READY.
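A quick way to compare the two numbers directly (taints and tolerations aside, the counts should line up):

```bash
# Node count vs. ready Fluent Bit pods
kubectl get nodes --no-headers | wc -l
kubectl get daemonset aws-for-fluent-bit -n amazon-cloudwatch \
  -o jsonpath='{.status.numberReady}{"\n"}'
```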
Check Pod Status
```bash
kubectl get pods -n amazon-cloudwatch -o wide
# Expected output:
# NAME READY STATUS RESTARTS AGE IP NODE
# aws-for-fluent-bit-abc12 1/1 Running 0 5m 10.0.1.50 ip-10-0-1-100.compute.internal
# aws-for-fluent-bit-def34 1/1 Running 0 5m 10.0.2.50 ip-10-0-2-100.compute.internal
# aws-for-fluent-bit-ghi56 1/1 Running 0 5m 10.0.3.50 ip-10-0-3-100.compute.internal
```
Check Fluent Bit Logs
```bash
kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=aws-for-fluent-bit --tail=50
# Look for successful CloudWatch connection:
# [2025/02/18 10:00:00] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Created log group /aws/eks/staging-us-west-2/application
```
Verify Service Account IRSA
```bash
# Check service account annotations
kubectl get serviceaccount -n amazon-cloudwatch aws-for-fluent-bit -o yaml
# Should show:
# metadata:
# annotations:
# eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/staging-us-west-2-fluent-bit-cloudwatch-role
```
Check CloudWatch Log Groups
```bash
aws logs describe-log-groups --log-group-name-prefix /aws/eks/staging-us-west-2
# Expected output:
# {
# "logGroups": [
# {
# "logGroupName": "/aws/eks/staging-us-west-2/application",
# "creationTime": 1708251600000,
# "storedBytes": 12345
# }
# ]
# }
```
Test Log Delivery
Create a test pod that generates logs:
```bash
# Create a test pod (single quotes keep $(date) from expanding on your local shell)
kubectl run log-test --image=busybox --restart=Never -- \
  sh -c 'while true; do echo "Test log message at $(date)"; sleep 5; done'
# Wait a minute, then check CloudWatch
aws logs filter-log-events \
--log-group-name /aws/eks/staging-us-west-2/application \
--filter-pattern "Test log message"
# Clean up
kubectl delete pod log-test
```
Understanding the Log Routing
Our configuration routes logs to different CloudWatch Log Groups based on the Kubernetes namespace:
| Namespace | Log Group |
|---|---|
| prod | /aws/eks/{cluster}/prod |
| stage | /aws/eks/{cluster}/stage |
| All others | /aws/eks/{cluster}/application |
Log Stream Naming
Each pod gets its own log stream with the naming pattern:
```
pod-{pod-name}.{namespace}.{container-name}
```
For example:
- pod-myapp-7d9f8b6c5d-abc12.prod.myapp
- pod-redis-0.stage.redis
- pod-nginx-ingress-controller-xyz.kube-system.controller
Querying Logs with CloudWatch Logs Insights
Once logs are in CloudWatch, you can use Logs Insights for powerful queries:
```
# Find all errors in the prod namespace
fields @timestamp, @message, k8s_pod_name
| filter @logGroup = '/aws/eks/staging-us-west-2/prod'
| filter @message like /error|Error|ERROR/
| sort @timestamp desc
| limit 100
```
```
# Count logs by pod in the last hour
fields @timestamp, k8s_pod_name
| filter @logGroup = '/aws/eks/staging-us-west-2/application'
| stats count(*) by k8s_pod_name
| sort count desc
| limit 20
```
```
# Find slow API requests (assuming JSON logs with duration field)
fields @timestamp, @message, duration
| filter @logGroup = '/aws/eks/staging-us-west-2/prod'
| filter duration > 1000
| sort duration desc
| limit 50
```
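The same queries can be run from the CLI, which is handy for scripting; a minimal sketch (the date arithmetic below uses GNU date, so adjust on macOS):

```bash
# Start a Logs Insights query over the last hour
QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/eks/staging-us-west-2/application \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, @message | sort @timestamp desc | limit 20' \
  --query 'queryId' --output text)

# Poll for results (repeat until the status is Complete)
aws logs get-query-results --query-id "${QUERY_ID}"
```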
Troubleshooting Common Issues
Issue 1: Pods in CrashLoopBackOff
Symptoms:
```bash
kubectl get pods -n amazon-cloudwatch
# NAME READY STATUS RESTARTS AGE
# aws-for-fluent-bit-abc12 0/1 CrashLoopBackOff 5 10m
```
Solution:
Check logs for configuration errors:
```bash
kubectl logs -n amazon-cloudwatch aws-for-fluent-bit-abc12 --previous
# Common causes:
# - Invalid Fluent Bit configuration syntax
# - Missing environment variables
# - Invalid IAM role ARN
```
Issue 2: No Logs in CloudWatch
Symptoms: Pods are running but no logs appear in CloudWatch.
Debugging Steps:
- Check Fluent Bit metrics:
```bash
kubectl port-forward -n amazon-cloudwatch svc/aws-for-fluent-bit 2020:2020 &
curl http://localhost:2020/api/v1/metrics/prometheus
```
- Check for AWS errors:
```bash
kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=aws-for-fluent-bit | grep -i error
```
- Verify IRSA is working:
```bash
# Exec into a Fluent Bit pod
kubectl exec -it -n amazon-cloudwatch aws-for-fluent-bit-abc12 -- sh
# Try to get credentials (inside the pod)
aws sts get-caller-identity
```
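If the Fluent Bit image doesn't ship the AWS CLI, you can instead launch a throwaway pod that uses the same service account and see which identity it assumes (a sketch; it assumes the amazon/aws-cli image is pullable from your nodes):

```bash
# Run a one-off pod with the Fluent Bit service account and check the assumed identity
kubectl run irsa-check -n amazon-cloudwatch --rm -it --restart=Never \
  --image=amazon/aws-cli \
  --overrides='{"spec":{"serviceAccountName":"aws-for-fluent-bit"}}' \
  -- sts get-caller-identity
# The "Arn" should reference assumed-role/<your-fluent-bit-role>/...
```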
Issue 3: IRSA Not Working
Symptoms:
```
AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/... is not authorized
```
Solution:
- Verify the service account annotation:
```bash
kubectl get sa -n amazon-cloudwatch aws-for-fluent-bit -o jsonpath='{.metadata.annotations}'
```
- Verify the IAM role trust policy matches the OIDC provider and service account (see the command below)
- Ensure the namespace and service account name match exactly
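For the trust policy check, you can pull it straight from IAM and compare the aud and sub conditions against your OIDC provider and service account (role name taken from the Terraform example above):

```bash
# Show the role's trust policy
aws iam get-role \
  --role-name staging-us-west-2-fluent-bit-cloudwatch-role \
  --query 'Role.AssumeRolePolicyDocument' \
  --output json
```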
Issue 4: High Memory Usage
Symptoms: Fluent Bit pods being OOMKilled.
Solution:
- Increase memory limits:
```yaml
resources:
  limits:
    memory: 500Mi  # Increase from 250Mi
```
- Reduce buffer sizes:
```ini
[INPUT]
    Mem_Buf_Limit  2MB   # Reduce from 5MB
```
- Enable more aggressive flushing:
```ini
[SERVICE]
    Flush  1   # Flush every second instead of 5
```
Issue 5: Missing Kubernetes Metadata
Symptoms: Logs arrive in CloudWatch but without pod name, namespace, etc.
Solution:
Ensure the Kubernetes filter is correctly configured:
```ini
[FILTER]
    Name             kubernetes
    Match            kube.*
    Kube_URL         https://kubernetes.default.svc:443
    Kube_CA_File     /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token
```
Also verify the service account has permissions to query the Kubernetes API.
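You can test those permissions directly by impersonating the service account (the chart's ClusterRole should allow at least read access to pods):

```bash
# Impersonate the Fluent Bit service account and check pod read access
kubectl auth can-i get pods --all-namespaces \
  --as=system:serviceaccount:amazon-cloudwatch:aws-for-fluent-bit
# yes
```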
Best Practices and Recommendations
1. Set Log Retention Policies
CloudWatch logs are retained indefinitely by default. Set retention policies to control costs:
```bash
# Set 30-day retention
aws logs put-retention-policy \
--log-group-name /aws/eks/staging-us-west-2/application \
--retention-in-days 30
```
Or use Terraform:
```hcl
resource "aws_cloudwatch_log_group" "eks_app_logs" {
  name              = "/aws/eks/${var.cluster_name}/application"
  retention_in_days = 30

  tags = {
    Environment = var.environment
    Cluster     = var.cluster_name
  }
}
```
2. Use Structured JSON Logging
Configure your applications to output structured JSON logs:
```json
{
  "timestamp": "2025-02-18T10:00:00Z",
  "level": "INFO",
  "message": "User logged in",
  "user_id": "12345",
  "duration_ms": 150
}
```
This enables powerful queries in CloudWatch Logs Insights:
```
fields @timestamp, message, user_id, duration_ms
| filter level = 'ERROR'
| sort @timestamp desc
```
3. Implement Log-Based Alarms
Create CloudWatch Alarms based on log patterns:
```bash
# Create a metric filter for errors
aws logs put-metric-filter \
--log-group-name /aws/eks/staging-us-west-2/prod \
--filter-name ErrorCount \
--filter-pattern "ERROR" \
--metric-transformations \
metricName=ErrorCount,metricNamespace=EKSLogs,metricValue=1
# Create an alarm
aws cloudwatch put-metric-alarm \
--alarm-name "EKS-Prod-High-Errors" \
--metric-name ErrorCount \
--namespace EKSLogs \
--statistic Sum \
--period 300 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-west-2:123456789012:alerts
```
4. Exclude Noisy Logs
Filter out verbose or unnecessary logs to reduce costs:
```ini
[FILTER]
    Name     grep
    Match    kube.*
    Exclude  log healthcheck
    Exclude  log kube-probe
```
5. Use Separate IAM Roles Per Cluster
For security and blast radius reduction, use separate IAM roles for each cluster:
- staging-us-west-2-fluent-bit-role
- prod-us-west-2-fluent-bit-role
This allows you to:
- Revoke access to one cluster without affecting others
- Apply different permissions per environment
- Track CloudWatch API calls per cluster
6. Monitor Fluent Bit Itself
Create a dashboard to monitor Fluent Bit health:
```bash
# Expose metrics for Prometheus
# The Helm chart already enables this on port 2020

# Create a ServiceMonitor for Prometheus Operator
cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-for-fluent-bit
  endpoints:
    - port: http
      interval: 30s
      path: /api/v1/metrics/prometheus
EOF
```
Key metrics to monitor:
- fluentbit_input_records_total - Records read
- fluentbit_output_retries_total - CloudWatch retries
- fluentbit_output_errors_total - Output errors
7. Plan for Multi-Cluster
If you have multiple clusters, standardize your log group naming:
```
/aws/eks/{environment}-{region}/{namespace}
```
This enables cross-cluster queries in CloudWatch Logs Insights by selecting multiple log groups.
Conclusion
You now have a production-ready logging pipeline that:
- Collects logs from all containers in your EKS cluster
- Enriches logs with Kubernetes metadata (pod name, namespace, labels)
- Routes logs to organized CloudWatch Log Groups
- Uses secure IRSA authentication (no static credentials)
- Scales automatically with your cluster (DaemonSet)
This setup provides the foundation for observability in your EKS clusters. You can extend it by:
- Adding CloudWatch Alarms for error detection
- Creating dashboards in CloudWatch
- Setting up log-based anomaly detection
- Exporting logs to S3 for long-term archival
Fluent Bit’s efficiency makes it ideal for high-throughput logging scenarios, and its tight integration with AWS services through the official aws-for-fluent-bit chart makes it the recommended choice for EKS logging to CloudWatch.
Next Steps
- Set up CloudWatch Dashboards: Create visualizations for your log data
- Configure Alarms: Alert on error patterns and anomalies
- Implement Log Insights Queries: Save common queries for your team
- Consider Log Archival: Export to S3 for long-term, cost-effective storage
- Explore Container Insights: Enable full observability with metrics and traces
Have questions or run into issues? Feel free to reach out in the comments below!