In this task, you add Kubernetes-native monitoring to your k3s cluster using Prometheus (metrics collection), Grafana (visualization), and k9s (terminal UI). This complements the Datadog infrastructure monitoring from Task D.
Why both Datadog AND Prometheus?
- Datadog (Task D): Monitors the operating system — CPU, memory, disk, network. Installed via Ansible. Runs as a system service. Data goes to Datadog's cloud (SaaS).
- Prometheus (Task K): Monitors Kubernetes objects — pods, deployments, containers, HPA. Runs INSIDE the cluster as pods. Data stays in-cluster.
Think of it this way: Datadog is the building security camera (watches the physical servers). Prometheus is the restaurant manager (watches the kitchen staff — your pods).
What you'll do:
- Understand the monitoring architecture
- Learn Helm — charts, releases, repositories, values, upgrade, rollback
- Install kube-prometheus-stack via Helm (Helm as the real-world use case)
- Access Grafana and explore pre-built dashboards
- Monitor HealthPulse pods and observe scaling events
- Create a custom HealthPulse dashboard with PromQL
- Set up alerts in Grafana
- Use k9s alongside Grafana for cluster management
- Compare Datadog vs Prometheus
- Document everything in MkDocs
Before starting Task K, ensure you have completed:
- k3s cluster running (1 master + 2 workers), application deployed
- Datadog agents installed on all nodes (OS-level monitoring)
- `kubectl` configured with `KUBECONFIG=~/.kube/healthpulse-config`
- Application deployed in the `healthpulse-dev`, `healthpulse-uat`, and `healthpulse-prod` namespaces
Verify your cluster:
export KUBECONFIG=~/.kube/healthpulse-config
kubectl get nodes
# All 3 nodes should show Ready

┌──────────────────────────────────────────────────────────────────┐
│                           k3s Cluster                            │
│                                                                  │
│  ┌─── monitoring namespace ───────────────────────────────────┐  │
│  │                                                            │  │
│  │  ┌──────────────┐  scrapes  ┌────────────────────────────┐ │  │
│  │  │  Prometheus  │◄──every───│  kube-state-metrics        │ │  │
│  │  │  Server      │   30s     │  (pod/deploy/svc counts)   │ │  │
│  │  │              │           └────────────────────────────┘ │  │
│  │  │  Stores all  │  scrapes  ┌────────────────────────────┐ │  │
│  │  │  time-series │◄──every───│  Node Exporter (DaemonSet) │ │  │
│  │  │  metrics     │   30s     │  (CPU/mem/disk per node)   │ │  │
│  │  └──────┬───────┘           └────────────────────────────┘ │  │
│  │         │ PromQL queries                                   │  │
│  │         ▼                   ┌────────────────────────────┐ │  │
│  │  ┌──────────────┐           │  Alertmanager              │ │  │
│  │  │  Grafana     │           │  (email/Slack alerts)      │ │  │
│  │  │  Dashboards  │           └────────────────────────────┘ │  │
│  │  └──────┬───────┘                                          │  │
│  └─────────┼──────────────────────────────────────────────────┘  │
│            │ port-forward :3000                                  │
│            ▼                                                     │
│     YOUR BROWSER → localhost:3000                                │
│                                                                  │
│  ┌─── healthpulse-prod ──┐                                       │
│  │  Pod 1   │   Pod 2    │ ◄── Prometheus scrapes automatically  │
│  └──────────┴────────────┘                                       │
└──────────────────────────────────────────────────────────────────┘
The scraping model: Prometheus uses a pull model — it calls each target's /metrics HTTP endpoint every 30 seconds, collects the data, and stores it as time-series. Grafana then queries Prometheus using PromQL.
Key insight: Datadog agents PUSH metrics to the cloud. Prometheus PULLS metrics from targets. Different approach, same goal.
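To make the pull model concrete, you can read a target's `/metrics` endpoint yourself once the stack is installed later in this task. A quick sketch — the service name assumes this chart's naming with a release called `monitoring`; confirm with `kubectl get svc -n monitoring`:

```bash
# Read a target's /metrics endpoint the same way Prometheus does
kubectl port-forward -n monitoring svc/monitoring-prometheus-node-exporter 9100:9100 &
sleep 2                                      # give the port-forward a moment to establish
curl -s localhost:9100/metrics | head -20    # plain text: metric_name{labels} value
kill %1                                      # stop the background port-forward
```

Every exporter speaks this same plain-text format — that's the entire contract Prometheus requires from a scrape target.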
Before Helm, installing something like a monitoring stack on Kubernetes meant managing dozens of YAML files by hand. Consider what the Prometheus + Grafana stack actually requires:
Without Helm — you manage ALL of this manually:
├── prometheus-deployment.yml
├── prometheus-service.yml
├── prometheus-configmap.yml (scrape config, 200+ lines)
├── prometheus-rbac.yml (ClusterRole, ClusterRoleBinding, ServiceAccount)
├── prometheus-persistentvolume.yml
├── grafana-deployment.yml
├── grafana-service.yml
├── grafana-configmap.yml (datasources, dashboards)
├── grafana-secret.yml (admin password)
├── alertmanager-deployment.yml
├── alertmanager-configmap.yml
├── alertmanager-service.yml
├── node-exporter-daemonset.yml (runs on every node)
├── node-exporter-service.yml
├── kube-state-metrics-deployment.yml
├── kube-state-metrics-rbac.yml
└── ... (20+ more files)
And every time you upgrade, you diff all of them manually. Every environment (dev, UAT, prod) needs its own copy with slightly different values.
Helm solves this — all 20+ manifests become one install command, with configuration in one place.
Helm is the package manager for Kubernetes — like apt for Ubuntu or brew for Mac, but for Kubernetes applications.
apt install nginx → helm install nginx ingress-nginx/ingress-nginx
brew install postgresql → helm install postgres bitnami/postgresql
npm install react → helm install monitoring prometheus-community/kube-prometheus-stack
Three concepts to understand:
Chart — A Helm package. Contains templated Kubernetes YAML files + default configuration. Think of it like a .deb package (apt) or a formula (brew).
Release — A running instance of a chart in your cluster. You can install the same chart multiple times with different names and configs. Each installation is a separate release.
Repository — A collection of charts, hosted on a URL. Like npm registry or apt sources.
Repository                 Chart                       Release
─────────────────────      ─────────────────────       ─────────────────────
prometheus-community   →   kube-prometheus-stack   →   "monitoring" (your name)
(chart server)             (the package)               (running in cluster)
┌─────────────────────────────────────────────────────┐
│  Chart Repository (remote server)                   │
│  e.g. prometheus-community.github.io/helm-charts    │
│  ├── kube-prometheus-stack-65.1.0.tgz               │
│  ├── kube-prometheus-stack-64.0.0.tgz               │
│  └── ...                                            │
└────────────────────┬────────────────────────────────┘
                     │ helm repo add / helm install
                     ▼
┌─────────────────────────────────────────────────────┐
│  Your Machine (Helm CLI)                            │
│  ├── ~/.cache/helm/repository/  (cached charts)     │
│  └── reads KUBECONFIG → talks to cluster API        │
└────────────────────┬────────────────────────────────┘
                     │ generates + applies YAML
                     ▼
┌─────────────────────────────────────────────────────┐
│  k3s Cluster                                        │
│  └── monitoring namespace                           │
│      ├── Deployment:  grafana                       │
│      ├── Deployment:  kube-state-metrics            │
│      ├── StatefulSet: prometheus                    │
│      ├── DaemonSet:   node-exporter                 │
│      ├── Service:     monitoring-grafana            │
│      └── ... (20+ resources, all managed by Helm)   │
└─────────────────────────────────────────────────────┘
Helm also stores release history as secrets inside the cluster — which is how helm rollback works.
On the k3s master (or your laptop):
Mac:
brew install helm

Linux (including EC2 master):

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Windows:

scoop install helm
# or
choco install kubernetes-helm

Verify:

helm version
# → version.BuildInfo{Version:"v3.x.x", ...}

Helm uses your `KUBECONFIG` — whichever cluster `kubectl` talks to, Helm talks to as well. No separate config needed.
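A quick sanity check that Helm and kubectl agree on the cluster:

```bash
# Helm reads the same KUBECONFIG as kubectl — confirm it can reach the API server
export KUBECONFIG=~/.kube/healthpulse-config
helm list -A   # an empty table (not an error) means Helm can talk to the cluster
```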
Before installing anything, get familiar with the Helm CLI:
# ── Repositories ─────────────────────────────────────────────────
helm repo add <name> <url> # Add a chart repository
helm repo update # Refresh the local index from all repos
helm repo list # Show all added repositories
helm search repo <keyword> # Search charts across all repos
# ── Inspecting Charts (before installing) ────────────────────────
helm show chart <repo/chart> # Show chart metadata (name, version, description)
helm show values <repo/chart> # Show ALL configurable values (can be 1000+ lines)
helm template <name> <repo/chart> --values <file> # Preview the YAML that would be applied
# ── Installing & Managing ─────────────────────────────────────────
helm install <release> <repo/chart> [flags] # Install a chart
helm upgrade <release> <repo/chart> [flags] # Upgrade an existing release
helm rollback <release> [revision] # Rollback to a previous version
helm uninstall <release> -n <namespace> # Remove a release
# ── Inspecting Releases ──────────────────────────────────────────
helm list -n <namespace> # List all releases in a namespace
helm list -A # List releases across all namespaces
helm status <release> -n <namespace> # Show release status and notes
helm get values <release> -n <namespace> # Show values used for a release
helm get manifest <release> -n <namespace> # Show all YAML applied by a release
helm history <release> -n <namespace>     # Show upgrade/rollback history

You'll use all of these in the steps below.
Now apply what you just learned. You'll install the entire Prometheus + Grafana monitoring stack using a single Helm chart.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Verify it's there:

helm repo list

NAME                    URL
prometheus-community    https://prometheus-community.github.io/helm-charts
Search the repo to confirm the chart exists:
helm search repo prometheus-community/kube-prometheus-stack

NAME                                         CHART VERSION   APP VERSION   DESCRIPTION
prometheus-community/kube-prometheus-stack   65.1.0          v0.79.2       kube-prometheus-stack collects Kubernetes...

This tells you the chart version (65.1.0) and the app version it ships (v0.79.2 — the Prometheus Operator version, not Prometheus itself). Pin the chart version in production to prevent unexpected upgrades.
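When you reach the install step below, pinning is just one extra flag — a sketch using the version from the search output above:

```bash
# --version fixes the chart version regardless of what the repo index calls "latest"
helm install monitoring prometheus-community/kube-prometheus-stack \
  --version 65.1.0 \
  --namespace monitoring --create-namespace \
  -f kubernetes/monitoring/values.yml
```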
Always look before you install. Two commands to know:
# Show chart metadata — what it contains, dependencies, maintainers
helm show chart prometheus-community/kube-prometheus-stack

# Show all configurable values — scroll through to understand what can be changed
helm show values prometheus-community/kube-prometheus-stack | head -100

The values output is typically 1000+ lines. This is your configuration reference — every setting you might want to override is listed here with its default.
Instead of passing every option as --set flags on the command line, Helm lets you put all your configuration overrides in a values.yml file.
Open kubernetes/monitoring/values.yml:
grafana:
  adminUser: admin
  adminPassword: healthpulse123   # override the default random password
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 300m
      memory: 256Mi
  service:
    type: ClusterIP               # use port-forward, not LoadBalancer

prometheus:
  prometheusSpec:
    retention: 7d                 # keep metrics for 7 days
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi
    storageSpec: {}               # emptyDir — data lost on pod restart (fine for capstone)

alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        cpu: 100m
        memory: 128Mi

How values work:
Chart defaults (values.yaml in chart)
+
Your overrides (kubernetes/monitoring/values.yml)
=
Final configuration applied to cluster
You only need to specify what you want to change. Everything else uses the chart's defaults. This is the key advantage over managing raw YAML — you only touch what matters to you.
--set vs -f values.yml:
| Method | Use When |
|---|---|
| `--set key=value` | One or two quick overrides, testing |
| `-f values.yml` | Multiple settings, version-controlled config |
For this capstone, we use a values.yml file so your configuration is committed to Git alongside the rest of the code.
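For comparison, here is what the quick `--set` style would look like for a single override — a sketch only; this capstone uses the values file instead:

```bash
# One-off override without a values file — fine for experiments, not for Git
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword=healthpulse123
```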
Before actually installing, use helm template to render the YAML that Helm would apply. This is useful for debugging and learning:
helm template monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f kubernetes/monitoring/values.yml | head -80

This outputs all the Kubernetes YAML that will be created — Deployments, Services, ConfigMaps, RBAC — without touching your cluster. Pipe it to a file to read through it:
helm template monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f kubernetes/monitoring/values.yml > /tmp/monitoring-preview.yml
wc -l /tmp/monitoring-preview.yml   # see how many lines Helm generates for you

Teaching moment: Count the lines. The chart generates thousands of lines of YAML that you would otherwise maintain by hand. This is what Helm does for you.
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
-f kubernetes/monitoring/values.yml

Breaking down the command:
| Part | What It Does |
|---|---|
| `helm install` | Install a chart as a new release |
| `monitoring` | Release name — your chosen name for this installation. Use this in `helm upgrade`, `helm status`, `helm uninstall` |
| `prometheus-community/kube-prometheus-stack` | `<repo-name>/<chart-name>` — which chart to install |
| `--namespace monitoring` | Install all resources into this namespace |
| `--create-namespace` | Create the namespace if it doesn't exist (saves a separate `kubectl create namespace`) |
| `-f kubernetes/monitoring/values.yml` | Apply your configuration overrides |
Helm prints a summary and release notes when done — read them. They often contain the next steps (like how to access the UI).
# See the release is deployed
helm list -n monitoring

NAME         NAMESPACE    REVISION   STATUS     CHART                          APP VERSION
monitoring   monitoring   1          deployed   kube-prometheus-stack-65.1.0   v0.79.2

# Show status and the release notes
helm status monitoring -n monitoring

# Show what values are actually in use (your overrides merged with chart defaults)
helm get values monitoring -n monitoring

# Show the full YAML that was applied to the cluster
helm get manifest monitoring -n monitoring | head -50

# Show release history (useful later when you upgrade)
helm history monitoring -n monitoring

REVISION   STATUS     CHART                          DESCRIPTION
1          deployed   kube-prometheus-stack-65.1.0   Install complete
kubectl get pods -n monitoring

Wait 2–3 minutes. All pods should show Running:
NAME                                                      READY   STATUS    AGE
alertmanager-monitoring-kube-prometheus-alertmanager-0    2/2     Running   2m
monitoring-grafana-7f8c9d6b4-xxxxx                        3/3     Running   2m
monitoring-kube-prometheus-operator-6b4c9f8d7-xxxxx       1/1     Running   2m
monitoring-kube-state-metrics-5c6d8f9b7-xxxxx             1/1     Running   2m
monitoring-prometheus-node-exporter-xxxxx                 1/1     Running   2m
monitoring-prometheus-node-exporter-xxxxx                 1/1     Running   2m
monitoring-prometheus-node-exporter-xxxxx                 1/1     Running   2m
prometheus-monitoring-kube-prometheus-prometheus-0        2/2     Running   2m
| Component | Kind | What It Does |
|---|---|---|
| Prometheus | StatefulSet | Scrapes and stores all metrics |
| Grafana | Deployment | Dashboard UI for visualizing metrics |
| Node Exporter | DaemonSet | Exposes OS metrics from each node (one pod per node — that's why you see 3) |
| kube-state-metrics | Deployment | Exposes Kubernetes object metrics (pod count, deploy status) |
| Alertmanager | StatefulSet | Routes alerts to email, Slack, PagerDuty |
| Prometheus Operator | Deployment | Manages Prometheus config via CRDs — the "brain" of the stack |
Notice the 3 node-exporter pods. Node Exporter is a DaemonSet — Kubernetes automatically runs one copy on every node. Helm configured this for you.
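You can confirm the one-pod-per-node behavior directly (output shape approximate):

```bash
# A DaemonSet's DESIRED count always equals the number of schedulable nodes
kubectl get daemonset -n monitoring

# NAME                                  DESIRED   CURRENT   READY   ...
# monitoring-prometheus-node-exporter   3         3         3
```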
kubectl get svc -n monitoring

Checkpoint: All pods Running and services created. One command installed 8 pods, 10+ services, ConfigMaps, RBAC, and CRDs. That is what Helm does.
Later, if you want to change the Grafana password or increase Prometheus retention, edit kubernetes/monitoring/values.yml and run:
helm upgrade monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f kubernetes/monitoring/values.yml

Helm applies only the diff — it doesn't reinstall everything. The revision number increments:

helm history monitoring -n monitoring

REVISION   STATUS       CHART                          DESCRIPTION
1          superseded   kube-prometheus-stack-65.1.0   Install complete
2          deployed     kube-prometheus-stack-65.1.0   Upgrade complete
If an upgrade breaks something, rollback to the previous revision:
# Rollback to revision 1
helm rollback monitoring 1 -n monitoring
# Verify
helm history monitoring -n monitoring

REVISION   STATUS       DESCRIPTION
1          superseded   Install complete
2          superseded   Upgrade complete
3          deployed     Rollback to 1
This is possible because Helm stores release history as Kubernetes secrets. Each revision is saved — Helm can reconstruct any previous state. Compare this to manually applying manifests with `kubectl apply` — there is no rollback history at all.
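You can see this bookkeeping yourself — Helm 3 stores one secret per revision, labeled `owner=helm` (output shape approximate):

```bash
# Each release revision is an ordinary Kubernetes secret of type helm.sh/release.v1
kubectl get secrets -n monitoring -l owner=helm

# NAME                               TYPE                 DATA   AGE
# sh.helm.release.v1.monitoring.v1   helm.sh/release.v1   1      20m
# sh.helm.release.v1.monitoring.v2   helm.sh/release.v1   1      5m
```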
helm uninstall monitoring -n monitoring
kubectl delete namespace monitoring
# Wait for namespace to terminate
kubectl get namespace monitoring # keep running until it disappears
# Then reinstall
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
-f kubernetes/monitoring/values.yml

4.1.1 — Option A: Traefik Ingress (Recommended — DNS required)
Grafana gets its own subdomain routed through Traefik, exactly like your application environments. No port-forward needed, no open terminal.
The wildcard DNS record (*.team-healthpulse.com) already covers grafana.team-healthpulse.com — no Terraform changes needed.
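If you still need to write kubernetes/ingress-grafana.yml, a minimal sketch is below. The hostname and ingress name match the verification output that follows; the backend service name and port assume this chart's defaults (check with `kubectl get svc -n monitoring`):

```yaml
# kubernetes/ingress-grafana.yml (sketch)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
spec:
  ingressClassName: traefik
  rules:
    - host: grafana.team-healthpulse.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: monitoring-grafana   # the Grafana Service created by the chart
                port:
                  number: 80
```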
Apply the Grafana ingress:
kubectl apply -f kubernetes/ingress-grafana.yml
# Verify Traefik picked it up
kubectl get ingress -n monitoring

NAME              CLASS     HOSTS                          ADDRESS     PORTS   AGE
grafana-ingress   traefik   grafana.team-healthpulse.com   10.43.0.1   80      10s
Open http://grafana.team-healthpulse.com in your browser.
Why this works without DNS changes: The Terraform DNS config includes a wildcard A record (`*.team-healthpulse.com` → k3s master EIP). Any subdomain not explicitly defined — including `grafana` — automatically resolves to the k3s master, where Traefik picks it up and routes it based on the Ingress hostname rule.
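You can verify the wildcard resolution from your laptop (domain from this task):

```bash
# The wildcard A record should answer for any subdomain, including grafana
dig +short grafana.team-healthpulse.com
# → the k3s master's Elastic IP
```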
4.1.2 — Option B: Port-Forward
# Local machine:
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80

# On the EC2 master (bind to all interfaces so you can reach it from your laptop):
kubectl port-forward -n monitoring --address 0.0.0.0 svc/monitoring-grafana 3000:80
Keep this terminal running. Open a new terminal for other commands.
- Open browser: http://localhost:3000
- Username: `admin` / Password: `healthpulse123`
| Area | Where | What It Shows |
|---|---|---|
| Dashboards | Left sidebar → Dashboards | Browse all pre-built and custom dashboards |
| Explore | Left sidebar → Explore | Free-form PromQL query editor |
| Alerting | Left sidebar → Alerting | Alert rules, notification channels |
| Data Sources | Settings → Data Sources | Prometheus (pre-configured) |
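A good first query in Explore is `up` — Prometheus exports one series per scrape target, with value 1 when the target is reachable. The job label value below is an assumption; use the Metrics browser to see yours:

```promql
up                        # one series per target: 1 = reachable, 0 = down
up{job="node-exporter"}   # narrow to a single job with a label matcher
```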
The kube-prometheus-stack ships with 20+ production-grade dashboards.
- Click Dashboards → Browse
- Look for dashboards starting with `Kubernetes /` and `Node Exporter /`
Select a namespace from the dropdown to see:
| Panel | What It Shows |
|---|---|
| CPU Usage | CPU consumed by each pod — are pods CPU-starved? |
| CPU Quota | CPU requested vs actual — over-provisioning or under-provisioning? |
| Memory Usage | Memory per pod — approaching limits? (OOMKill risk) |
| Network I/O | Bytes sent/received per pod |
Shows each EC2 instance's resources — compare with Datadog:
| Panel | Datadog Equivalent |
|---|---|
| CPU Busy | system.cpu.user |
| Memory Usage | system.mem.used |
| Disk Space | system.disk.in_use |
| Network Traffic | system.net.bytes_rcvd |
Key insight: The numbers from Node Exporter and Datadog should be very close. Cross-referencing validates both monitoring systems.
Shows receive/transmit bandwidth and packet rates per pod.
- Open Kubernetes / Compute Resources / Namespace (Pods) dashboard
- Select namespace: `healthpulse-prod`
- Observe CPU and memory for every production pod
Terminal 1: Keep Grafana open on the namespace dashboard
Terminal 2: Deploy a new version:
kubectl set image deployment/healthpulse-portal \
healthpulse-portal=<ARTIFACTORY_URL>/healthpulse-portal:2.0.0 \
-n healthpulse-prod

Watch Grafana — old pod lines end, new pod lines appear, with a brief overlap (rolling update).
# Generate load
kubectl run load-test --image=busybox -n healthpulse-prod --restart=Never -- \
/bin/sh -c "while true; do wget -q -O- http://healthpulse-service/health; done"
# In another terminal — watch HPA scale
kubectl get hpa -n healthpulse-prod -w
# Clean up when done
kubectl delete pod load-test -n healthpulse-prod

Watch Grafana: CPU climbs, new pod lines appear as HPA scales up.
- Click Dashboards → New → New Dashboard
- Rename to: HealthPulse — Kubernetes Overview
- PromQL: `count(kube_pod_info) by (namespace)`
- Type: Bar gauge or Stat
- Counts pods in each namespace using kube-state-metrics data.
- PromQL: `rate(container_cpu_usage_seconds_total{namespace="healthpulse-prod", container!="", container!="POD"}[5m])`
- Type: Time series
- `rate()` converts a cumulative counter into per-second usage over 5 minutes. The `container!="POD"` filter excludes Kubernetes pause containers.
- PromQL: `container_memory_usage_bytes{namespace="healthpulse-prod", container!="", container!="POD"}`
- Type: Time series
- Unit: bytes (Standard options → Unit → Data → bytes)
- Memory is a gauge (goes up and down) — no `rate()` needed.
- PromQL: `kube_pod_container_status_restarts_total{namespace=~"healthpulse-.*"}`
- Type: Stat or Table
- High restart count = CrashLoopBackOff. Healthy pods show 0.
- PromQL: `kube_horizontalpodautoscaler_status_current_replicas{namespace="healthpulse-prod"}`
- Type: Time series
- Optionally add a second query: `kube_horizontalpodautoscaler_status_desired_replicas{namespace="healthpulse-prod"}`
- When these lines diverge, the cluster is actively scaling.
Save the dashboard (disk icon at top).
- Alerting → Alert rules → New alert rule
- Name: `Pod CrashLooping — HealthPulse`
- PromQL: `increase(kube_pod_container_status_restarts_total{namespace=~"healthpulse-.*"}[5m]) > 3`
- Evaluate every: `1m` | For: `5m`
- Label: `severity: warning`
Fires when any HealthPulse pod restarts more than 3 times in 5 minutes.
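To watch this alert fire without breaking anything real, you can run a deliberately crash-looping pod in a dev namespace — a sketch; the pod name is arbitrary:

```bash
# A container that exits immediately; restartPolicy Always makes the kubelet
# restart it, so RESTARTS climbs and the increase(...) > 3 condition is met
kubectl run crashtest -n healthpulse-dev --image=busybox --restart=Always -- /bin/sh -c "exit 1"
kubectl get pod crashtest -n healthpulse-dev -w   # watch it reach CrashLoopBackOff

# Clean up once the alert has fired
kubectl delete pod crashtest -n healthpulse-dev
```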
- Name: `High CPU — HealthPulse Prod`
- PromQL: `(sum(rate(container_cpu_usage_seconds_total{namespace="healthpulse-prod", container!="", container!="POD"}[5m])) by (pod)) > 0.8`
- Evaluate every: `1m` | For: `5m`
- Label: `severity: critical`
- Alerting → Contact points → New contact point
- Name: `HealthPulse Notifications`
- Type: Email (enter your address) or Slack (enter webhook URL, channel `#healthpulse-alerts`)
- Notification policies → edit default policy to use your contact point
k9s is a terminal-based Kubernetes UI — like htop for your cluster.
- Mac: `brew install derailed/k9s/k9s`
- Linux: `curl -sS https://webinstall.dev/k9s | bash`
- Windows (Scoop): `scoop install k9s`
KUBECONFIG=~/.kube/healthpulse-config k9s

Type `:` to enter command mode:
| Command | View |
|---|---|
| `:pods` | All pods |
| `:deploy` | Deployments |
| `:svc` | Services |
| `:ns` | Namespaces |
| `:nodes` | Cluster nodes |
| `:hpa` | Horizontal Pod Autoscalers |
| `:events` | Recent cluster events |
| Key | Action |
|---|---|
| `l` | View logs |
| `s` | Shell into pod |
| `d` | Describe resource |
| `y` | View YAML |
| `/` | Filter/search |
| `Enter` | Drill into resource |
| `Esc` | Go back |
| `:q` | Quit |
| Metric | k9s | Grafana |
|---|---|---|
| Pod count | :pods — count rows | Pod Count panel |
| CPU | :pods — CPU column | CPU Usage panel |
| Restarts | :pods — RESTARTS column | Restart Count panel |
| HPA replicas | :hpa — REPLICAS column | HPA panel |
When to use which: k9s for real-time interactive checks ("What's happening NOW?"). Grafana for trends and history ("What happened over the last 6 hours?").
| Aspect | Datadog (Task D) | Prometheus + Grafana (Task K) |
|---|---|---|
| Scope | Infrastructure/OS metrics | Kubernetes-native metrics |
| Where it runs | Agent on each EC2 | Pods inside k3s cluster |
| Installed via | Ansible | Helm |
| Data storage | Datadog cloud (SaaS) | In-cluster (Prometheus pod) |
| Dashboards | Datadog web UI | Grafana (self-hosted) |
| Query language | Datadog queries | PromQL |
| Alerting | Datadog Monitors | Alertmanager + Grafana |
| Cost | Free for 5 hosts, paid beyond | Free and open source |
| Industry | Enterprise (Netflix, Airbnb) | Cloud-native standard (CNCF) |
When to use which (they complement each other):
| Scenario | Best Tool |
|---|---|
| EC2 running out of disk? | Datadog |
| Pods keep restarting? | Prometheus |
| HPA replica count? | Prometheus |
| Nginx healthy on bare-metal? | Datadog |
| Server went down? | Datadog |
| Pod crash-loop alert? | Prometheus/Grafana |
The real-world pattern:
Layer 1: Infrastructure monitoring (Datadog) → "Are my machines healthy?"
Layer 2: Kubernetes monitoring (Prometheus) → "Are my applications healthy?"
Layer 3: Application monitoring (APM) → "Are my users happy?"
You've now built Layers 1 and 2.
Add a monitoring section to docs/architecture.md describing the two-layer monitoring approach and Prometheus components.
Create docs/adr/007-monitoring-tools.md:
- Decision: Datadog for infrastructure, Prometheus + Grafana for Kubernetes, k9s for interactive management
- Rationale: Datadog watches machines, Prometheus watches workloads, they complement each other
- Consequences: Two systems to maintain, Prometheus retention limited by in-cluster storage
Add screenshots of:
- Node Exporter / Nodes dashboard
- Namespace (Pods) dashboard for `healthpulse-prod`
- Your custom HealthPulse dashboard
- Helm installed (`helm version`)
- kube-prometheus-stack deployed (`kubectl get pods -n monitoring` — all Running)
- Grafana accessible at localhost:3000 via port-forward
- Pre-built dashboards explored (Node Exporter, Namespace Pods, Networking)
- Custom dashboard created with 5 panels (pod count, CPU, memory, restarts, HPA)
- At least one alert rule configured (crash-loop or high CPU)
- k9s installed and can navigate cluster resources
- Datadog vs Prometheus comparison understood and documented
Be prepared to:
- Show `kubectl get pods -n monitoring` and explain each component
- Open Grafana and navigate pre-built dashboards — explain the metrics
- Show your custom dashboard and explain each PromQL query
- Demo k9s — navigate pods, view logs, describe a resource
- Explain: Why both Datadog AND Prometheus? What does each monitor?
- Cross-reference: Show the same metric in both Datadog and Grafana
kubectl describe pod <POD_NAME> -n monitoring
# Check the Events section

- Insufficient resources: Monitoring needs CPU/memory. On t3.small instances, the cluster may be tight — check with:

kubectl top nodes

Fix: Scale down dev replicas temporarily or use t3.medium instances.

- PVC pending: Check `kubectl get pvc -n monitoring`. The k3s local-path-provisioner should auto-bind.
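If a PVC does stay Pending, these commands usually show why:

```bash
kubectl get pvc -n monitoring                    # which claim is stuck
kubectl get storageclass                         # k3s ships "local-path" as the default
kubectl describe pvc <PVC_NAME> -n monitoring    # Events explain what's blocking binding
```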
# Verify Prometheus is running and scraping
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/targets — all targets should show UP

Port-forward died? Just re-run:

kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80

| Error | Fix |
|---|---|
| `parse error` | PromQL is case-sensitive — check typos |
| `unknown metric` | Use Explore → Metrics browser to find correct names |
| `no data` | Label filter doesn't match — start broad, add filters gradually |
| `rate() requires counter` | Remove `rate()` — you're using it on a gauge metric |
# Check the release status
helm status monitoring -n monitoring
# Check what revision it's on and if it errored
helm history monitoring -n monitoring
# If stuck in a failed state, uninstall and retry:
helm uninstall monitoring -n monitoring
kubectl delete namespace monitoring
# Wait for namespace to fully terminate (can take 30-60 seconds)
kubectl get namespace monitoring # run until it disappears
# Reinstall
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
-f kubernetes/monitoring/values.yml

# List all Kubernetes resources created by this release
helm get manifest monitoring -n monitoring
# See which values are active (your overrides merged with defaults)
helm get values monitoring -n monitoring --all

# Test kubectl first
kubectl get nodes
# If kubectl works but k9s doesn't:
k9s --kubeconfig ~/.kube/healthpulse-config

| Concept | What It Means |
|---|---|
| Prometheus | Open-source monitoring that scrapes metrics from targets and stores them as time-series. CNCF graduated project. |
| PromQL | Prometheus Query Language. Like SQL but for time-series. Examples: rate(...), count(...) by (label) |
| Time series | Data points indexed by timestamp. Example: CPU at 10:00=45%, 10:01=47%. Every Prometheus metric is a time series. |
| Scraping | Prometheus PULLS data from targets by calling /metrics endpoints at regular intervals. |
| Exporter | Component that exposes metrics in Prometheus format. Node Exporter = OS metrics. kube-state-metrics = K8s objects. |
| Grafana | Visualization platform. Connects to Prometheus and renders dashboards with graphs, tables, alerts. |
| Helm | Package manager for Kubernetes. Like apt or brew but for Kubernetes apps. One command installs dozens of coordinated manifests. |
| Helm chart | A package of templated Kubernetes manifests + default values. kube-prometheus-stack = Prometheus + Grafana + exporters + RBAC in one package. |
| Helm release | A running instance of a chart in your cluster. Named by you at install time (monitoring). You can install the same chart multiple times with different names. |
| Helm repository | A server hosting a collection of charts. Like npm registry or apt sources. Add with helm repo add. |
| values.yml | Your configuration overrides for a chart. Merged with the chart's defaults at install time. Only specify what you want to change. |
| helm template | Renders the YAML Helm would apply — without touching the cluster. Use for previewing and debugging. |
| helm upgrade | Apply changed values or a newer chart version to an existing release. Helm diffs and applies only what changed. |
| helm rollback | Restore a release to a previous revision. Works because Helm stores release history as cluster secrets. |
| Alertmanager | Receives alerts from Prometheus, deduplicates them, routes to email/Slack/PagerDuty. |
| k9s | Terminal-based Kubernetes UI. Real-time interactive view without typing kubectl commands. |
| Counter vs Gauge | Counter only goes up (total restarts) — use rate(). Gauge goes up and down (current memory) — read directly. |