In this task, you add Kubernetes-native monitoring to your k3s cluster using Prometheus (metrics collection), Grafana (visualization), and k9s (terminal UI). This complements the Datadog infrastructure monitoring from Task D.
Why both Datadog AND Prometheus?
- Datadog (Task D): Monitors the operating system — CPU, memory, disk, network. Installed via Ansible. Runs as a system service. Data goes to Datadog's cloud (SaaS).
- Prometheus (Task K): Monitors Kubernetes objects — pods, deployments, containers, HPA. Runs INSIDE the cluster as pods. Data stays in-cluster.
Think of it this way: Datadog is the building security camera (watches the physical servers). Prometheus is the restaurant manager (watches the kitchen staff — your pods).
What you'll do:
- Understand the monitoring architecture
- Learn Helm — charts, releases, repositories, values, upgrade, rollback
- Install kube-prometheus-stack via Helm (Helm as the real-world use case)
- Access Grafana and explore pre-built dashboards
- Monitor HealthPulse pods and observe scaling events
- Create a custom HealthPulse dashboard with PromQL
- Set up alerts in Grafana
- Use k9s alongside Grafana for cluster management
- Compare Datadog vs Prometheus
- Document everything in MkDocs
Before starting Task K, ensure you have completed:
- k3s cluster running (1 master + 2 workers), application deployed
- Datadog agents installed on all nodes (OS-level monitoring)
- `kubectl` configured with `KUBECONFIG=~/.kube/healthpulse-config`
- Application deployed in the `healthpulse-dev`, `healthpulse-uat`, and `healthpulse-prod` namespaces
Verify your cluster:
export KUBECONFIG=~/.kube/healthpulse-config
kubectl get nodes
# All 3 nodes should show Ready

┌──────────────────────────────────────────────────────────────────┐
│                           k3s Cluster                            │
│                                                                  │
│  ┌─── monitoring namespace ───────────────────────────────────┐  │
│  │                                                            │  │
│  │  ┌──────────────┐  scrapes  ┌────────────────────────────┐ │  │
│  │  │  Prometheus  │◄──every───│  kube-state-metrics        │ │  │
│  │  │  Server      │   30s     │  (pod/deploy/svc counts)   │ │  │
│  │  │              │           └────────────────────────────┘ │  │
│  │  │  Stores all  │  scrapes  ┌────────────────────────────┐ │  │
│  │  │  time-series │◄──every───│  Node Exporter (DaemonSet) │ │  │
│  │  │  metrics     │   30s     │  (CPU/mem/disk per node)   │ │  │
│  │  └──────┬───────┘           └────────────────────────────┘ │  │
│  │         │ PromQL queries                                   │  │
│  │         ▼                   ┌────────────────────────────┐ │  │
│  │  ┌──────────────┐           │  Alertmanager              │ │  │
│  │  │  Grafana     │           │  (email/Slack alerts)      │ │  │
│  │  │  Dashboards  │           └────────────────────────────┘ │  │
│  │  └──────┬───────┘                                          │  │
│  └─────────┼──────────────────────────────────────────────────┘  │
│            │ port-forward :3000                                  │
│            ▼                                                     │
│     YOUR BROWSER → localhost:3000                                │
│                                                                  │
│  ┌─── healthpulse-prod ──┐                                       │
│  │  Pod 1   │   Pod 2    │ ◄── Prometheus scrapes automatically  │
│  └──────────┴────────────┘                                       │
└──────────────────────────────────────────────────────────────────┘
The scraping model: Prometheus uses a pull model — it calls each target's /metrics HTTP endpoint every 30 seconds, collects the data, and stores it as time-series. Grafana then queries Prometheus using PromQL.
Key insight: Datadog agents PUSH metrics to the cloud. Prometheus PULLS metrics from targets. Different approach, same goal.
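To make the pull model concrete, you can read a target's `/metrics` endpoint yourself once the stack is installed later in this task. A quick sketch — the service name assumes this chart's naming with a release called `monitoring`; confirm with `kubectl get svc -n monitoring`:

```bash
# Read a target's /metrics endpoint the same way Prometheus does
kubectl port-forward -n monitoring svc/monitoring-prometheus-node-exporter 9100:9100 &
sleep 2                                      # give the port-forward a moment to establish
curl -s localhost:9100/metrics | head -20    # plain text: metric_name{labels} value
kill %1                                      # stop the background port-forward
```

Every exporter speaks this same plain-text format — that's the entire contract Prometheus requires from a scrape target.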
Before Helm, installing something like a monitoring stack on Kubernetes meant managing dozens of YAML files by hand. Consider what the Prometheus + Grafana stack actually requires:
Without Helm — you manage ALL of this manually:
├── prometheus-deployment.yml
├── prometheus-service.yml
├── prometheus-configmap.yml (scrape config, 200+ lines)
├── prometheus-rbac.yml (ClusterRole, ClusterRoleBinding, ServiceAccount)
├── prometheus-persistentvolume.yml
├── grafana-deployment.yml
├── grafana-service.yml
├── grafana-configmap.yml (datasources, dashboards)
├── grafana-secret.yml (admin password)
├── alertmanager-deployment.yml
├── alertmanager-configmap.yml
├── alertmanager-service.yml
├── node-exporter-daemonset.yml (runs on every node)
├── node-exporter-service.yml
├── kube-state-metrics-deployment.yml
├── kube-state-metrics-rbac.yml
└── ... (20+ more files)
And every time you upgrade, you diff all of them manually. Every environment (dev, UAT, prod) needs its own copy with slightly different values.
Helm solves this — all 20+ manifests become one install command, with configuration in one place.
Helm is the package manager for Kubernetes — like apt for Ubuntu or brew for Mac, but for Kubernetes applications.
apt install nginx → helm install nginx ingress-nginx/ingress-nginx
brew install postgresql → helm install postgres bitnami/postgresql
npm install react → helm install monitoring prometheus-community/kube-prometheus-stack
Three concepts to understand:
Chart — A Helm package. Contains templated Kubernetes YAML files + default configuration. Think of it like a .deb package (apt) or a formula (brew).
Release — A running instance of a chart in your cluster. You can install the same chart multiple times with different names and configs. Each installation is a separate release.
Repository — A collection of charts, hosted on a URL. Like npm registry or apt sources.
Repository                 Chart                       Release
─────────────────────      ─────────────────────       ─────────────────────
prometheus-community   →   kube-prometheus-stack   →   "monitoring" (your name)
(chart server)             (the package)               (running in cluster)
┌─────────────────────────────────────────────────────┐
│  Chart Repository (remote server)                   │
│  e.g. prometheus-community.github.io/helm-charts    │
│  ├── kube-prometheus-stack-65.1.0.tgz               │
│  ├── kube-prometheus-stack-64.0.0.tgz               │
│  └── ...                                            │
└────────────────────┬────────────────────────────────┘
                     │ helm repo add / helm install
                     ▼
┌─────────────────────────────────────────────────────┐
│  Your Machine (Helm CLI)                            │
│  ├── ~/.cache/helm/repository/  (cached charts)     │
│  └── reads KUBECONFIG → talks to cluster API        │
└────────────────────┬────────────────────────────────┘
                     │ generates + applies YAML
                     ▼
┌─────────────────────────────────────────────────────┐
│  k3s Cluster                                        │
│  └── monitoring namespace                           │
│      ├── Deployment:  grafana                       │
│      ├── Deployment:  kube-state-metrics            │
│      ├── StatefulSet: prometheus                    │
│      ├── DaemonSet:   node-exporter                 │
│      ├── Service:     monitoring-grafana            │
│      └── ... (20+ resources, all managed by Helm)   │
└─────────────────────────────────────────────────────┘
Helm also stores release history as secrets inside the cluster — which is how helm rollback works.
On the k3s master (or your laptop):
Mac:
brew install helm

Linux (including EC2 master):

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Windows:

scoop install helm
# or
choco install kubernetes-helm

Verify:

helm version
# → version.BuildInfo{Version:"v3.x.x", ...}

Helm uses your `KUBECONFIG` — whichever cluster `kubectl` talks to, Helm talks to as well. No separate config needed.
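A quick sanity check that Helm and kubectl agree on the cluster:

```bash
# Helm reads the same KUBECONFIG as kubectl — confirm it can reach the API server
export KUBECONFIG=~/.kube/healthpulse-config
helm list -A   # an empty table (not an error) means Helm can talk to the cluster
```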
Before installing anything, get familiar with the Helm CLI:
# ── Repositories ─────────────────────────────────────────────────
helm repo add <name> <url> # Add a chart repository
helm repo update # Refresh the local index from all repos
helm repo list # Show all added repositories
helm search repo <keyword> # Search charts across all repos
# ── Inspecting Charts (before installing) ────────────────────────
helm show chart <repo/chart> # Show chart metadata (name, version, description)
helm show values <repo/chart> # Show ALL configurable values (can be 1000+ lines)
helm template <name> <repo/chart> --values <file> # Preview the YAML that would be applied
# ── Installing & Managing ─────────────────────────────────────────
helm install <release> <repo/chart> [flags] # Install a chart
helm upgrade <release> <repo/chart> [flags] # Upgrade an existing release
helm rollback <release> [revision] # Rollback to a previous version
helm uninstall <release> -n <namespace> # Remove a release
# ── Inspecting Releases ──────────────────────────────────────────
helm list -n <namespace> # List all releases in a namespace
helm list -A # List releases across all namespaces
helm status <release> -n <namespace> # Show release status and notes
helm get values <release> -n <namespace> # Show values used for a release
helm get manifest <release> -n <namespace> # Show all YAML applied by a release
helm history <release> -n <namespace>     # Show upgrade/rollback history

You'll use all of these in the steps below.
Now apply what you just learned. You'll install the entire Prometheus + Grafana monitoring stack using a single Helm chart.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Verify it's there:

helm repo list

NAME                    URL
prometheus-community    https://prometheus-community.github.io/helm-charts
Search the repo to confirm the chart exists:
helm search repo prometheus-community/kube-prometheus-stack

NAME                                         CHART VERSION   APP VERSION   DESCRIPTION
prometheus-community/kube-prometheus-stack   65.1.0          v0.79.2       kube-prometheus-stack collects Kubernetes...

This tells you the chart version (65.1.0) and the app version it ships (v0.79.2 — the Prometheus Operator version, not Prometheus itself). Pin the chart version in production to prevent unexpected upgrades.
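When you reach the install step below, pinning is just one extra flag — a sketch using the version from the search output above:

```bash
# --version fixes the chart version regardless of what the repo index calls "latest"
helm install monitoring prometheus-community/kube-prometheus-stack \
  --version 65.1.0 \
  --namespace monitoring --create-namespace \
  -f kubernetes/monitoring/values.yml
```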
Always look before you install. Two commands to know:
# Show chart metadata — what it contains, dependencies, maintainers
helm show chart prometheus-community/kube-prometheus-stack

# Show all configurable values — scroll through to understand what can be changed
helm show values prometheus-community/kube-prometheus-stack | head -100

The values output is typically 1000+ lines. This is your configuration reference — every setting you might want to override is listed here with its default.
Instead of passing every option as --set flags on the command line, Helm lets you put all your configuration overrides in a values.yml file.
Open kubernetes/monitoring/values.yml:
grafana:
  adminUser: admin
  adminPassword: healthpulse123   # override the default random password
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 300m
      memory: 256Mi
  service:
    type: ClusterIP               # use port-forward, not LoadBalancer

prometheus:
  prometheusSpec:
    retention: 7d                 # keep metrics for 7 days
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi
    storageSpec: {}               # emptyDir — data lost on pod restart (fine for capstone)

alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        cpu: 100m
        memory: 128Mi

How values work:
Chart defaults (values.yaml in chart)
+
Your overrides (kubernetes/monitoring/values.yml)
=
Final configuration applied to cluster
You only need to specify what you want to change. Everything else uses the chart's defaults. This is the key advantage over managing raw YAML — you only touch what matters to you.
--set vs -f values.yml:
| Method | Use When |
|---|---|
| `--set key=value` | One or two quick overrides, testing |
| `-f values.yml` | Multiple settings, version-controlled config |
For this capstone, we use a values.yml file so your configuration is committed to Git alongside the rest of the code.
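For comparison, here is what the quick `--set` style would look like for a single override — a sketch only; this capstone uses the values file instead:

```bash
# One-off override without a values file — fine for experiments, not for Git
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword=healthpulse123
```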
Before actually installing, use helm template to render the YAML that Helm would apply. This is useful for debugging and learning:
helm template monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f kubernetes/monitoring/values.yml | head -80

This outputs all the Kubernetes YAML that will be created — Deployments, Services, ConfigMaps, RBAC — without touching your cluster. Pipe it to a file to read through it:
helm template monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f kubernetes/monitoring/values.yml > /tmp/monitoring-preview.yml
wc -l /tmp/monitoring-preview.yml   # see how many lines Helm generates for you

Teaching moment: Count the lines. The chart generates thousands of lines of YAML that you would otherwise maintain by hand. This is what Helm does for you.
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
-f kubernetes/monitoring/values.yml

Breaking down the command:
| Part | What It Does |
|---|---|
| `helm install` | Install a chart as a new release |
| `monitoring` | Release name — your chosen name for this installation. Use this in `helm upgrade`, `helm status`, `helm uninstall` |
| `prometheus-community/kube-prometheus-stack` | `<repo-name>/<chart-name>` — which chart to install |
| `--namespace monitoring` | Install all resources into this namespace |
| `--create-namespace` | Create the namespace if it doesn't exist (saves a separate `kubectl create namespace`) |
| `-f kubernetes/monitoring/values.yml` | Apply your configuration overrides |
Helm prints a summary and release notes when done — read them. They often contain the next steps (like how to access the UI).
# See the release is deployed
helm list -n monitoring

NAME         NAMESPACE    REVISION   STATUS     CHART                          APP VERSION
monitoring   monitoring   1          deployed   kube-prometheus-stack-65.1.0   v0.79.2

# Show status and the release notes
helm status monitoring -n monitoring

# Show what values are actually in use (your overrides merged with chart defaults)
helm get values monitoring -n monitoring

# Show the full YAML that was applied to the cluster
helm get manifest monitoring -n monitoring | head -50

# Show release history (useful later when you upgrade)
helm history monitoring -n monitoring

REVISION   STATUS     CHART                          DESCRIPTION
1          deployed   kube-prometheus-stack-65.1.0   Install complete
kubectl get pods -n monitoring

Wait 2–3 minutes. All pods should show Running:
NAME                                                      READY   STATUS    AGE
alertmanager-monitoring-kube-prometheus-alertmanager-0    2/2     Running   2m
monitoring-grafana-7f8c9d6b4-xxxxx                        3/3     Running   2m
monitoring-kube-prometheus-operator-6b4c9f8d7-xxxxx       1/1     Running   2m
monitoring-kube-state-metrics-5c6d8f9b7-xxxxx             1/1     Running   2m
monitoring-prometheus-node-exporter-xxxxx                 1/1     Running   2m
monitoring-prometheus-node-exporter-xxxxx                 1/1     Running   2m
monitoring-prometheus-node-exporter-xxxxx                 1/1     Running   2m
prometheus-monitoring-kube-prometheus-prometheus-0        2/2     Running   2m
| Component | Kind | What It Does |
|---|---|---|
| Prometheus | StatefulSet | Scrapes and stores all metrics |
| Grafana | Deployment | Dashboard UI for visualizing metrics |
| Node Exporter | DaemonSet | Exposes OS metrics from each node (one pod per node — that's why you see 3) |
| kube-state-metrics | Deployment | Exposes Kubernetes object metrics (pod count, deploy status) |
| Alertmanager | StatefulSet | Routes alerts to email, Slack, PagerDuty |
| Prometheus Operator | Deployment | Manages Prometheus config via CRDs — the "brain" of the stack |
Notice the 3 node-exporter pods. Node Exporter is a DaemonSet — Kubernetes automatically runs one copy on every node. Helm configured this for you.
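You can confirm the one-pod-per-node behavior directly (output shape approximate):

```bash
# A DaemonSet's DESIRED count always equals the number of schedulable nodes
kubectl get daemonset -n monitoring

# NAME                                  DESIRED   CURRENT   READY   ...
# monitoring-prometheus-node-exporter   3         3         3
```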
kubectl get svc -n monitoring

Checkpoint: All pods Running and services created. One command installed 8 pods, 10+ services, ConfigMaps, RBAC, and CRDs. That is what Helm does.
Later, if you want to change the Grafana password or increase Prometheus retention, edit kubernetes/monitoring/values.yml and run:
helm upgrade monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f kubernetes/monitoring/values.yml

Helm applies only the diff — it doesn't reinstall everything. The revision number increments:

helm history monitoring -n monitoring

REVISION   STATUS       CHART                          DESCRIPTION
1          superseded   kube-prometheus-stack-65.1.0   Install complete
2          deployed     kube-prometheus-stack-65.1.0   Upgrade complete
If an upgrade breaks something, rollback to the previous revision:
# Rollback to revision 1
helm rollback monitoring 1 -n monitoring
# Verify
helm history monitoring -n monitoring

REVISION   STATUS       DESCRIPTION
1          superseded   Install complete
2          superseded   Upgrade complete
3          deployed     Rollback to 1
This is possible because Helm stores release history as Kubernetes secrets. Each revision is saved — Helm can reconstruct any previous state. Compare this to manually applying manifests with `kubectl apply` — there is no rollback history at all.
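You can see this bookkeeping yourself — Helm 3 stores one secret per revision, labeled `owner=helm` (output shape approximate):

```bash
# Each release revision is an ordinary Kubernetes secret of type helm.sh/release.v1
kubectl get secrets -n monitoring -l owner=helm

# NAME                               TYPE                 DATA   AGE
# sh.helm.release.v1.monitoring.v1   helm.sh/release.v1   1      20m
# sh.helm.release.v1.monitoring.v2   helm.sh/release.v1   1      5m
```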
helm uninstall monitoring -n monitoring
kubectl delete namespace monitoring
# Wait for namespace to terminate
kubectl get namespace monitoring # keep running until it disappears
# Then reinstall
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
-f kubernetes/monitoring/values.yml

4.1.1 — Option A: Traefik Ingress (Recommended — DNS required)
Grafana gets its own subdomain routed through Traefik, exactly like your application environments. No port-forward needed, no open terminal.
The wildcard DNS record (*.team-healthpulse.com) already covers grafana.team-healthpulse.com — no Terraform changes needed.
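If you still need to write kubernetes/ingress-grafana.yml, a minimal sketch is below. The hostname and ingress name match the verification output that follows; the backend service name and port assume this chart's defaults (check with `kubectl get svc -n monitoring`):

```yaml
# kubernetes/ingress-grafana.yml (sketch)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
spec:
  ingressClassName: traefik
  rules:
    - host: grafana.team-healthpulse.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: monitoring-grafana   # the Grafana Service created by the chart
                port:
                  number: 80
```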
Apply the Grafana ingress:
kubectl apply -f kubernetes/ingress-grafana.yml
# Verify Traefik picked it up
kubectl get ingress -n monitoring

NAME              CLASS     HOSTS                          ADDRESS     PORTS   AGE
grafana-ingress   traefik   grafana.team-healthpulse.com   10.43.0.1   80      10s
Open http://grafana.team-healthpulse.com in your browser.
Why this works without DNS changes: The Terraform DNS config includes a wildcard A record (`*.team-healthpulse.com` → k3s master EIP). Any subdomain not explicitly defined — including `grafana` — automatically resolves to the k3s master, where Traefik picks it up and routes it based on the Ingress hostname rule.
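You can verify the wildcard resolution from your laptop (domain from this task):

```bash
# The wildcard A record should answer for any subdomain, including grafana
dig +short grafana.team-healthpulse.com
# → the k3s master's Elastic IP
```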
4.1.2 — Option B: Port-Forward
# Local machine:
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80

# On the EC2 master (bind to all interfaces so you can reach it from your laptop):
kubectl port-forward -n monitoring --address 0.0.0.0 svc/monitoring-grafana 3000:80
Keep this terminal running. Open a new terminal for other commands.
- Open browser: http://localhost:3000
- Username: `admin` / Password: `healthpulse123`
| Area | Where | What It Shows |
|---|---|---|
| Dashboards | Left sidebar → Dashboards | Browse all pre-built and custom dashboards |
| Explore | Left sidebar → Explore | Free-form PromQL query editor |
| Alerting | Left sidebar → Alerting | Alert rules, notification channels |
| Data Sources | Settings → Data Sources | Prometheus (pre-configured) |
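A good first query in Explore is `up` — Prometheus exports one series per scrape target, with value 1 when the target is reachable. The job label value below is an assumption; use the Metrics browser to see yours:

```promql
up                        # one series per target: 1 = reachable, 0 = down
up{job="node-exporter"}   # narrow to a single job with a label matcher
```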
The kube-prometheus-stack ships with 20+ production-grade dashboards.
- Click Dashboards → Browse
- Look for dashboards starting with `Kubernetes /` and `Node Exporter /`
Select a namespace from the dropdown to see:
| Panel | What It Shows |
|---|---|
| CPU Usage | CPU consumed by each pod — are pods CPU-starved? |
| CPU Quota | CPU requested vs actual — over-provisioning or under-provisioning? |
| Memory Usage | Memory per pod — approaching limits? (OOMKill risk) |
| Network I/O | Bytes sent/received per pod |
Shows each EC2 instance's resources — compare with Datadog:
| Panel | Datadog Equivalent |
|---|---|
| CPU Busy | system.cpu.user |
| Memory Usage | system.mem.used |
| Disk Space | system.disk.in_use |
| Network Traffic | system.net.bytes_rcvd |
Key insight: The numbers from Node Exporter and Datadog should be very close. Cross-referencing validates both monitoring systems.
Shows receive/transmit bandwidth and packet rates per pod.
- Open Kubernetes / Compute Resources / Namespace (Pods) dashboard
- Select namespace: `healthpulse-prod`
- Observe CPU and memory for every production pod
Terminal 1: Keep Grafana open on the namespace dashboard
Terminal 2: Deploy a new version:
kubectl set image deployment/healthpulse-portal \
healthpulse-portal=<ARTIFACTORY_URL>/healthpulse-portal:2.0.0 \
-n healthpulse-prod

Watch Grafana — old pod lines end, new pod lines appear, with a brief overlap (rolling update).
# Generate load
kubectl run load-test --image=busybox -n healthpulse-prod --restart=Never -- \
/bin/sh -c "while true; do wget -q -O- http://healthpulse-service/health; done"
# In another terminal — watch HPA scale
kubectl get hpa -n healthpulse-prod -w
# Clean up when done
kubectl delete pod load-test -n healthpulse-prod

Watch Grafana: CPU climbs, new pod lines appear as HPA scales up.
- Click Dashboards → New → New Dashboard
- Rename to: HealthPulse — Kubernetes Overview
- PromQL: `count(kube_pod_info) by (namespace)`
- Type: Bar gauge or Stat
- Counts pods in each namespace using kube-state-metrics data.
- PromQL: `rate(container_cpu_usage_seconds_total{namespace="healthpulse-prod", container!="", container!="POD"}[5m])`
- Type: Time series
- `rate()` converts a cumulative counter into per-second usage over 5 minutes. The `container!="POD"` filter excludes Kubernetes pause containers.
- PromQL: `container_memory_usage_bytes{namespace="healthpulse-prod", container!="", container!="POD"}`
- Type: Time series
- Unit: bytes (Standard options → Unit → Data → bytes)
- Memory is a gauge (goes up and down) — no `rate()` needed.
- PromQL: `kube_pod_container_status_restarts_total{namespace=~"healthpulse-.*"}`
- Type: Stat or Table
- High restart count = CrashLoopBackOff. Healthy pods show 0.
- PromQL: `kube_horizontalpodautoscaler_status_current_replicas{namespace="healthpulse-prod"}`
- Type: Time series
- Optionally add a second query: `kube_horizontalpodautoscaler_status_desired_replicas{namespace="healthpulse-prod"}`
- When these lines diverge, the cluster is actively scaling.
Save the dashboard (disk icon at top).
- Alerting → Alert rules → New alert rule
- Name: `Pod CrashLooping — HealthPulse`
- PromQL: `increase(kube_pod_container_status_restarts_total{namespace=~"healthpulse-.*"}[5m]) > 3`
- Evaluate every: `1m` | For: `5m`
- Label: `severity: warning`
Fires when any HealthPulse pod restarts more than 3 times in 5 minutes.
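To watch this alert fire without breaking anything real, you can run a deliberately crash-looping pod in a dev namespace — a sketch; the pod name is arbitrary:

```bash
# A container that exits immediately; restartPolicy Always makes the kubelet
# restart it, so RESTARTS climbs and the increase(...) > 3 condition is met
kubectl run crashtest -n healthpulse-dev --image=busybox --restart=Always -- /bin/sh -c "exit 1"
kubectl get pod crashtest -n healthpulse-dev -w   # watch it reach CrashLoopBackOff

# Clean up once the alert has fired
kubectl delete pod crashtest -n healthpulse-dev
```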
- Name: `High CPU — HealthPulse Prod`
- PromQL: `(sum(rate(container_cpu_usage_seconds_total{namespace="healthpulse-prod", container!="", container!="POD"}[5m])) by (pod)) > 0.8`
- Evaluate every: `1m` | For: `5m`
- Label: `severity: critical`
- Alerting → Contact points → New contact point
- Name: `HealthPulse Notifications`
- Type: Email (enter your address) or Slack (enter webhook URL, channel `#healthpulse-alerts`)
- Notification policies → edit default policy to use your contact point
k9s is a terminal-based Kubernetes UI — like htop for your cluster.
- Mac: `brew install derailed/k9s/k9s`
- Linux: `curl -sS https://webinstall.dev/k9s | bash`
- Windows (Scoop): `scoop install k9s`
KUBECONFIG=~/.kube/healthpulse-config k9s

Type `:` to enter command mode:
| Command | View |
|---|---|
| `:pods` | All pods |
| `:deploy` | Deployments |
| `:svc` | Services |
| `:ns` | Namespaces |
| `:nodes` | Cluster nodes |
| `:hpa` | Horizontal Pod Autoscalers |
| `:events` | Recent cluster events |
| Key | Action |
|---|---|
| `l` | View logs |
| `s` | Shell into pod |
| `d` | Describe resource |
| `y` | View YAML |
| `/` | Filter/search |
| `Enter` | Drill into resource |
| `Esc` | Go back |
| `:q` | Quit |
| Metric | k9s | Grafana |
|---|---|---|
| Pod count | :pods — count rows | Pod Count panel |
| CPU | :pods — CPU column | CPU Usage panel |
| Restarts | :pods — RESTARTS column | Restart Count panel |
| HPA replicas | :hpa — REPLICAS column | HPA panel |
When to use which: k9s for real-time interactive checks ("What's happening NOW?"). Grafana for trends and history ("What happened over the last 6 hours?").
| Aspect | Datadog (Task D) | Prometheus + Grafana (Task K) |
|---|---|---|
| Scope | Infrastructure/OS metrics | Kubernetes-native metrics |
| Where it runs | Agent on each EC2 | Pods inside k3s cluster |
| Installed via | Ansible | Helm |
| Data storage | Datadog cloud (SaaS) | In-cluster (Prometheus pod) |
| Dashboards | Datadog web UI | Grafana (self-hosted) |
| Query language | Datadog queries | PromQL |
| Alerting | Datadog Monitors | Alertmanager + Grafana |
| Cost | Free for 5 hosts, paid beyond | Free and open source |
| Industry | Enterprise (Netflix, Airbnb) | Cloud-native standard (CNCF) |
When to use which (they complement each other):
| Scenario | Best Tool |
|---|---|
| EC2 running out of disk? | Datadog |
| Pods keep restarting? | Prometheus |
| HPA replica count? | Prometheus |
| Nginx healthy on bare-metal? | Datadog |
| Server went down? | Datadog |
| Pod crash-loop alert? | Prometheus/Grafana |
The real-world pattern:
Layer 1: Infrastructure monitoring (Datadog) → "Are my machines healthy?"
Layer 2: Kubernetes monitoring (Prometheus) → "Are my applications healthy?"
Layer 3: Application monitoring (APM) → "Are my users happy?"
You've now built Layers 1 and 2.
Add a monitoring section to docs/architecture.md describing the two-layer monitoring approach and Prometheus components.
Create docs/adr/007-monitoring-tools.md:
- Decision: Datadog for infrastructure, Prometheus + Grafana for Kubernetes, k9s for interactive management
- Rationale: Datadog watches machines, Prometheus watches workloads, they complement each other
- Consequences: Two systems to maintain, Prometheus retention limited by in-cluster storage
Add screenshots of:
- Node Exporter / Nodes dashboard
- Namespace (Pods) dashboard for `healthpulse-prod`
- Your custom HealthPulse dashboard
- Helm installed (`helm version`)
- kube-prometheus-stack deployed (`kubectl get pods -n monitoring` — all Running)
- Grafana accessible at localhost:3000 via port-forward
- Pre-built dashboards explored (Node Exporter, Namespace Pods, Networking)
- Custom dashboard created with 5 panels (pod count, CPU, memory, restarts, HPA)
- At least one alert rule configured (crash-loop or high CPU)
- k9s installed and can navigate cluster resources
- Datadog vs Prometheus comparison understood and documented
Be prepared to:
- Show `kubectl get pods -n monitoring` and explain each component
- Open Grafana and navigate pre-built dashboards — explain the metrics
- Show your custom dashboard and explain each PromQL query
- Demo k9s — navigate pods, view logs, describe a resource
- Explain: Why both Datadog AND Prometheus? What does each monitor?
- Cross-reference: Show the same metric in both Datadog and Grafana
kubectl describe pod <POD_NAME> -n monitoring
# Check the Events section

- Insufficient resources: Monitoring needs CPU/memory. On t3.small instances, the cluster may be tight — check with:

kubectl top nodes

Fix: Scale down dev replicas temporarily or use t3.medium instances.

- PVC pending: Check `kubectl get pvc -n monitoring`. The k3s local-path-provisioner should auto-bind.
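If a PVC does stay Pending, these commands usually show why:

```bash
kubectl get pvc -n monitoring                    # which claim is stuck
kubectl get storageclass                         # k3s ships "local-path" as the default
kubectl describe pvc <PVC_NAME> -n monitoring    # Events explain what's blocking binding
```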
# Verify Prometheus is running and scraping
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/targets — all targets should show UP

Port-forward died? Just re-run:

kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80

| Error | Fix |
|---|---|
| `parse error` | PromQL is case-sensitive — check typos |
| `unknown metric` | Use Explore → Metrics browser to find correct names |
| `no data` | Label filter doesn't match — start broad, add filters gradually |
| `rate() requires counter` | Remove `rate()` — you're using it on a gauge metric |
# Check the release status
helm status monitoring -n monitoring
# Check what revision it's on and if it errored
helm history monitoring -n monitoring
# If stuck in a failed state, uninstall and retry:
helm uninstall monitoring -n monitoring
kubectl delete namespace monitoring
# Wait for namespace to fully terminate (can take 30-60 seconds)
kubectl get namespace monitoring # run until it disappears
# Reinstall
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
-f kubernetes/monitoring/values.yml

# List all Kubernetes resources created by this release
helm get manifest monitoring -n monitoring
# See which values are active (your overrides merged with defaults)
helm get values monitoring -n monitoring --all

# Test kubectl first
kubectl get nodes
# If kubectl works but k9s doesn't:
k9s --kubeconfig ~/.kube/healthpulse-config

| Concept | What It Means |
|---|---|
| Prometheus | Open-source monitoring that scrapes metrics from targets and stores them as time-series. CNCF graduated project. |
| PromQL | Prometheus Query Language. Like SQL but for time-series. Examples: rate(...), count(...) by (label) |
| Time series | Data points indexed by timestamp. Example: CPU at 10:00=45%, 10:01=47%. Every Prometheus metric is a time series. |
| Scraping | Prometheus PULLS data from targets by calling /metrics endpoints at regular intervals. |
| Exporter | Component that exposes metrics in Prometheus format. Node Exporter = OS metrics. kube-state-metrics = K8s objects. |
| Grafana | Visualization platform. Connects to Prometheus and renders dashboards with graphs, tables, alerts. |
| Helm | Package manager for Kubernetes. Like apt or brew but for Kubernetes apps. One command installs dozens of coordinated manifests. |
| Helm chart | A package of templated Kubernetes manifests + default values. kube-prometheus-stack = Prometheus + Grafana + exporters + RBAC in one package. |
| Helm release | A running instance of a chart in your cluster. Named by you at install time (monitoring). You can install the same chart multiple times with different names. |
| Helm repository | A server hosting a collection of charts. Like npm registry or apt sources. Add with helm repo add. |
| values.yml | Your configuration overrides for a chart. Merged with the chart's defaults at install time. Only specify what you want to change. |
| helm template | Renders the YAML Helm would apply — without touching the cluster. Use for previewing and debugging. |
| helm upgrade | Apply changed values or a newer chart version to an existing release. Helm diffs and applies only what changed. |
| helm rollback | Restore a release to a previous revision. Works because Helm stores release history as cluster secrets. |
| Alertmanager | Receives alerts from Prometheus, deduplicates them, routes to email/Slack/PagerDuty. |
| k9s | Terminal-based Kubernetes UI. Real-time interactive view without typing kubectl commands. |
| Counter vs Gauge | Counter only goes up (total restarts) — use rate(). Gauge goes up and down (current memory) — read directly. |