Introduction
SRE and DevOps interviews are unlike standard software engineering interviews. Yes, you will write code—but the questions probe your understanding of distributed systems failure modes, your ability to design for reliability, and your judgment during high-stakes incidents. Companies like Google, PagerDuty, Cloudflare, HashiCorp, and most cloud-native startups are aggressively hiring in this space.
Reliability Principles & SRE Concepts
Q: What are SLIs, SLOs, and SLAs? How do they relate?
- SLI (Service Level Indicator): A specific metric that measures service behavior. Example: request success rate, P99 latency, error rate.
- SLO (Service Level Objective): An internal reliability target. Example: "99.9% of requests succeed within 200ms over a 30-day window."
- SLA (Service Level Agreement): An external contractual commitment with customers, usually less aggressive than the internal SLO. Breach of SLA triggers financial penalties.
The SLO should be set below what the system actually achieves—providing an error budget. If the SLO is 99.9% and you're achieving 99.95%, you have 0.05% error budget to spend on risky changes, experiments, and planned downtime.
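As a quick sketch of that arithmetic (the 99.9% SLO and 99.95% measured availability are the illustrative numbers from above; in practice both come from your metrics system):

```shell
# Remaining error budget: the gap between measured availability and the SLO.
awk -v slo=99.9 -v measured=99.95 'BEGIN {
  budget = 100 - slo          # total error budget: 0.1%
  spent  = 100 - measured     # unreliability actually incurred: 0.05%
  printf "budget=%.2f%% spent=%.2f%% remaining=%.2f%%\n", budget, spent, budget - spent
}'
```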
Q: What is an error budget and how does it influence release velocity?
An error budget is the allowed unreliability within a given window. If your SLO is 99.9% over 30 days, your error budget is 0.1%—about 43 minutes of downtime.
When error budget is plentiful: move fast, deploy frequently, experiment. When error budget is depleted: freeze feature deployments, focus on reliability improvements until the window resets.
This creates a principled, data-driven mechanism for balancing reliability vs. velocity without perpetual political debates.
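The 43-minute figure falls out of a straightforward conversion, sketched here for a 99.9% SLO over a 30-day window:

```shell
# Convert an SLO over a 30-day window into an allowed-downtime budget.
awk -v slo=99.9 -v days=30 'BEGIN {
  window_min = days * 24 * 60                  # 43,200 minutes in the window
  budget_min = window_min * (100 - slo) / 100  # fraction of the window allowed to fail
  printf "%.1f minutes of downtime per %d days\n", budget_min, days
}'
```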
Kubernetes
Q: Explain the Kubernetes control plane components.
- API Server: The central REST interface for all cluster operations. kubectl, controllers, and kubelets all communicate through it.
- etcd: Distributed key-value store that holds all cluster state. If etcd is unhealthy, the API server cannot persist changes, so the cluster cannot schedule new Pods or reconcile state (already-running Pods keep running).
- Scheduler: Watches for unscheduled Pods and assigns them to Nodes based on resource requirements, affinity rules, and taints/tolerations.
- Controller Manager: Runs controllers (Deployment, ReplicaSet, Node, etc.) that reconcile desired state with actual state.
Q: What is a Pod Disruption Budget (PDB) and when do you use it?
A PDB defines the minimum number of replicas that must remain available during voluntary disruptions (node drains, cluster upgrades). Without a PDB, a drain could remove all replicas of a service simultaneously.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2  # at least 2 replicas must be running during disruption
  selector:
    matchLabels:
      app: api-server
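The same policy can be expressed as a ceiling on disruptions instead of a floor on availability. A sketch of the equivalent maxUnavailable form (absolute counts and percentages are both accepted):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 1  # at most 1 replica may be down during a voluntary disruption
  selector:
    matchLabels:
      app: api-server
```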
Q: How do you handle a pod that is stuck in CrashLoopBackOff?
Diagnostic steps:
- kubectl describe pod <name> — check the Events section for resource limits, liveness probe failures, or image pull errors.
- kubectl logs <name> --previous — view logs from the previous (crashed) container.
- Check resource requests/limits — OOMKilled events appear in describe output.
- Verify the readiness/liveness probe configuration.
- kubectl exec -it <name> -- /bin/sh (if the container starts briefly) to inspect the filesystem or environment.
CI/CD Pipeline Design
Q: Walk me through a production-grade CI/CD pipeline.
Code Push → PR Lint & Unit Tests → Build Docker Image →
Integration Tests → Security Scan (Snyk/Trivy) →
Push to Registry → Deploy to Staging (Argo CD) →
Smoke Tests → Manual Approval Gate (for production) →
Blue/Green Deploy to Production → Automated Rollback on Error Rate Spike
Key principles: fail fast (linting before expensive tests), immutable artifacts (same image promoted through environments—never rebuild), automated rollback triggers (deployment controller watches error rate SLO and rolls back automatically).
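One way the fail-fast ordering and immutable-artifact promotion look in practice — a minimal sketch assuming GitHub Actions, where job names, image registry, and Makefile targets are all illustrative:

```yaml
# Illustrative sketch only: registry, targets, and scanner invocation are assumptions.
name: ci
on: [push]
jobs:
  lint-and-unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint test-unit  # cheap checks first: fail fast
  build-scan-push:
    needs: lint-and-unit          # only build once the fast checks pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t registry.example.com/api:${{ github.sha }} .
      - run: trivy image registry.example.com/api:${{ github.sha }}  # security scan
      - run: docker push registry.example.com/api:${{ github.sha }}  # immutable artifact, promoted as-is
```

Tagging the image with the commit SHA is what makes the artifact immutable: staging and production deploy the same digest, never a rebuild.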
Q: What is the difference between blue/green and canary deployments?
Blue/Green: Two identical environments. Traffic switches 0% → 100% from old to new instantly. Fast rollback (switch traffic back). High infrastructure cost (double the resources during deployment).
Canary: Incremental traffic shift—5% → 25% → 50% → 100%. Slower, but allows real-world validation at small blast radius. Ideal for high-traffic services where you want to soak-test with a subset of users.
Feature flags allow even finer-grained control: deploy code to 100% of servers but expose the feature to 1% of users, independent of deployment state.
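With Argo Rollouts (a companion project to the Argo CD mentioned above), the incremental shift can be declared as explicit steps. A sketch, with the pod template and selector omitted for brevity:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-rollout
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5             # send 5% of traffic to the new version
        - pause: {duration: 10m}   # soak; watch error rate and latency
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {}                # indefinite pause: requires manual promotion
```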
Observability (Metrics, Logs, Traces)
Q: What are the three pillars of observability?
- Metrics: Aggregated numerical data over time (Prometheus, Datadog, CloudWatch). Good for alerting and dashboards. Limited context on why something is wrong.
- Logs: Structured event records (Loki, Splunk, Datadog Logs). Rich context but high volume; requires structured logging (JSON) and good indexing.
- Traces: Distributed request traces (Jaeger, Zipkin, Datadog APM). Show the path of a request across services, revealing where latency is introduced.
Use metrics to detect problems, logs to investigate events, and traces to diagnose distributed latency.
Q: What is the RED method for monitoring microservices?
RED stands for:
- Rate: Requests per second the service is handling.
- Errors: Percentage of requests that are failing.
- Duration: P50, P95, P99 latency of requests.
Every service should have dashboards and alerts covering these three signals. Compare to the USE method (for infrastructure resources): Utilization, Saturation, Errors.
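Assuming Prometheus with histogram metrics named http_requests_total and http_request_duration_seconds (the metric and label names here are illustrative), the three RED signals map to queries like:

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Duration: P99 latency from a histogram
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```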
Incident Management
Q: Walk me through how you would handle a production SEV1 incident.
- Detect: Alert fires (or report comes in). Confirm the scope—how many users affected? Which services?
- Declare: Page the on-call rotation. Open an incident channel (#incident-2024-0315). Assign Incident Commander (IC) role.
- Mitigate first, investigate later: Can you roll back a recent deployment? Enable a feature flag to disable a broken feature? Add capacity? Take the fastest path to reducing user impact.
- Communicate: Post regular updates to status page and internal stakeholders. Customers hate silence more than bad news.
- Resolve: Confirm metrics return to baseline. Monitor for recurrence.
- Postmortem: Blameless postmortem within 48 hours. 5 Whys analysis. Actionable follow-up items with owners and due dates.
Q: What makes a good postmortem?
- Blameless: Systems fail, not people. Psychological safety is a prerequisite for honest root cause analysis.
- Timeline: Exact sequence of events with timestamps. "When did we first know something was wrong?"
- Impact quantification: Users affected, revenue impact, SLO burn.
- Root cause (not proximate cause): "The deployment broke prod" is a proximate cause. The root cause might be "no automated rollback on error rate threshold."
- Action items: Specific, assigned, time-bounded. "Implement automated rollback by 2024-04-01, owned by [Name]."
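SLO burn, from the impact-quantification bullet above, can be stated as a number rather than an adjective. A sketch, assuming a 30-minute outage against the 99.9% / 30-day budget used earlier:

```shell
# How much of the monthly error budget did this incident consume?
awk -v outage_min=30 -v slo=99.9 -v days=30 'BEGIN {
  budget_min = days * 24 * 60 * (100 - slo) / 100  # 43.2 minutes at 99.9%
  printf "burned %.0f%% of the error budget\n", outage_min / budget_min * 100
}'
```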
Summary
SRE interviews reward engineers who think in systems, design for failure, and remain calm under pressure. Practice articulating reliability principles (SLOs, error budgets), demonstrate Kubernetes and observability depth, and have an incident you can walk through in vivid detail.
Practice DevOps & SRE interview questions on Interview Masters →
