Introduction
SRE and DevOps interviews are unlike standard software engineering interviews. Yes, you will write code—but the questions probe your understanding of distributed systems failure modes, your ability to design for reliability, and your judgment during high-stakes incidents. Companies like Google, PagerDuty, Cloudflare, HashiCorp, and most cloud-native startups are aggressively hiring in this space.
Reliability Principles & SRE Concepts
Q: What are SLIs, SLOs, and SLAs? How do they relate?
- SLI (Service Level Indicator): A specific metric that measures service behavior. Example: request success rate, P99 latency, error rate.
- SLO (Service Level Objective): An internal reliability target. Example: "99.9% of requests succeed within 200ms over a 30-day window."
- SLA (Service Level Agreement): An external contractual commitment with customers, usually less aggressive than the internal SLO. Breach of SLA triggers financial penalties.
The SLO should be set below what the system actually achieves—providing an error budget. If the SLO is 99.9% and you're achieving 99.95%, you have 0.05% error budget to spend on risky changes, experiments, and planned downtime.
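As a quick sketch of that arithmetic (the 99.9% SLO and 99.95% measured availability are the illustrative numbers from above; in practice both come from your metrics system):

```shell
# Remaining error budget: the gap between measured availability and the SLO.
awk -v slo=99.9 -v measured=99.95 'BEGIN {
  budget = 100 - slo          # total error budget: 0.1%
  spent  = 100 - measured     # unreliability actually incurred: 0.05%
  printf "budget=%.2f%% spent=%.2f%% remaining=%.2f%%\n", budget, spent, budget - spent
}'
```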
Q: What is an error budget and how does it influence release velocity?
An error budget is the allowed unreliability within a given window. If your SLO is 99.9% over 30 days, your error budget is 0.1%—about 43 minutes of downtime.
When error budget is plentiful: move fast, deploy frequently, experiment. When error budget is depleted: freeze feature deployments, focus on reliability improvements until the window resets.
This creates a principled, data-driven mechanism for balancing reliability vs. velocity without perpetual political debates.
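The 43-minute figure falls out of a straightforward conversion, sketched here for a 99.9% SLO over a 30-day window:

```shell
# Convert an SLO over a 30-day window into an allowed-downtime budget.
awk -v slo=99.9 -v days=30 'BEGIN {
  window_min = days * 24 * 60                  # 43,200 minutes in the window
  budget_min = window_min * (100 - slo) / 100  # fraction of the window allowed to fail
  printf "%.1f minutes of downtime per %d days\n", budget_min, days
}'
```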
Kubernetes
Q: Explain the Kubernetes control plane components.
- API Server: The central REST interface for all cluster operations. kubectl, controllers, and kubelets all communicate through it.
- etcd: Distributed key-value store that holds all cluster state. If etcd is unhealthy, the API server cannot persist changes, so the cluster cannot schedule new Pods or reconcile state (already-running Pods keep running).
- Scheduler: Watches for unscheduled Pods and assigns them to Nodes based on resource requirements, affinity rules, and taints/tolerations.
- Controller Manager: Runs controllers (Deployment, ReplicaSet, Node, etc.) that reconcile desired state with actual state.
Q: What is a Pod Disruption Budget (PDB) and when do you use it?
A PDB defines the minimum number of replicas that must remain available during voluntary disruptions (node drains, cluster upgrades). Without a PDB, a drain could remove all replicas of a service simultaneously.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2  # at least 2 replicas must be running during disruption
  selector:
    matchLabels:
      app: api-server
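The same policy can be expressed as a ceiling on disruptions instead of a floor on availability. A sketch of the equivalent maxUnavailable form (absolute counts and percentages are both accepted):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 1  # at most 1 replica may be down during a voluntary disruption
  selector:
    matchLabels:
      app: api-server
```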
Q: How do you handle a pod that is stuck in CrashLoopBackOff?
Diagnostic steps:
- kubectl describe pod <name> — check the Events section for resource limits, liveness probe failures, or image pull errors.
- kubectl logs <name> --previous — view logs from the previous (crashed) container.
- Check resource requests/limits — OOMKilled events appear in describe output.
- Verify the readiness/liveness probe configuration.
- kubectl exec -it <name> -- /bin/sh (if the container starts briefly) to inspect the filesystem or environment.
CI/CD Pipeline Design
Q: Walk me through a production-grade CI/CD pipeline.
Code Push → PR Lint & Unit Tests → Build Docker Image →
Integration Tests → Security Scan (Snyk/Trivy) →
Push to Registry → Deploy to Staging (Argo CD) →
Smoke Tests → Manual Approval Gate (for production) →
Blue/Green Deploy to Production → Automated Rollback on Error Rate Spike
Key principles: fail fast (linting before expensive tests), immutable artifacts (same image promoted through environments—never rebuild), automated rollback triggers (deployment controller watches error rate SLO and rolls back automatically).
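One way the fail-fast ordering and immutable-artifact promotion look in practice — a minimal sketch assuming GitHub Actions, where job names, image registry, and Makefile targets are all illustrative:

```yaml
# Illustrative sketch only: registry, targets, and scanner invocation are assumptions.
name: ci
on: [push]
jobs:
  lint-and-unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint test-unit  # cheap checks first: fail fast
  build-scan-push:
    needs: lint-and-unit          # only build once the fast checks pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t registry.example.com/api:${{ github.sha }} .
      - run: trivy image registry.example.com/api:${{ github.sha }}  # security scan
      - run: docker push registry.example.com/api:${{ github.sha }}  # immutable artifact, promoted as-is
```

Tagging the image with the commit SHA is what makes the artifact immutable: staging and production deploy the same digest, never a rebuild.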
Q: What is the difference between blue/green and canary deployments?
Blue/Green: Two identical environments. Traffic switches 0% → 100% from old to new instantly. Fast rollback (switch traffic back). High infrastructure cost (double the resources during deployment).
Canary: Incremental traffic shift—5% → 25% → 50% → 100%. Slower, but allows real-world validation at small blast radius. Ideal for high-traffic services where you want to soak-test with a subset of users.
Feature flags allow even finer-grained control: deploy code to 100% of servers but expose the feature to 1% of users, independent of deployment state.
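With Argo Rollouts (a companion project to the Argo CD mentioned above), the incremental shift can be declared as explicit steps. A sketch, with the pod template and selector omitted for brevity:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-rollout
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5             # send 5% of traffic to the new version
        - pause: {duration: 10m}   # soak; watch error rate and latency
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {}                # indefinite pause: requires manual promotion
```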
Observability (Metrics, Logs, Traces)
Q: What are the three pillars of observability?
- Metrics: Aggregated numerical data over time (Prometheus, Datadog, CloudWatch). Good for alerting and dashboards. Limited context on why something is wrong.
- Logs: Structured event records (Loki, Splunk, Datadog Logs). Rich context but high volume; requires structured logging (JSON) and good indexing.
- Traces: Distributed request traces (Jaeger, Zipkin, Datadog APM). Show the path of a request across services, revealing where latency is introduced.
Use metrics to detect problems, logs to investigate events, and traces to diagnose distributed latency.
Q: What is the RED method for monitoring microservices?
RED stands for:
- Rate: Requests per second the service is handling.
- Errors: Percentage of requests that are failing.
- Duration: P50, P95, P99 latency of requests.
Every service should have dashboards and alerts covering these three signals. Compare to the USE method (for infrastructure resources): Utilization, Saturation, Errors.
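Assuming Prometheus with histogram metrics named http_requests_total and http_request_duration_seconds (the metric and label names here are illustrative), the three RED signals map to queries like:

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Duration: P99 latency from a histogram
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```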
Incident Management
Q: Walk me through how you would handle a production SEV1 incident.
- Detect: Alert fires (or report comes in). Confirm the scope—how many users affected? Which services?
- Declare: Page the on-call rotation. Open an incident channel (#incident-2024-0315). Assign Incident Commander (IC) role.
- Mitigate first, investigate later: Can you roll back a recent deployment? Enable a feature flag to disable a broken feature? Add capacity? Take the fastest path to reducing user impact.
- Communicate: Post regular updates to status page and internal stakeholders. Customers hate silence more than bad news.
- Resolve: Confirm metrics return to baseline. Monitor for recurrence.
- Postmortem: Blameless postmortem within 48 hours. 5 Whys analysis. Actionable follow-up items with owners and due dates.
Q: What makes a good postmortem?
- Blameless: Systems fail, not people. Psychological safety is a prerequisite for honest root cause analysis.
- Timeline: Exact sequence of events with timestamps. "When did we first know something was wrong?"
- Impact quantification: Users affected, revenue impact, SLO burn.
- Root cause (not proximate cause): "The deployment broke prod" is a proximate cause. The root cause might be "no automated rollback on error rate threshold."
- Action items: Specific, assigned, time-bounded. "Implement automated rollback by 2024-04-01, owned by [Name]."
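SLO burn, from the impact-quantification bullet above, can be stated as a number rather than an adjective. A sketch, assuming a 30-minute outage against the 99.9% / 30-day budget used earlier:

```shell
# How much of the monthly error budget did this incident consume?
awk -v outage_min=30 -v slo=99.9 -v days=30 'BEGIN {
  budget_min = days * 24 * 60 * (100 - slo) / 100  # 43.2 minutes at 99.9%
  printf "burned %.0f%% of the error budget\n", outage_min / budget_min * 100
}'
```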
Summary
SRE interviews reward engineers who think in systems, design for failure, and remain calm under pressure. Practice articulating reliability principles (SLOs, error budgets), demonstrate Kubernetes and observability depth, and have an incident you can walk through in vivid detail.
Practice DevOps & SRE interview questions on Interview Masters →
