DevOps Engineer Interview Questions & Answers
DevOps interviews assess your knowledge of infrastructure, automation, CI/CD, monitoring, and your ability to bridge development and operations. This guide covers the most frequently asked questions.
CI/CD and Automation
Q1: Explain the difference between continuous integration, continuous delivery, and continuous deployment.
Continuous Integration (CI): Developers frequently merge code changes to a shared repository, triggering automated builds and tests. Goal: catch integration issues early.
Continuous Delivery (CD): Extends CI by automatically preparing code for release to production. Every change that passes tests is deployable, but deployment requires manual approval.
Continuous Deployment: Every change that passes automated tests is automatically deployed to production. Requires robust testing and monitoring.
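The three practices differ mainly in what happens after the automated tests pass. A minimal sketch of that decision, where the `Practice` enum, the `next_step` function, and the returned strings are all illustrative names and not part of any CI tool:

```python
from enum import Enum


class Practice(Enum):
    CONTINUOUS_INTEGRATION = "continuous integration"
    CONTINUOUS_DELIVERY = "continuous delivery"
    CONTINUOUS_DEPLOYMENT = "continuous deployment"


def next_step(practice: Practice, tests_passed: bool, approved: bool = False) -> str:
    """Describe what happens to a change once the automated build and tests finish."""
    if not tests_passed:
        # All three practices stop a change that fails the automated checks.
        return "reject change"
    if practice is Practice.CONTINUOUS_INTEGRATION:
        # CI only promises that the change is merged and verified.
        return "integrated into shared branch"
    if practice is Practice.CONTINUOUS_DELIVERY:
        # Delivery keeps a human approval step in front of production.
        return "deploy to production" if approved else "release candidate awaiting approval"
    # Continuous deployment has no manual gate.
    return "deploy to production"
```

In an interview, the point to stress is the single `approved` check: it is the only difference between continuous delivery and continuous deployment in this sketch.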
Q2: How would you design a CI/CD pipeline for a microservices application?
Key components:
- Source Stage: Code commit triggers pipeline, version control integration (Git webhooks)
- Build Stage: Build containers for each service, run unit tests, generate artifacts
- Test Stage: Integration tests, contract tests between services, performance tests
- Security Stage: SAST, DAST, dependency scanning, container image scanning
- Deploy Stage: Progressive rollout (canary, blue-green), environment-specific configs, automated rollback capability
Considerations: Parallel execution for independent services, service dependencies, database migrations, feature flags for gradual rollout.
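The stage ordering and the parallel execution of independent services can be sketched in a few lines. This is a toy model, not a real CI configuration: the stage functions are placeholders, the service names are hypothetical, and the rule that fails `legacy-billing` exists only so the example has a failure to show.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Stage = Callable[[str], bool]  # takes a service name, returns True on success


def build(service: str) -> bool:
    # Placeholder: a real stage would build the image and run unit tests.
    return True


def integration_test(service: str) -> bool:
    # Placeholder for integration and contract tests.
    return True


def security_scan(service: str) -> bool:
    # Hypothetical rule so the example contains one failing service.
    return service != "legacy-billing"


def deploy_canary(service: str) -> bool:
    # Placeholder for a progressive rollout.
    return True


STAGES: list[tuple[str, Stage]] = [
    ("build", build),
    ("test", integration_test),
    ("security", security_scan),
    ("deploy", deploy_canary),
]


def run_pipeline(service: str) -> tuple[str, str]:
    """Run the stages in order and stop at the first failure (fail fast)."""
    for name, stage in STAGES:
        if not stage(service):
            return service, f"failed at {name}"
    return service, "deployed"


def run_all(services: list[str]) -> dict[str, str]:
    """Run one pipeline per service; independent services run in parallel."""
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run_pipeline, services))
```

Calling `run_all(["orders", "legacy-billing"])` deploys `orders` and stops `legacy-billing` at the security stage, so a failing scan never reaches the deploy stage.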
Q3: What is Infrastructure as Code (IaC) and why is it important?
IaC manages infrastructure through code rather than manual processes. Tools: Terraform, CloudFormation, Pulumi, Ansible.
Benefits:
- Version control: Track changes, review, and roll back
- Reproducibility: Consistent environments from dev to prod
- Automation: Reduce human error, speed up provisioning
- Documentation: Code serves as living documentation
- Testing: Validate infrastructure changes before applying
Best practices: Modular design, separate environments, state management, use of variables and outputs.
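The idea behind declarative IaC is to compare the desired state in code with the current state and derive the changes. The sketch below illustrates that idea only; it is not how Terraform or any other tool is implemented, and the resource names and attributes are made up.

```python
def plan(current: dict[str, dict], desired: dict[str, dict]) -> dict[str, list[str]]:
    """Compare current and desired state and list the changes needed to converge."""
    shared = desired.keys() & current.keys()
    return {
        "create": sorted(desired.keys() - current.keys()),
        "update": sorted(name for name in shared if desired[name] != current[name]),
        "delete": sorted(current.keys() - desired.keys()),
    }


# Hypothetical resources, keyed by name.
current_state = {"web": {"size": "small"}, "old-cache": {"size": "small"}}
desired_state = {"web": {"size": "medium"}, "db": {"size": "large"}}

changes = plan(current_state, desired_state)
no_changes = plan(desired_state, desired_state)
```

Here `changes` asks to create `db`, update `web`, and delete `old-cache`, while `no_changes` is empty in every category. That second result is what reproducibility means in practice: applying the same code to an already converged environment does nothing.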
Q4: Explain the differences between Ansible, Terraform, and Chef/Puppet.
Terraform: Declarative, infrastructure provisioning, cloud-agnostic, state management, idempotent
Ansible: Procedural and declarative, agentless (SSH), configuration management and orchestration, YAML playbooks
Chef/Puppet: Configuration management, agent-based, define desired state, better for ongoing configuration enforcement
When to use: Terraform for infrastructure provisioning, Ansible for configuration and application deployment, Chef/Puppet for complex ongoing configuration management.
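The word "idempotent" comes up for all of these tools, so it helps to be able to show what it means. A small sketch contrasting a procedural step with a desired-state step; both functions and the config lines are illustrative and are not taken from any of the tools above:

```python
def append_line(lines: list[str], line: str) -> list[str]:
    """Procedural step: always appends, so repeated runs duplicate the line."""
    return lines + [line]


def ensure_line(lines: list[str], line: str) -> tuple[list[str], bool]:
    """Desired-state step: returns (new_lines, changed) and acts only if needed."""
    if line in lines:
        return lines, False
    return lines + [line], True


config = ["PermitRootLogin no"]

first_run, changed_first = ensure_line(config, "MaxAuthTries 3")
second_run, changed_second = ensure_line(first_run, "MaxAuthTries 3")
duplicated = append_line(append_line(config, "MaxAuthTries 3"), "MaxAuthTries 3")
```

The second `ensure_line` call reports no change and leaves the config as it was, while running `append_line` twice leaves two copies of the same line.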
Containers and Orchestration
Q5: Explain the difference between Docker and Kubernetes.
Docker: A platform for building, packaging, and running containers. It bundles an application with its dependencies into an image and creates, runs, and manages individual containers on a single host.
Kubernetes: Container orchestration platform. Manages deployment, scaling, networking, and operations of containers across clusters.
Relationship: Kubernetes uses a container runtime (such as containerd, which Docker also uses) to run containers on each node. Kubernetes handles the orchestration layer on top.
Q6: How does Kubernetes networking work?
Pod-to-Pod networking: Every pod gets a unique IP. Pods can communicate directly without NAT.
Service abstraction: Services provide stable endpoints for pods. Types: ClusterIP (internal), NodePort (external via node ports), LoadBalancer (cloud provider LB).
Ingress: HTTP/HTTPS routing from external traffic to services. Handles SSL termination, path-based routing.
Network Policies: Firewall rules for pod-to-pod communication.
CNI plugins: Handle the actual network implementation (Calico, Flannel, Weave).
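A Service finds its pods through a label selector, which is why the endpoint stays stable while pod IPs change. A rough model of that selection, assuming a simplified pod record with `ip`, `labels`, and `ready` fields (the real Kubernetes objects are richer, and the IPs here are made up):

```python
def matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """A pod matches when it carries every key/value pair in the selector."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(selector: dict[str, str], pods: list[dict]) -> list[str]:
    """Return the IPs of ready pods that match the selector."""
    return [pod["ip"] for pod in pods if pod["ready"] and matches(selector, pod["labels"])]


# Hypothetical pods; each has its own IP, as in the pod-to-pod model above.
pods = [
    {"ip": "10.0.1.4", "labels": {"app": "web", "tier": "frontend"}, "ready": True},
    {"ip": "10.0.2.7", "labels": {"app": "web", "tier": "frontend"}, "ready": False},
    {"ip": "10.0.3.9", "labels": {"app": "api"}, "ready": True},
]

web_endpoints = endpoints({"app": "web"}, pods)
```

Only the ready `web` pod ends up in `web_endpoints`. The pod that is not ready and the `api` pod are both left out.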
Q7: Explain Kubernetes deployment strategies.
Rolling Update (default): Gradually replace old pods with new ones. Zero-downtime but both versions run briefly.
Recreate: Terminate all old pods, then create new ones. Causes downtime but ensures only one version runs at a time.
Blue-Green: Run two identical environments, switch traffic. Fast rollback but resource-intensive.
Canary: Route a small percentage of traffic to the new version. Gradual rollout, quick detection of issues.
Implementation: Use Deployments for rolling updates, Argo Rollouts or Flagger for advanced strategies.
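The core of a canary rollout is a stable split of traffic. The sketch below shows one possible way to do that by hashing a user ID into a bucket. It is an assumption for illustration: tools such as Argo Rollouts or Flagger typically shift traffic at the ingress or service mesh layer instead, and `choose_version` is a hypothetical helper.

```python
import hashlib


def choose_version(user_id: str, canary_percent: int) -> str:
    """Map a user to 'canary' or 'stable' so the same user always gets the same version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first four bytes of the hash as a number and fold it into buckets 0..99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"
```

Raising `canary_percent` step by step (for example 5, 25, 50, 100) widens the rollout, and setting it back to 0 acts as the rollback. Hashing keeps each user on one version between requests, which a random choice per request would not.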
Q8: How do you handle secrets in Kubernetes?
Kubernetes Secrets: Base64 encoded (not encrypted at rest by default), mounted as volumes or environment variables.
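It is worth being able to demonstrate why base64 is not a security measure. A short sketch using only the Python standard library; the secret value is invented for the example:

```python
import base64

# Hypothetical secret value, encoded the way a Secret manifest stores it.
encoded = base64.b64encode(b"s3cr3t-password").decode("ascii")

# Anyone who can read the manifest can reverse the encoding without a key.
decoded = base64.b64decode(encoded).decode("utf-8")
```

No key is involved in either direction, so anyone with read access to the Secret object or to an unencrypted etcd backup can recover the value.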
Better alternatives:
- External secrets managers: HashiCorp Vault, AWS Secrets Manager, Azure Key Vault
- Sealed Secrets: Encrypt secrets so that only the controller running in the cluster can decrypt them, which makes the encrypted form safe to store in Git
- External Secrets Operator: Sync secrets from external managers to Kubernetes
Best practices: Enable encryption at rest for etcd, RBAC for secret access, rotate secrets regularly, audit access.
Monitoring and Observability
Q9: What are the three pillars of observability?
Logs: Record of discrete events. What happened and when. Tools: ELK Stack, Loki, Splunk.
Metrics: Numerical measurements over time. System health indicators. Tools: Prometheus, Datadog, CloudWatch.
Traces: Follow request path through distributed system. Identify bottlenecks. Tools: Jaeger, Zipkin, OpenTelemetry.
Together: Logs tell you what happened, metrics tell you when it happened (and drive alerts), and traces tell you where in the system the time was spent or the failure occurred.
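To make the three signals concrete, here is a toy request handler that emits one of each. It uses in-memory lists and a counter in place of real backends, and every name in it is illustrative. A production system would use a logging pipeline, a metrics client, and a tracing SDK such as OpenTelemetry.

```python
import json
import time
import uuid
from collections import Counter

request_count: Counter = Counter()   # metric: requests per endpoint
spans: list[dict] = []               # trace data: one timed span per request
log_lines: list[str] = []            # logs: one structured event per request


def handle_request(endpoint: str) -> str:
    """Handle a fake request and record a metric, a span, and a log line for it."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    # The real work of the request would happen here.
    duration = time.perf_counter() - start

    request_count[endpoint] += 1
    spans.append({"trace_id": trace_id, "name": endpoint, "duration_s": duration})
    log_lines.append(json.dumps({
        "level": "info",
        "msg": "request handled",
        "endpoint": endpoint,
        "trace_id": trace_id,  # shared ID lets you jump from a log line to its trace
    }))
    return trace_id
```

The shared `trace_id` is the detail to point out: it is what lets you move from an alert on a metric, to the matching log lines, to the trace for one slow request.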
Q10: How would you set up alerting for a production system?
Principles:
- Alert on symptoms, not causes (e.g., error rate, not CPU usage)
- Every alert should be actionable
- Avoid alert fatigue with proper thresholds
- Include runbook links in alerts
Implementation:
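- Define SLOs (Service Level Objectives): targets such as "99.9% of requests succeed"
- Create SLIs (Service Level Indicators): the metrics that measure progress against each SLO
- Alert when an SLI shows that an SLO is at risk, before the objective is actually missed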
- Escalation policies (who gets paged when)
- Post-incident review to improve alerting
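One common way to turn the SLO and SLI steps above into an alert rule is an error budget burn rate. The sketch below assumes a request-based availability SLO. The paging threshold of 10 is an arbitrary value chosen for the example, not a recommendation.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests that may fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the budget is being spent: 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def should_page(failed: int, total: int, slo_target: float, threshold: float = 10.0) -> bool:
    """Page when the budget burns at least `threshold` times faster than allowed."""
    return burn_rate(failed, total, slo_target) >= threshold
```

With a 99.9% SLO, 50 failures in 10,000 requests is a burn rate of about 5, which does not page at this threshold, while 200 failures is a burn rate of about 20 and does. This keeps the alert tied to a user-facing symptom and away from low-level causes.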
Q11: Explain the difference between monitoring and observability.
Monitoring: Collecting and tracking predefined metrics and logs. Answers "Is the system working?" Good for known failure modes.
Observability: Ability to understand system state from external outputs. Answers "Why isn't it working?" and "What's happening that I didn't expect?" Essential for complex distributed systems.
Observability enables: Debugging unknown issues, understanding system behavior, discovering unknown unknowns.
Cloud and Infrastructure
Q12: Explain the differences between IaaS, PaaS, and SaaS.
IaaS (Infrastructure as a Service): Virtualized computing resources (VMs, storage, networking). You manage: OS, runtime, applications. Examples: AWS EC2, Azure VMs.
PaaS (Platform as a Service): Platform for developing and running applications. You manage: applications and data. Examples: Heroku, AWS Elastic Beanstalk, Azure App Service.
SaaS (Software as a Service): Ready-to-use software. You manage: your data within the application. Examples: Salesforce, Google Workspace.
Q13: How do you approach cloud cost optimization?
Visibility: Tagging resources, cost allocation by team/project, budgets and alerts
Right-sizing: Match instance sizes to actual usage, use monitoring data
Pricing models: Reserved instances for steady workloads, spot/preemptible for fault-tolerant workloads, savings plans
Architecture: Auto-scaling, serverless where appropriate, data lifecycle policies, cleaning up unused resources
Governance: Cost reviews, policies preventing over-provisioning, FinOps practices
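The choice between on-demand and reserved pricing comes down to a break-even utilization. A small worked calculation, using made-up hourly rates and a simplified model in which reserved capacity is billed for every hour whether it is used or not:

```python
HOURS_PER_MONTH = 730  # common approximation for an average month


def monthly_cost_on_demand(hourly_rate: float, utilization: float) -> float:
    """On-demand is billed only for the fraction of hours the instance runs."""
    return hourly_rate * HOURS_PER_MONTH * utilization


def monthly_cost_reserved(hourly_rate: float) -> float:
    """Reserved capacity is paid for in full regardless of usage."""
    return hourly_rate * HOURS_PER_MONTH


def break_even_utilization(on_demand_rate: float, reserved_rate: float) -> float:
    """Utilization above which the reservation is the cheaper option."""
    return reserved_rate / on_demand_rate


# Hypothetical prices in dollars per hour, not real cloud pricing.
threshold = break_even_utilization(on_demand_rate=0.10, reserved_rate=0.06)
```

With these example rates the threshold is about 0.6, so a workload that runs more than roughly 60% of the time is cheaper on the reservation, and one that runs less is cheaper on demand. Right-sizing data from monitoring is what tells you which side of the threshold a workload sits on.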
Q14: Design a highly available architecture on AWS.
Components:
- Multi-AZ deployment: Resources in multiple Availability Zones
- Load balancing: ALB/NLB distributing traffic across AZs
- Auto Scaling: Automatically adjust capacity based on demand
- Database: RDS Multi-AZ or Aurora for automatic failover
- Caching: ElastiCache cluster mode for distributed caching
- DNS: Route 53 with health checks and failover routing
Considerations: Data consistency across zones, proper health checks, graceful degradation, disaster recovery plan.
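The benefit of Multi-AZ redundancy can be estimated with basic availability arithmetic. The sketch assumes that failures are independent, which real zones and services only approximate, and the availability figures are illustrative, not AWS numbers.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Redundant copies: the system is down only if every copy is down."""
    return 1.0 - (1.0 - single) ** copies


def serial_availability(*components: float) -> float:
    """Chained components: the system is up only if every component is up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


# Hypothetical figures: one AZ at 99%, deployed across two AZs.
two_zones = parallel_availability(0.99, copies=2)

# A load balancer and a database in the request path, each at 99.9%.
request_path = serial_availability(0.999, 0.999)
```

Under these assumptions two zones raise availability from 99% to about 99.99%, while chaining two 99.9% components lowers the path to about 99.8%. Redundancy in each layer helps, but every extra dependency in the request path costs availability.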
Security
Q15: What is the principle of least privilege and how do you implement it?
Principle: Users/systems should have only the minimum access needed to perform their function.
Implementation:
- Role-based access control (RBAC)
- Service accounts with limited permissions
- Regular access reviews and audits
- Just-in-time access for elevated permissions
- Separate credentials for different environments
AWS example: IAM policies scoped to specific resources and actions, not using root account, temporary credentials via STS.
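A scoped IAM policy might look like the following sketch, built as a Python dictionary and serialized to JSON. The bucket name and prefix are hypothetical, and the helper function is an illustration only and not part of any AWS SDK.

```python
import json


def read_only_object_policy(bucket: str, prefix: str) -> dict:
    """Build a policy that allows reading objects under one prefix and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # One action instead of "s3:*".
                "Action": ["s3:GetObject"],
                # One prefix in one bucket instead of "*".
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }


# Hypothetical bucket and prefix.
policy = read_only_object_policy("example-reports-bucket", "monthly")
policy_json = json.dumps(policy, indent=2)
```

The two things an interviewer usually looks for are both visible here: the action list names a single operation, and the resource is a specific ARN with no wildcard over all resources.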
Q16: How do you secure a container deployment?
Image security: Use trusted base images, scan for vulnerabilities, keep images updated, minimize image size
Runtime security: Run as non-root, read-only file system where possible, drop unnecessary capabilities, seccomp/AppArmor profiles
Network security: Network policies, encrypt traffic (mTLS), separate namespaces for isolation
Access control: RBAC for cluster access, Pod Security Standards, audit logging
Secrets management: External secrets managers, encrypted at rest, minimal secret scope
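The runtime security items map to fields in a container's Kubernetes `securityContext`. The sketch below is a hypothetical audit helper that checks a container spec, given as a plain dictionary, for those settings. It is a simplified illustration and not a replacement for Pod Security Standards enforcement or a policy engine.

```python
def audit_container(container: dict) -> list[str]:
    """Return a list of findings for missing runtime hardening settings."""
    ctx = container.get("securityContext", {})
    findings = []
    if ctx.get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot is not true")
    if ctx.get("readOnlyRootFilesystem") is not True:
        findings.append("readOnlyRootFilesystem is not true")
    if ctx.get("allowPrivilegeEscalation") is not False:
        findings.append("allowPrivilegeEscalation is not false")
    if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
        findings.append("capabilities are not dropped")
    return findings


# Hypothetical container specs.
hardened = {
    "name": "api",
    "securityContext": {
        "runAsNonRoot": True,
        "readOnlyRootFilesystem": True,
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
    },
}
default = {"name": "legacy"}

hardened_findings = audit_container(hardened)
default_findings = audit_container(default)
```

The hardened spec produces no findings, and the container with no `securityContext` produces all four. None of these protections are on unless someone sets them.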
Scenario-Based Questions
Q17: Your application is experiencing slow response times. How do you debug?
Systematic approach:
1. Define the problem: Which endpoints? All users or some? When did it start?
2. Check metrics: CPU, memory, network, disk I/O of application and infrastructure
3. Check traces: Where in the request path is time spent?
4. Check logs: Error messages, slow query warnings
5. Check dependencies: Database performance, external API latency, cache hit rates
6. Check recent changes: Deployments, config changes, traffic patterns
7. Isolate and test: Reproduce in staging, test hypothesis
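Answering "which endpoints?" usually starts with tail latency per endpoint; averages hide the slow requests. A minimal sketch that ranks endpoints by 95th percentile latency using the nearest-rank method; the sample data is invented, and real numbers would come from your metrics or tracing backend.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slowest_endpoints(samples: list[tuple[str, float]], pct: float = 95) -> list[tuple[str, float]]:
    """Group latency samples by endpoint and rank endpoints by the given percentile."""
    by_endpoint: dict[str, list[float]] = defaultdict(list)
    for endpoint, latency_ms in samples:
        by_endpoint[endpoint].append(latency_ms)
    ranked = [(endpoint, percentile(values, pct)) for endpoint, values in by_endpoint.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical (endpoint, latency in ms) samples.
samples = [
    ("/checkout", 120.0),
    ("/checkout", 900.0),
    ("/search", 40.0),
    ("/search", 60.0),
]
ranking = slowest_endpoints(samples)
```

In this data `/checkout` ranks first with a p95 of 900 ms, even though one of its two requests was fast. That narrows the investigation to one endpoint before you open traces or logs.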
Q18: Your team is deploying multiple times a week, but it's causing production issues. How do you improve?
Analysis: What types of issues occur? At what stage do they appear?
Improvements:
- Better testing: More comprehensive tests, staging environment that mirrors prod
- Gradual rollout: Canary deployments, feature flags
- Monitoring: Better observability to catch issues quickly
- Automation: Remove manual steps that introduce errors
- Change management: Review process, smaller changes, easy rollback
- Post-mortems: Learn from each incident, implement preventive measures
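Gradual rollout and easy rollback work best when the rollback decision is automated. One possible rule, sketched below, compares the canary's error rate with the baseline. The ratio, the absolute floor, and the minimum request count are all assumed values that a team would tune for its own traffic.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; zero when there is no traffic."""
    return errors / total if total else 0.0


def should_roll_back(
    baseline: tuple[int, int],
    canary: tuple[int, int],
    max_ratio: float = 2.0,      # assumed: tolerate up to 2x the baseline error rate
    min_requests: int = 100,     # assumed: wait for this much canary traffic
    floor: float = 0.01,         # assumed: ignore canary error rates below 1%
) -> bool:
    """Decide on rollback from (errors, total) counts for baseline and canary."""
    canary_total = canary[1]
    if canary_total < min_requests:
        return False  # not enough traffic to judge yet
    allowed = max(error_rate(*baseline) * max_ratio, floor)
    return error_rate(*canary) > allowed
```

With a baseline of 10 errors in 10,000 requests, a canary with 30 errors in 1,000 requests triggers a rollback, a canary with 1 error in 1,000 does not, and a canary that has served only 50 requests is left running until there is enough data.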
This guide covers essential DevOps interview topics. Focus on demonstrating both technical depth and practical experience. Be prepared to discuss specific tools you've used and challenges you've solved in production environments.
