Quick answer
Data engineering interviews are strongest when you can explain how data flows, where it breaks, and what tradeoffs keep it trustworthy at scale. SQL fluency matters, but system reasoning is what usually separates candidates.
Start here: Data Engineer Interview Prep
Introduction
Data engineering is one of the fastest-growing disciplines in tech, yet interview prep resources remain shockingly thin compared to software engineering or data science. If you are interviewing for a data engineering role at a company like Snowflake, Databricks, Stripe, Airbnb, or any company with a serious data platform, this guide covers the exact questions and frameworks you need.
Category 1: SQL & Data Modeling
Even in an era of Spark and dbt, SQL fluency is table stakes. Data engineering SQL questions tend to be more complex than standard SWE SQL rounds.
Q: What is the difference between a star schema and a snowflake schema?
A star schema has a central fact table surrounded by denormalized dimension tables. Queries are fast because joins are minimal, but storage is redundant. Ideal for analytical workloads (OLAP).
A snowflake schema normalizes dimension tables into sub-dimensions, reducing redundancy at the cost of more complex joins. Better for storage efficiency; worse for query performance without aggressive indexing.
For most modern data warehouses (Snowflake, BigQuery, Redshift), star schemas with wide, denormalized dimension tables are preferred because storage is cheap and compute is the bottleneck.
Q: Write a SQL query to find the top 3 products by revenue in each region for the last 30 days.
WITH regional_revenue AS (
SELECT
region,
product_id,
SUM(revenue) AS total_revenue,
RANK() OVER (PARTITION BY region ORDER BY SUM(revenue) DESC) AS rnk
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY region, product_id
)
SELECT region, product_id, total_revenue
FROM regional_revenue
WHERE rnk <= 3
ORDER BY region, rnk;
Common follow-ups: "What if two products have the same revenue?" (use RANK vs DENSE_RANK vs ROW_NUMBER). "How would this query perform on 10TB of data?" (partitioning, clustering keys, materialized views).
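The tie-handling follow-up is easy to demo. Here is a pure-Python sketch of how RANK, DENSE_RANK, and ROW_NUMBER treat tied revenues, mirroring the SQL window functions (not tied to any particular database):

```python
def rank_rows(revenues):
    """Given a list of revenues, return (revenue, rank, dense_rank, row_number)
    per row, ordered descending, mimicking the SQL window functions."""
    ordered = sorted(revenues, reverse=True)
    results = []
    rank = dense = 0
    prev = object()  # sentinel that compares unequal to any revenue
    for i, rev in enumerate(ordered, start=1):
        if rev != prev:
            rank = i          # RANK: ties share a rank, then positions are skipped
            dense += 1        # DENSE_RANK: ties share a rank, no gaps
            prev = rev
        results.append((rev, rank, dense, i))  # i plays the role of ROW_NUMBER
    return results

# Two products tie at 500: RANK gives 1,1,3; DENSE_RANK gives 1,1,2; ROW_NUMBER 1,2,3.
for rev, r, d, rn in rank_rows([500, 500, 300]):
    print(rev, r, d, rn)
```

With `rnk <= 3` and RANK, a tie at third place can return more than three rows per region; ROW_NUMBER guarantees exactly three but breaks ties arbitrarily.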
Q: What are slowly changing dimensions (SCDs) and how do you implement SCD Type 2?
SCDs handle dimension attributes that change over time (e.g., a customer's address). SCD Type 2 maintains full history by adding a new row for each change:
| customer_id | address | effective_date | expiry_date | is_current |
|---|---|---|---|---|
| 42 | 123 Main St | 2023-01-01 | 2024-06-14 | FALSE |
| 42 | 456 Oak Ave | 2024-06-15 | 9999-12-31 | TRUE |
Implementation in dbt uses the snapshot feature. In raw SQL, use a MERGE statement with surrogate keys.
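A minimal sketch of the MERGE logic in plain Python, with dict rows standing in for warehouse rows (column names follow the table above; the expire/effective date convention is one common choice):

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)

def scd2_upsert(history, customer_id, new_address, change_date):
    """Apply one SCD Type 2 change in place: expire the current row and
    append the new version. Re-applying the same address is a no-op,
    which keeps the load idempotent."""
    for row in history:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["address"] == new_address:
                return history                          # already current: no-op
            row["expiry_date"] = change_date - timedelta(days=1)
            row["is_current"] = False
    history.append({
        "customer_id": customer_id,
        "address": new_address,
        "effective_date": change_date,
        "expiry_date": HIGH_DATE,
        "is_current": True,
    })
    return history

# Reproduces the table above: the old address is expired on 2024-06-14.
history = [{"customer_id": 42, "address": "123 Main St",
            "effective_date": date(2023, 1, 1),
            "expiry_date": HIGH_DATE, "is_current": True}]
scd2_upsert(history, 42, "456 Oak Ave", date(2024, 6, 15))
```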
Practice this with Interview Masters
Generate role-specific pipeline, SQL, and warehouse drills with Interview Masters so you can practice the exact follow-ups interviewers use.
Category 2: Apache Spark
Q: Explain the difference between transformations and actions in Spark.
Transformations are lazy—they define a new RDD/DataFrame but do not execute immediately: map, filter, groupBy, join, select.
Actions trigger execution of the entire DAG: count, collect, show, write, save.
This laziness allows Spark to optimize the execution plan (via the Catalyst optimizer) before running anything.
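The deferred-execution idea can be shown without Spark at all: Python generator expressions behave like transformations (they describe work), and consuming them behaves like an action:

```python
# Not Spark itself, just the same laziness principle with a generator.
log = []

def loud_double(x):
    log.append(x)       # side effect so we can observe when work actually happens
    return x * 2

nums = range(5)
doubled = (loud_double(x) for x in nums)   # "transformation": nothing runs yet
assert log == []                            # no work has happened

result = sum(doubled)                       # "action": pulls data through the chain
assert log == [0, 1, 2, 3, 4]
assert result == 20
```

In Spark, the gap between defining `doubled` and calling `sum` is where Catalyst rewrites the plan (predicate pushdown, join reordering) before any data moves.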
Q: What causes a Spark job to shuffle? Why is shuffling expensive?
A shuffle occurs when data must be redistributed across partitions—typically triggered by groupBy, join, distinct, repartition. It is expensive because:
- Data is serialized, written to disk, sent over the network, and deserialized.
- It is the most common cause of OOM errors and job stalls.
- It forces a stage boundary: Spark must finish the entire preceding stage before the next can start, so one slow partition stalls the whole job.
Mitigation strategies:
- Use broadcast joins for small tables (broadcast(smallDf)).
- Pre-partition data on join keys.
- Prefer reduceByKey over groupByKey; the former combines values locally before the shuffle.
- Tune spark.sql.shuffle.partitions for large datasets.
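The reduceByKey advantage is map-side combining, which you can simulate with plain Python lists standing in for partitions and counting how many records cross the shuffle boundary:

```python
from collections import Counter

partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey-style: every record crosses the shuffle boundary.
shuffled_group = [rec for part in partitions for rec in part]

# reduceByKey-style: combine within each partition first (map-side combine),
# so at most one record per key leaves each partition.
shuffled_reduce = []
for part in partitions:
    local = Counter()
    for k, v in part:
        local[k] += v
    shuffled_reduce.extend(local.items())

print(len(shuffled_group))    # 6 records shuffled
print(len(shuffled_reduce))   # 4 records shuffled

# Both approaches yield the same totals after the reduce side runs.
final = Counter()
for k, v in shuffled_reduce:
    final[k] += v
assert dict(final) == {"a": 3, "b": 3}
```

The gap widens with cardinality: with millions of rows per key, map-side combining shrinks shuffle volume by orders of magnitude.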
Q: What is the difference between cache() and persist() in Spark?
cache() stores the DataFrame in memory using the default MEMORY_AND_DISK storage level.
persist(StorageLevel) allows you to specify the storage level: MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK, OFF_HEAP, etc.
Use caching when a DataFrame is used multiple times in a pipeline. Always call unpersist() when done to free memory.
Q: How do you handle skewed data in a Spark join?
Data skew occurs when one or more join keys have disproportionately large data (e.g., NULL keys, a top customer with millions of orders). Strategies:
- Salting: Add a random prefix to the skewed key, explode the other side to match.
- Broadcast join: If one table fits under spark.sql.autoBroadcastJoinThreshold (10MB by default), broadcast it and skip the shuffle entirely.
- AQE (Adaptive Query Execution): Enabled by default in Spark 3.x. Automatically coalesces shuffle partitions and converts sort-merge joins to broadcast joins when one side is small at runtime.
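Salting is the strategy interviewers most often ask you to whiteboard. A minimal sketch in plain Python (the key names and NUM_SALTS value are illustrative, not from any library):

```python
import random

NUM_SALTS = 4

def salt_left(key):
    """Skewed side: scatter a hot key across NUM_SALTS synthetic keys."""
    return (key, random.randrange(NUM_SALTS))

def explode_right(row):
    """Small side: replicate each row once per salt so every salted key finds a match."""
    key, payload = row
    return [((key, salt), payload) for salt in range(NUM_SALTS)]

# 1000 skewed fact rows, all sharing one hot join key:
facts = [salt_left("hot_customer") for _ in range(1000)]

# One dimension row for that key, exploded to match every salt:
dim = dict(explode_right(("hot_customer", {"tier": "gold"})))

# The salted join spreads the hot key over NUM_SALTS partitions,
# yet every fact row still finds its dimension payload.
joined = [(key, dim[(key, salt)]) for (key, salt) in facts]
assert len(joined) == 1000
```

The cost is replicating the small side NUM_SALTS times, which is why salting pairs naturally with a dimension table that is small but not small enough to broadcast.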
Category 3: Apache Airflow & Pipeline Orchestration
Q: What is a DAG in Airflow and how does it work?
A DAG (Directed Acyclic Graph) defines the workflow—each node is a task, and edges define dependencies. Airflow's scheduler parses DAG files, determines which tasks are due to run based on their schedule and dependencies, and submits them to the executor (LocalExecutor, CeleryExecutor, KubernetesExecutor).
Q: What is the difference between task-level and DAG-level retries?
Re-running at the DAG level (e.g., clearing a DAG run) restarts every task. Task-level retries (the retries param on an operator) retry only the failed task, preserving successful upstream state. Almost always prefer task-level retries.
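The benefit is easy to demonstrate without Airflow itself. In this plain-Python sketch (names like flaky_load are illustrative), only the failing step re-runs, exactly as a task-level retries setting would behave:

```python
# Why task-level retries beat re-running the whole pipeline:
# the completed upstream step is not repeated.

run_counts = {"extract": 0, "load": 0}
attempts = {"n": 0}

def extract():
    run_counts["extract"] += 1
    return [1, 2, 3]

def flaky_load(rows, retries=3):
    """Fails twice with a transient error, then succeeds: retried in place."""
    for attempt in range(retries + 1):
        attempts["n"] += 1
        try:
            if attempt < 2:
                raise RuntimeError("transient failure")
            run_counts["load"] += 1
            return sum(rows)
        except RuntimeError:
            if attempt == retries:
                raise

rows = extract()            # upstream task succeeds once
total = flaky_load(rows)    # downstream task retries without touching upstream

assert run_counts == {"extract": 1, "load": 1}   # extract never re-ran
assert attempts["n"] == 3                        # two failures + one success
```

In Airflow this maps to `PythonOperator(..., retries=3, retry_delay=timedelta(minutes=5))`; the scheduler re-queues only the failed task instance.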
Q: How do you handle a late-arriving data dependency in Airflow?
Options:
- ExternalTaskSensor: Waits for a task in another DAG to complete before proceeding.
- FileSensor / S3KeySensor: Waits for a file to land before proceeding.
- Deferrable operators (Airflow 2.2+): release the worker slot while waiting; a lightweight triggerer process handles the polling.
- Data-aware scheduling (Airflow 2.4+): DAGs can be triggered when a dataset is updated, rather than on a fixed schedule.
Q: What is an idempotent pipeline and why does it matter?
An idempotent pipeline produces the same result whether it runs once or ten times for the same input. This is critical for safe retries and backfills. Design for idempotency by using INSERT OVERWRITE / MERGE instead of INSERT, partitioning output by date, and using deterministic transformations.
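The INSERT vs INSERT OVERWRITE distinction can be sketched with a dict standing in for the warehouse; running the load twice (a retry) duplicates data in the append version but not in the partition-overwrite version:

```python
# Plain-Python analogy: append-style load vs overwrite-by-partition load.
warehouse_append = []
warehouse_partitioned = {}

def append_load(rows):
    warehouse_append.extend(rows)                   # INSERT: retries duplicate data

def overwrite_partition(partition_date, rows):
    warehouse_partitioned[partition_date] = list(rows)  # INSERT OVERWRITE: retries are safe

rows = [("2024-06-15", "order_1"), ("2024-06-15", "order_2")]

for _ in range(2):                                  # simulate a retry after a partial failure
    append_load(rows)
    overwrite_partition("2024-06-15", rows)

print(len(warehouse_append))                        # 4 rows: duplicated
print(len(warehouse_partitioned["2024-06-15"]))     # 2 rows: same as a single run
```

Partitioning output by date is what makes the overwrite cheap: a backfill of one day rewrites only that day's partition.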
Category 4: Apache Kafka & Streaming
Q: What is a Kafka topic partition and why does it matter?
A partition is the unit of parallelism in Kafka. Each partition is an ordered, immutable log. Within a partition, order is guaranteed; across partitions, it is not. More partitions = more throughput but more overhead. Consumer group members read from distinct partitions, so partition count caps consumer parallelism.
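Key-based partitioning is what makes the per-partition ordering guarantee useful. A sketch of the idea (Kafka's default partitioner hashes the key bytes with murmur2; crc32 stands in here purely as a deterministic hash):

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Stand-in for Kafka's hash(key) % num_partitions routing.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

partitions = [[] for _ in range(NUM_PARTITIONS)]
events = [("card_42", "auth"), ("card_7", "auth"), ("card_42", "capture")]
for key, value in events:
    partitions[partition_for(key)].append((key, value))

# All events for card_42 land in one partition, in produced order:
p = partition_for("card_42")
assert [v for k, v in partitions[p] if k == "card_42"] == ["auth", "capture"]
```

This is also why changing the partition count on a keyed topic is disruptive: the modulo changes, so the same key can start routing to a different partition.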
Q: What does "exactly-once semantics" mean in Kafka? How do you achieve it?
- At-most-once: Messages may be lost but never duplicated.
- At-least-once: Messages are never lost but may be duplicated.
- Exactly-once: Messages are delivered and processed precisely once.
Kafka achieves exactly-once semantics (EOS) via idempotent producers (enabled with enable.idempotence=true) and transactional APIs. Kafka Streams and Flink have native EOS support. For Spark Structured Streaming, use checkpoint-based offsets with idempotent sinks.
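The "idempotent sink" half of that story can be sketched in plain Python: under at-least-once delivery, deduplicating on a stable record ID turns duplicate deliveries into no-ops. This is a simplified analogy for what the idempotent producer does broker-side with sequence numbers:

```python
# Effectively-once processing on top of at-least-once delivery,
# via consumer-side dedup on a stable record ID.
processed_ids = set()
balance = 0

def apply_once(record):
    global balance
    record_id, amount = record
    if record_id in processed_ids:   # duplicate delivery: skip
        return
    processed_ids.add(record_id)
    balance += amount

# txn-1 is delivered twice (at-least-once) but applied only once.
for rec in [("txn-1", 100), ("txn-2", 50), ("txn-1", 100)]:
    apply_once(rec)

assert balance == 150
```

In a real sink the processed-ID set must itself be durable and updated atomically with the write (e.g., in the same database transaction), or the dedup guarantee disappears on a crash.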
Q: How would you design a pipeline to detect fraudulent transactions in under 500ms?
Architecture: Transactions → Kafka → Flink/Spark Structured Streaming → Feature computation (windowed aggregations on card history) → Fraud model inference → Kafka output topic → Alert service. Key considerations: consumer lag monitoring, exactly-once processing guarantees, feature store for low-latency feature retrieval, and circuit breakers on the ML model call.
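One building block of that pipeline, the windowed aggregation on card history, can be sketched in plain Python. SlidingCount and the 60-second window are illustrative; a real job would hold this in Flink keyed state, and this sketch assumes timestamps arrive in order:

```python
from collections import deque

WINDOW_SECONDS = 60

class SlidingCount:
    """Transactions per card over a sliding window: a typical low-latency fraud feature."""
    def __init__(self):
        self.events = deque()  # timestamps within the window, oldest first

    def add(self, ts: float) -> int:
        self.events.append(ts)
        # Evict everything older than the window before reading the count.
        while self.events and self.events[0] <= ts - WINDOW_SECONDS:
            self.events.popleft()
        return len(self.events)  # current feature value fed to the model

counter = SlidingCount()
assert counter.add(0.0) == 1
assert counter.add(10.0) == 2
assert counter.add(65.0) == 2   # the event at t=0 has aged out
```

Handling out-of-order events is what pushes real systems toward event-time windows with watermarks rather than this processing-time simplification.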
Category 5: Data Pipeline Design (System Design for Data Engineers)
Q: Design a pipeline to ingest and serve 1TB of clickstream data per day with P95 query latency under 2 seconds.
Expected discussion points:
- Ingestion: Kafka or Kinesis for real-time, S3 landing zone for batch.
- Processing: Spark on EMR or Databricks for transformation.
- Storage: Delta Lake or Iceberg for ACID transactions + time travel. Parquet format partitioned by date.
- Serving layer: Athena for ad-hoc, Redshift or Snowflake for BI tools, pre-aggregated tables for dashboards.
- Orchestration: Airflow DAGs with SLA monitoring.
- Observability: Great Expectations for data quality assertions, PagerDuty alerts on pipeline failures.
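The data-quality gate in the observability bullet amounts to running assertions on each batch before serving it. A hypothetical plain-Python version in the spirit of a Great Expectations suite (the expectation names and fields are illustrative):

```python
def check_clickstream_batch(rows):
    """Return the names of failed expectations for a batch of event dicts."""
    failures = []
    if not rows:
        failures.append("batch_not_empty")
        return failures
    if any(r.get("user_id") is None for r in rows):
        failures.append("user_id_not_null")
    if any(r.get("event_ts", 0) <= 0 for r in rows):
        failures.append("event_ts_positive")
    return failures

good = [{"user_id": 1, "event_ts": 1718400000}]
bad = [{"user_id": None, "event_ts": 0}]
assert check_clickstream_batch(good) == []
assert set(check_clickstream_batch(bad)) == {"user_id_not_null", "event_ts_positive"}
```

Wiring this into the DAG as a task that fails on any non-empty failure list is what stops bad data upstream of the serving layer instead of in a dashboard.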
Summary
Data engineering interviews reward deep knowledge of distributed systems fundamentals combined with practical pipeline design experience. Study Spark internals, design idempotent pipelines, and be ready to whiteboard an end-to-end data architecture.
Practice data engineering system design with Interview Masters AI →
