Introduction
The ML system design interview is the most underestimated round in the machine learning interview loop. Many strong ML researchers, including those who can derive backpropagation from scratch, fail this round because they approach it as a research problem rather than an engineering problem.
This guide covers the framework, common problem types, and the signal interviewers are looking for at companies like Google, Meta, Airbnb, Netflix, and Spotify.
The ML System Design Framework
Unlike pure system design, ML system design has a specific lifecycle. Use this structure for every problem:
- Problem Formulation: Define the ML problem type (classification? ranking? regression?), the objective, and the proxy metric.
- Data: Training data sources, labeling strategy, feature engineering, data pipeline.
- Modeling: Model selection and architecture, training infrastructure, loss function.
- Evaluation: Offline metrics, online metrics (A/B test), business metrics.
- Serving & Inference: Latency requirements, serving infrastructure, batch vs. real-time.
- Monitoring & Retraining: Model drift detection, retraining triggers, MLOps.
Problem 1: Design a News Feed Ranking System
Problem formulation: The goal is to maximize long-term user engagement. Proxy metric: P(user engages with post) where engagement = like, comment, share, or dwell time > 5 seconds.
Frame as a ranking problem: given a candidate set of N posts, rank them by predicted engagement probability.
Training data:
- Implicit feedback: clicks, dwell time, shares (positive signals).
- Skip events: scrolled past without engaging (negative signals).
- Need to handle exposure bias: posts that were never shown have no feedback, creating selection bias in training data.
Feature engineering:
User features: Age of account, historical engagement rate, content preferences (embeddings from past interactions), time of day, device type.
Post features: Author affinity (how often does this user engage with this author?), content recency, media type (video > image > text for engagement), post embeddings (from a pre-trained transformer).
Cross features: User-post interaction features, predicted topic affinity.
Model architecture:
- Candidate generation: Two-tower model (user embedding × item embedding → cosine similarity) for fast retrieval from millions of posts.
- Ranking: Deep neural network with wide features (memorization) + deep features (generalization), similar to Google's Wide & Deep architecture.
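The candidate-generation step above can be sketched in a few lines of numpy: a brute-force cosine-similarity top-K stand-in for the ANN index, assuming the two towers have already produced fixed-size user and post embeddings. All names here are illustrative, not a specific library's API.

```python
import numpy as np

def cosine_scores(user_emb, item_embs):
    """Score one user embedding against a matrix of item embeddings."""
    user_emb = user_emb / np.linalg.norm(user_emb)
    item_embs = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    return item_embs @ user_emb  # cosine similarity per item

def retrieve_top_k(user_emb, item_embs, k=500):
    """Indices of the k most similar items (brute-force stand-in for FAISS/ScaNN)."""
    scores = cosine_scores(user_emb, item_embs)
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
user = rng.normal(size=64)           # output of the user tower
items = rng.normal(size=(1000, 64))  # pre-computed post embeddings
top = retrieve_top_k(user, items, k=5)
```

In production the item embeddings are indexed offline and only the user tower runs at request time, which is what makes the two-tower factorization fast.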
Serving:
- Candidate generation: Approximate Nearest Neighbor (FAISS, ScaNN) against pre-computed post embeddings. Run every few minutes.
- Ranking model: Inference on top-K candidates (e.g., K=500). Real-time inference, P99 < 100ms.
- Feature store: Pre-computed user features cached in Redis for low-latency retrieval.
Problem 2: Design a Real-Time Fraud Detection System
Problem formulation: Binary classification: P(transaction is fraudulent). Objective: maximize fraud recall while keeping false positive rate below a threshold (false positives mean declined legitimate transactions → churn).
Key constraints:
- Latency: < 50ms (must respond before the transaction completes).
- Class imbalance: Fraud rate is typically < 0.1%. A naive model that always predicts "not fraud" achieves 99.9% accuracy but zero utility.
Handling class imbalance:
- Oversample the minority class (e.g., SMOTE) or undersample the majority class.
- Use focal loss to down-weight easy negatives.
- Calibrate prediction probabilities—raw model outputs are not reliable probabilities with imbalanced data.
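To make the focal loss idea concrete, here is a minimal numpy sketch of the binary focal loss from Lin et al.; `gamma` and `alpha` are the standard knobs, and the function name is illustrative.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy, well-classified examples.

    p: predicted P(fraud); y: labels in {0, 1}; alpha weights the positive class.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

An easy negative (p near 0, y = 0) contributes almost nothing because the (1 - p_t)^gamma factor is near zero, so the abundant easy negatives stop dominating the gradient.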
Features:
- Transaction features: amount, merchant category, location, device fingerprint.
- Velocity features: number of transactions in last 1 min / 5 min / 1 hour per card (requires real-time stream processing via Flink or Spark Structured Streaming).
- Graph features: Has this merchant been associated with fraud before? Social graph features from payment network.
- Behavioral features: Is this transaction consistent with the user's historical patterns?
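The velocity features above amount to sliding-window counts keyed by card. A toy in-memory sketch (a production system would compute this in Flink keyed windows; `VelocityCounter` is an illustrative name):

```python
from collections import deque

class VelocityCounter:
    """Sliding-window transaction counter per card."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = {}  # card_id -> deque of event timestamps

    def record(self, card_id, ts):
        self.events.setdefault(card_id, deque()).append(ts)

    def count(self, card_id, now):
        q = self.events.get(card_id, deque())
        while q and now - q[0] > self.window:
            q.popleft()  # evict events that fell out of the window
        return len(q)
```

One counter instance per window size (1 min, 5 min, 1 hour) yields the feature vector; eviction on read keeps memory bounded per active card.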
Architecture:
Transaction event → Kafka →
Real-time feature computation (Flink) →
Feature store (Redis) →
ML inference service (ONNX model, <10ms inference) →
Risk score → Decision engine (rules + ML) →
Allow/Deny response
Monitoring: Track false positive rate, false negative rate, and model score distribution daily. Flag for retraining when score distribution drifts (using KL divergence or PSI metric).
Problem 3: Design a Search Ranking System
Problem formulation: Given a query, rank a candidate set of documents by relevance. Objective: NDCG (Normalized Discounted Cumulative Gain) or MRR (Mean Reciprocal Rank). Business metric: CTR, search session success rate.
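Since NDCG is the objective here, it helps to be able to write it down. A minimal numpy sketch using graded relevance and the common 2^rel - 1 gain:

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(rel) + 2))  # positions 1..n -> log2(2..n+1)
    return float(np.sum((2 ** rel - 1) / discounts))

def ndcg(ranked_relevances):
    """DCG of the ranking divided by DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; swapping a relevant result down the list costs more at the top than at the bottom, which is exactly the property CTR-style metrics lack.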
Learning to Rank (LTR) approaches:
- Pointwise: Predict a relevance score for each query-document pair independently. Simple but ignores relative ordering.
- Pairwise: Predict which of two documents is more relevant. RankNet, LambdaRank.
- Listwise: Optimize for list-level ranking metrics directly. LambdaMART is the industry standard and is widely used in production search systems.
Query understanding:
- Spell correction (noisy channel model).
- Query expansion: synonyms, related terms from embedding space.
- Query classification: navigational vs. informational vs. transactional intent.
Document features:
- BM25 score (traditional term-frequency relevance).
- Semantic similarity: Dense embeddings from bi-encoder (BERT-based) for query-document similarity.
- Document quality signals: PageRank, freshness, click-through rate, bounce rate.
Serving pipeline:
- Retrieval: BM25 (Elasticsearch) + ANN search on dense embeddings (FAISS). Union of both candidate sets.
- Lightweight ranking (L1): Fast XGBoost model on sparse features. Reduces 10,000 candidates to 100.
- Deep ranking (L2): Cross-encoder (query + document concatenated) — most accurate but slow. Applied to top 100 candidates only.
- Business logic layer: Apply diversity rules, sponsored content insertion, safety filters.
Evaluation & Metrics
Q: What is the difference between offline and online evaluation for ML systems?
Offline evaluation: Evaluate the model on a held-out test set using metrics like AUC-ROC, NDCG, F1. Fast and cheap, but offline gains often fail to translate into real-world business impact.
Online evaluation (A/B testing): Deploy the new model to a random subset of users and measure business metrics (engagement rate, revenue, churn). Requires statistical rigor: sufficient sample size, proper randomization, correction for multiple comparisons.
Never trust offline metrics alone. A model that improves AUC by 0.01 may hurt revenue if it promotes content that gets short clicks but high bounce rates.
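The "statistical rigor" requirement can be made concrete with a standard two-proportion z-test on, say, CTR in control vs. treatment; a minimal sketch with an illustrative function name:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic for the difference between two conversion rates (pooled variance)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

For a single comparison, |z| > 1.96 corresponds to significance at the 5% level; when many metrics are checked per experiment, a multiple-comparison correction (e.g., Bonferroni) is needed, which is what L74's caveat refers to.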
MLOps & Monitoring
Q: What is concept drift and how do you detect it?
Concept drift occurs when the statistical relationship between features and the target changes over time—common in fraud detection (fraud patterns evolve), recommendation (user tastes change), and any domain with non-stationary data.
Detection strategies:
- Monitor prediction score distribution over time. Use PSI (Population Stability Index) or KL divergence vs. training distribution.
- Monitor business metrics (CTR, fraud rate) for unexplained shifts.
- Use challenger models: continuously train a fresh model on recent data and compare to the production model.
Retraining triggers: Time-based (daily/weekly) for slowly changing domains; performance-based (when PSI exceeds threshold) for fast-changing domains.
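The PSI check described above is simple to implement: bin the training-time score distribution into quantiles and compare recent scores against it. A numpy sketch, function name illustrative:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a recent score sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range scores
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)           # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

A common rule of thumb: PSI < 0.1 is stable, 0.1 to 0.25 is a moderate shift worth investigating, and > 0.25 signals significant drift and a retraining trigger.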
Summary
ML system design interviews reward engineers who can bridge research and production. Demonstrate that you understand the full lifecycle: from problem formulation and data quality to low-latency serving and drift monitoring. Never focus exclusively on model architecture—the serving infrastructure and monitoring strategy are equally important.
