3-Tier AWS Architecture
Production-ready VPC layout across 3 AZs · public + private subnets · Multi-AZ database · ALBs at the web and app tiers
🌐 Internet → Route 53 (DNS) → CloudFront (CDN) → WAF + Shield
↓
VPC · 10.0.0.0/16
⚖️
frontend-alb · Internet-Facing · Port 443 HTTPS · spans all 3 AZs
↓
☁️ Availability Zone A
public-web-subnet-a · 10.0.0.0/20
NAT Gateway
private-web-subnet-a · 10.0.48.0/20
frontend-server
private-app-subnet-a · 10.0.96.0/20
backend-server
private-db-subnet-a · 10.0.144.0/20
RDS Primary ★
☁️ Availability Zone B
public-web-subnet-b · 10.0.16.0/20
NAT Gateway
private-web-subnet-b · 10.0.64.0/20
frontend-server
private-app-subnet-b · 10.0.112.0/20
backend-server
private-db-subnet-b · 10.0.160.0/20
RDS Standby ↔
☁️ Availability Zone C
public-web-subnet-c · 10.0.32.0/20
private-web-subnet-c · 10.0.80.0/20
frontend-server
private-app-subnet-c · 10.0.128.0/20
backend-server
private-db-subnet-c · 10.0.176.0/20
RDS Read Replica
⚖️
backend-alb · Internal · Port 8080 · spans all 3 AZs · NOT internet-facing
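The NAT pattern in the diagram — NAT Gateway placed in a public subnet, private subnets routing 0.0.0.0/0 through it — can be sketched in CloudFormation-style YAML. A hedged sketch only; resource names, the AZ, and references like `VPC` and `PrivateRouteTableA` are illustrative:

```yaml
# Illustrative CloudFormation fragment for AZ-A. Assumes a VPC resource
# (10.0.0.0/16) and a route table associated with the private subnets.
PublicSubnetA:
  Type: AWS::EC2::Subnet
  Properties:
    VpcId: !Ref VPC                  # the 10.0.0.0/16 VPC from the diagram
    CidrBlock: 10.0.0.0/20
    AvailabilityZone: us-east-1a     # example AZ
    MapPublicIpOnLaunch: true

NatEipA:
  Type: AWS::EC2::EIP
  Properties:
    Domain: vpc

NatGatewayA:
  Type: AWS::EC2::NatGateway
  Properties:
    SubnetId: !Ref PublicSubnetA     # NAT lives in the PUBLIC subnet
    AllocationId: !GetAtt NatEipA.AllocationId

PrivateRouteA:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref PrivateRouteTableA   # route table of the private subnets
    DestinationCidrBlock: 0.0.0.0/0
    NatGatewayId: !Ref NatGatewayA          # outbound-only internet access
```

This is exactly interview point (1): private servers reach the internet outbound via the NAT, but nothing can initiate a connection inbound to them.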
Interview Tip: Key points to mention — (1) NAT Gateway in public subnet for private server outbound traffic, (2) ALBs in public subnets route to private servers, (3) backend-alb is internal-only, (4) RDS Multi-AZ = synchronous replication = automatic failover in 1-2 mins, (5) Read Replica in AZ-C for read scaling, (6) Security Groups at instance level + NACLs at subnet level.
Kubernetes Architecture
Control plane components · worker node components · how kubectl commands flow through the cluster
🧠 Control Plane (Master Node)
🌐 API Server
Front door of the cluster. ALL communication goes through here — kubectl, pipelines, internal components. Validates and processes REST requests.
🗄️ etcd
Distributed key-value store. Brain of the cluster. Stores ALL cluster state — configs, secrets, resource definitions, current status.
📅 Scheduler
Watches for unscheduled pods. Selects best node based on: resources, taints/tolerations, affinity rules, node conditions.
🔄 Controller Manager
Runs control loops. ReplicaSet controller maintains desired pod count. Node controller monitors node health. Job controller manages batch jobs.
☁️ Cloud Controller
Integrates with cloud provider APIs. Manages Load Balancers, Node lifecycle and Routes in AWS/GCP/Azure.
⚙️ Worker Nodes
Node 1 · m5.xlarge
🔗 kubelet
Agent on every node. Receives pod specs from API server. Ensures containers are running and healthy.
🔀 kube-proxy
Manages iptables/IPVS rules. Enables pod-to-pod and service-to-pod communication.
📦 Container Runtime
containerd / CRI-O. Actually runs the containers.
pod: api-1
pod: api-2
pod: worker-1
Node 2 · m5.xlarge
🔗 kubelet
Same as Node 1. Every node has a kubelet reporting to control plane.
🔀 kube-proxy
Each node has its own kube-proxy maintaining local network rules.
📦 Container Runtime
containerd / CRI-O. Pulls images, creates containers.
pod: frontend-1
pod: cache-1
📋 Scheduler Decision Flow
1. Filtering → remove nodes with insufficient CPU/Memory, taints without toleration, wrong affinity
2. Scoring → rank remaining nodes by available resources, spread policies, affinity score
3. Binding → pod assigned to highest scoring node
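The filter and score inputs above map directly to fields in the pod spec. A sketch, with illustrative names and an example placeholder image:

```yaml
# Illustrative pod spec showing the fields the scheduler reads.
apiVersion: v1
kind: Pod
metadata:
  name: api-1                         # illustrative name
spec:
  containers:
  - name: api
    image: example/api:1.0            # placeholder image
    resources:
      requests:                       # Filtering: nodes lacking this capacity are removed
        cpu: "500m"
        memory: 512Mi
  tolerations:                        # Filtering: permits scheduling onto matching tainted nodes
  - key: dedicated
    operator: Equal
    value: api
    effect: NoSchedule
  affinity:
    nodeAffinity:                     # Scoring: matching nodes rank higher
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values: ["m5.xlarge"]
```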
Interview Tip: etcd = database of cluster, API Server = single entry point, Scheduler = decides WHERE pods run, Controller Manager = ensures desired state matches actual state, kubelet = ensures pods RUN on each node, kube-proxy = handles NETWORKING between pods.
Complete CI/CD Pipeline
Code commit → build → test → scan → deploy · with tools at each stage · from developer to production
1
💾 Code Commit & PR
GitHub
GitLab
Developer pushes feature branch → opens Pull Request → pipeline triggers automatically on PR creation. Branch protection rules prevent direct commits to main.
2
🔍 Static Code Analysis
SonarQube
ESLint
Quality Gates: code coverage ≥ 80%, zero critical issues, duplication < 3%. If gate fails → pipeline stops, developer notified. No broken code proceeds.
3
🧪 Unit & Integration Tests
JUnit
pytest
Run all unit tests and integration tests. Parallel test execution to reduce pipeline time. Test results published as artifacts for review.
4
🔒 Security Scanning (SAST)
Checkov
Terrascan
SAST scan for SQL injection, XSS, broken auth. For Terraform code: Checkov/Terrascan checks for open security groups, unencrypted storage, IAM misconfigs.
5
🐳 Build & Container Scan
Docker
Trivy
Aqua
Multi-stage Docker build (slim final image). Trivy/Aqua scans image for CVEs → --exit-code 1 fails pipeline on CRITICAL findings. Sign image for Binary Authorization.
6
📤 Push to Registry
ECR
GCR
Tag: app:1.3.0-build-42-abc123f · Push verified + signed image to ECR. Update Helm chart values.yaml with new image tag. Commit back to GitOps repo.
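The GitOps commit in this step is typically a one-line image-tag bump in the chart's values.yaml. A sketch — the file path and ECR repository URL are illustrative:

```yaml
# charts/app/values.yaml (illustrative path) — CI rewrites only the tag
image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/app  # example ECR repo
  tag: 1.3.0-build-42-abc123f   # tag produced by the build stage above
```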
7
🚀 Deploy via ArgoCD
ArgoCD
Helm
ArgoCD detects new image tag in Git → triggers Helm upgrade on EKS → Rolling update / Canary strategy → readiness/liveness probes validate. Auto-rollback on failure.
8
✅ Manual Gate → Production
Approval
Pipeline pauses for senior engineer review → approves Terraform plan / deployment plan → ArgoCD syncs to production cluster. Full audit trail in Git.
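Stages 1–6 can be sketched as one GitHub Actions workflow. A hedged sketch under assumptions: `make test`, the image name, and action versions are illustrative, not a definitive pipeline:

```yaml
# Illustrative CI workflow: trigger on PR, test, build, scan.
name: ci
on:
  pull_request:
    branches: [main]
jobs:
  build-test-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Unit tests                             # stage 3
        run: make test                               # assumed build target
      - name: Build image                            # stage 5
        run: docker build -t app:${{ github.sha }} .
      - name: Scan image with Trivy                  # stage 5
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: app:${{ github.sha }}
          exit-code: "1"                             # fail pipeline on findings
          severity: CRITICAL
```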
Interview Tip: Mention BOTH CI (build/test/scan = GitHub Actions / GitLab CI / Jenkins) and CD (deploy = ArgoCD) as separate concerns. Key terms: Quality Gate, SAST, image signing, binary authorization, GitOps, rolling update, canary, manual approval gate.
Terraform Directory Structure
Multi-environment modular structure · best practices · remote state · workspace strategy
terraform/
  modules/                  # reusable components
    vpc/
      main.tf  variables.tf  outputs.tf
    eks/
      main.tf  variables.tf  outputs.tf
    rds/
      main.tf  variables.tf  outputs.tf
    security/
      main.tf  variables.tf  outputs.tf
  environments/             # env-specific configs
    dev/
      main.tf  backend.tf
      terraform.tfvars      # t3.medium
    stage/
      main.tf  backend.tf
      terraform.tfvars      # t3.large
    prod/
      main.tf  backend.tf
      terraform.tfvars      # m5.xlarge
  landing-zone/             # multi-account
    logging-account/
    security-account/
    network-account/
    workload-accounts/
🔒 Remote Backend Config
terraform {
  backend "s3" {
    bucket         = "tf-state-prod"
    key            = "prod/eks/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-lock"
  }
}
🔄 Module Call Pattern
module "eks" {
  source       = "../../modules/eks"
  cluster_name = "prod-cluster"
  node_type    = var.node_instance_type
  min_nodes    = 3
  max_nodes    = 10
}
🛡️ Prevent Manual Deletion
1. Tag all TF resources: ManagedBy = "terraform"
2. IAM Deny policy with condition:
→ Effect: Deny
→ Action: ec2:TerminateInstances
→ Condition: aws:ResourceTag/ManagedBy = "terraform"
3. Attach the Deny policy to human IAM users/roles (not the Terraform execution role, or Terraform itself can no longer destroy)
4. Explicit Deny always overrides Allow
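Step 2 can be sketched as an IAM managed policy in CloudFormation-style YAML. The logical name and policy name are illustrative:

```yaml
# Illustrative: deny manual termination of Terraform-managed instances.
DenyTerraformManagedTermination:
  Type: AWS::IAM::ManagedPolicy
  Properties:
    ManagedPolicyName: deny-tf-managed-termination   # example name
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Deny                    # explicit Deny overrides any Allow
          Action: ec2:TerminateInstances
          Resource: "*"
          Condition:
            StringEquals:
              aws:ResourceTag/ManagedBy: terraform   # matches the TF tag
```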
State Locking
S3 stores the state file. DynamoDB creates a lock entry when apply runs. A second engineer gets "Error acquiring the state lock" until the first finishes.
Drift Detection
terraform plan -detailed-exitcode detects drift (exit code 2 = changes present). terraform import pulls manually created resources into state. terraform apply -refresh-only (the modern replacement for terraform refresh) syncs state without changing infrastructure.
Safe Rename
terraform state mv old_name new_name renames resource in state only. No infrastructure change. Prevents accidental delete+recreate.
Interview Tip: Always mention 5 Terraform best practices: (1) modular structure, (2) remote backend with versioning, (3) state locking with DynamoDB, (4) terraform plan before apply, (5) secrets in Vault/Secrets Manager never in .tf files.
DevOps Security Layers
Defense in depth · security at every layer from code to runtime · DevSecOps approach
Layer 1 — Code Security
Security starts at the developer's machine. Secrets must never enter the codebase — git-secrets and pre-commit hooks block them before commit; SAST catches vulnerabilities before code reaches any environment.
SonarQube
git-secrets
pre-commit hooks
OWASP ZAP (DAST)
Layer 2 — Container Security
Scan images for CVEs before they reach any cluster. Only signed images allowed to run. Minimal base images reduce attack surface.
Trivy (--exit-code 1)
Aqua Security
Binary Authorization
Cosign (image signing)
Layer 3 — Kubernetes Security
Control what runs in the cluster, who can access what, and how pods communicate with each other.
RBAC
NetworkPolicy
OPA Gatekeeper
Pod Security Admission
IRSA (no access keys)
Istio mTLS
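As a concrete Layer-3 example, the standard starting point for NetworkPolicy is default-deny ingress. The namespace name is illustrative:

```yaml
# Deny all ingress to every pod in the namespace; traffic must then be
# explicitly allowed by additional NetworkPolicies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod            # illustrative namespace
spec:
  podSelector: {}            # empty selector = ALL pods in the namespace
  policyTypes:
  - Ingress                  # no ingress rules listed = nothing allowed in
```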
Layer 4 — Cloud / Network Security
AWS-level protection. Control traffic at subnet level, protect against DDoS, manage secrets centrally.
Security Groups
NACLs
WAF
AWS Shield
Secrets Manager
IAM least privilege
VPC Flow Logs
Layer 5 — Audit & Compliance
Continuous monitoring for threats, configuration drift and compliance violations. Everything is logged and audited.
CloudTrail
GuardDuty
AWS Config
Falco (runtime)
Prisma Cloud
Security Hub
Interview Tip: Always answer security questions in layers — code → container → Kubernetes → cloud → audit. Key phrase: "Defense in Depth". Never hardcode secrets. IRSA for EKS pod AWS access (no access keys). Explicit Deny in IAM always overrides Allow.
GitOps Flow with ArgoCD
Git as single source of truth · continuous reconciliation · drift detection · declarative deployments
Developer
Pushes code to feature branch, opens PR
→
CI Pipeline
Build, test, scan, push image to ECR
→
Git Repo
Update Helm values.yaml with new image tag
→
ArgoCD
Detects Git change, syncs to cluster
→
EKS Cluster
Rolling update deployed, health checks pass
🔄 GitOps vs Traditional CI/CD
Traditional: pipeline pushes changes to the cluster (kubectl apply from CI) · cluster can drift unnoticed · rollback = re-run an old pipeline.
GitOps: ArgoCD pulls desired state from Git · continuous reconciliation detects and corrects drift · rollback = git revert.
🎯 ArgoCD Key Concepts
A
Application
ArgoCD resource that links a Git repo path to a cluster/namespace target
S
Sync
Process of making cluster match the desired state in Git. Auto or manual.
D
Drift
When actual cluster state differs from Git. ArgoCD shows OutOfSync status.
H
Health
ArgoCD checks if deployed resources are healthy — pods running, ingress responding.
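The four concepts above come together in a single Application manifest. A sketch — repo URL, chart path, and namespaces are illustrative:

```yaml
# Illustrative ArgoCD Application linking a Git path to a cluster target.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-repo.git  # example repo
    targetRevision: main
    path: charts/app             # Helm chart directory watched for changes
  destination:
    server: https://kubernetes.default.svc   # in-cluster target
    namespace: prod
  syncPolicy:
    automated:
      prune: true                # delete resources removed from Git
      selfHeal: true             # revert manual drift (OutOfSync → Synced)
```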
Interview Tip: GitOps = Git is the source of truth. ArgoCD = the reconciler that keeps cluster in sync with Git. Helm = templating for K8s manifests. The key benefit: if cluster is destroyed, ArgoCD can recreate everything from Git in minutes. Rollback = git revert commit.
Disaster Recovery Strategies
RTO vs RPO tradeoffs · cold vs warm vs hot standby · cost vs availability spectrum
RTO = Recovery Time Objective (how fast you recover) | RPO = Recovery Point Objective (how much data you lose)
💰 Low Cost
────────────────→
💰💰💰 High Cost
TIER 1
🧊 Cold Standby
RTO: Hours
RPO: Hours (last backup)
→ Backups stored in S3 + Glacier
→ AMIs / snapshots ready
→ Infrastructure defined in Terraform
→ Spin up ONLY when disaster occurs
→ Nothing running = zero idle cost
→ Acceptable for non-critical services
Cost: $ (lowest)
TIER 2
🌡️ Warm Standby
RTO: Minutes (5-30 min)
RPO: Seconds (near real-time)
→ Scaled-down infra running in DR region
→ RDS Multi-AZ + Read Replicas
→ DynamoDB Global Tables
→ S3 Cross-Region Replication
→ Route 53 health check failover
→ Scale up when primary fails
Cost: $$ (moderate)
TIER 3
🔥 Hot Standby
RTO: Seconds (near zero)
RPO: Zero (no data loss)
→ Full infra running in BOTH regions
→ Active-active or active-passive
→ Route 53 health checks + failover
→ Instant traffic rerouting
→ DynamoDB Global Tables (sync)
→ For payment / order / auth services
Cost: $$$ (highest)
💡 Hybrid Strategy (Cost-Sensitive E-Commerce)
Critical Services → Hot/Warm Standby:
→ Order creation, Payments, User auth
→ Cannot afford even 1 minute of downtime
→ Revenue directly impacted
Non-Critical Services → Cold Standby:
→ Reviews, Feedback, Complaints
→ Users can tolerate some downtime
→ Saves significant DR infrastructure cost
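The Route 53 failover mechanism all three tiers rely on can be sketched in CloudFormation-style YAML. Domain names, the health-check path, and record targets are illustrative:

```yaml
# Primary record answers while its health check passes; Route 53
# automatically serves the secondary (DR region) record when it fails.
PrimaryHealthCheck:
  Type: AWS::Route53::HealthCheck
  Properties:
    HealthCheckConfig:
      Type: HTTPS
      FullyQualifiedDomainName: primary-alb.example.com  # example endpoint
      ResourcePath: /healthz                             # assumed health path
      FailureThreshold: 3

PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: app.example.com.
    Type: CNAME
    TTL: "60"
    SetIdentifier: primary
    Failover: PRIMARY
    HealthCheckId: !Ref PrimaryHealthCheck
    ResourceRecords: [primary-alb.example.com]

SecondaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: app.example.com.
    Type: CNAME
    TTL: "60"
    SetIdentifier: secondary
    Failover: SECONDARY              # used only when PRIMARY is unhealthy
    ResourceRecords: [dr-alb.example.com]
```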
Interview Tip: Always ask "what is the RTO and RPO requirement?" before recommending a DR strategy. Then mention cost. The hybrid approach (hot standby for critical + cold for non-critical) is the best answer for cost-sensitive customers. Route 53 health checks are the routing mechanism.
Observability & Monitoring Stack
Metrics · logs · traces · alerts · the three pillars of observability with tools
Prometheus
Metrics Collection
Pull-based scraping
kube-state-metrics
node-exporter
ServiceMonitor CRD
Alertmanager
PromQL queries
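Alertmanager routing starts from alert rules defined as PrometheusRule resources (Prometheus Operator). A minimal sketch — the metric name, threshold, and labels are illustrative:

```yaml
# Illustrative PrometheusRule: alert when the 5xx share of requests
# exceeds 5% for 10 minutes. Assumes an http_requests_total metric.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts               # illustrative name
spec:
  groups:
  - name: app.rules
    rules:
    - alert: HighErrorRate
      expr: sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
      for: 10m
      labels:
        severity: critical       # Alertmanager routes on this label
      annotations:
        summary: "Error rate above 5% for 10 minutes"
```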
Grafana
Visualization
Custom dashboards
Namespace-wise CPU/RAM
API latency histogram
Error rate panels
Import: dashboard 315, 6417
Slack/email alerts
AWS X-Ray
Distributed Tracing
Request tracing
Service map
Latency breakdown
API bottlenecks
Cross-service traces
Error root cause
Datadog
Full Observability
APM + tracing
Log management
Infrastructure map
DaemonSet agent
Custom metrics
Anomaly detection
📋 3 Pillars of Observability
📊
Metrics
Numeric measurements over time. CPU %, memory usage, request rate, error rate, latency p99. → Prometheus + Grafana
📝
Logs
Timestamped text records of events. Application errors, access logs, audit logs. → Fluent Bit + Loki / Datadog / CloudWatch
🔍
Traces
End-to-end request flow across microservices. Shows WHERE latency comes from. → AWS X-Ray / OpenTelemetry / Jaeger
⚙️ ServiceMonitor Pattern
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
→ Prometheus Operator auto-discovers
→ No manual target config needed
→ App just needs /metrics endpoint
Interview Tip: The 3 pillars = Metrics + Logs + Traces. kube-state-metrics = K8s object states (pod running/pending). node-exporter = hardware metrics (CPU/RAM/disk). ServiceMonitor = how Prometheus auto-discovers new apps. Alertmanager routes to Slack/email. CloudWatch for AWS native, Prometheus+Grafana for K8s metrics.