🚀 DevOps Interview Diagrams

// 8 essential architecture diagrams for your DevOps interview prep

01 · 3-Tier AWS
02 · K8s Architecture
03 · CI/CD Pipeline
04 · Terraform Structure
05 · Security Layers
06 · GitOps Flow
07 · DR Strategies
08 · Monitoring Stack
3-Tier AWS Architecture
Production-ready VPC layout across 3 AZs · public + private subnets · multi-AZ database · internet-facing ALB for the web tier + internal ALB for the app tier
🌐 Internet  →  Route 53 (DNS)  →  CloudFront (CDN)  →  WAF + Shield
VPC · 10.0.0.0/16
⚖️ frontend-alb · Internet-Facing · Port 443 HTTPS · spans all 3 AZs
☁️ Availability Zone A
public-web-subnet-a · 10.0.0.0/20
🌐 NAT Gateway
private-web-subnet-a · 10.0.48.0/20
🖥️ frontend-server
private-app-subnet-a · 10.0.96.0/20
⚙️ backend-server
private-db-subnet-a · 10.0.144.0/20
🗄️ RDS Primary ★
☁️ Availability Zone B
public-web-subnet-b · 10.0.16.0/20
🌐 NAT Gateway
private-web-subnet-b · 10.0.64.0/20
🖥️ frontend-server
private-app-subnet-b · 10.0.112.0/20
⚙️ backend-server
private-db-subnet-b · 10.0.160.0/20
🗄️ RDS Standby ↔
☁️ Availability Zone C
public-web-subnet-c · 10.0.32.0/20
private-web-subnet-c · 10.0.80.0/20
🖥️ frontend-server
private-app-subnet-c · 10.0.128.0/20
⚙️ backend-server
private-db-subnet-c · 10.0.176.0/20
📖 RDS Read Replica
⚖️ backend-alb · Internal · Port 8080 · spans all 3 AZs · NOT internet-facing
Interview Tip: Key points to mention — (1) NAT Gateway in public subnet for private server outbound traffic, (2) ALBs in public subnets route to private servers, (3) backend-alb is internal-only, (4) RDS Multi-AZ = synchronous replication = automatic failover in 1-2 mins, (5) Read Replica in AZ-C for read scaling, (6) Security Groups at instance level + NACLs at subnet level.
Kubernetes Architecture
Control plane components · worker node components · how kubectl commands flow through the cluster
🧠 Control Plane (Master Node)
🌐 API Server
Front door of the cluster. ALL communication goes through here — kubectl, pipelines, internal components. Validates and processes REST requests.
🗄️ etcd
Distributed key-value store. Brain of the cluster. Stores ALL cluster state — configs, secrets, resource definitions, current status.
📅 Scheduler
Watches for unscheduled pods. Selects best node based on: resources, taints/tolerations, affinity rules, node conditions.
🔄 Controller Manager
Runs control loops. ReplicaSet controller maintains desired pod count. Node controller monitors node health. Job controller manages batch jobs.
☁️ Cloud Controller
Integrates with cloud provider APIs. Manages Load Balancers, Node lifecycle and Routes in AWS/GCP/Azure.
⚙️ Worker Nodes
Node 1 · m5.xlarge
🔗 kubelet
Agent on every node. Receives pod specs from API server. Ensures containers are running and healthy.
🔀 kube-proxy
Manages iptables/IPVS rules. Enables pod-to-pod and service-to-pod communication.
📦 Container Runtime
containerd / CRI-O. Actually runs the containers.
pod: api-1
pod: api-2
pod: worker-1
Node 2 · m5.xlarge
🔗 kubelet
Same as Node 1. Every node has a kubelet reporting to control plane.
🔀 kube-proxy
Each node has its own kube-proxy maintaining local network rules.
📦 Container Runtime
containerd / CRI-O. Pulls images, creates containers.
pod: frontend-1
pod: cache-1
📋 Scheduler Decision Flow
1. Filtering → remove nodes with insufficient CPU/Memory, taints without toleration, wrong affinity
2. Scoring → rank remaining nodes by available resources, spread policies, affinity score
3. Binding → pod assigned to highest scoring node
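The scheduler's filtering and scoring inputs live in the pod spec itself. A minimal sketch (names and labels are hypothetical) showing all three: resource requests and a toleration feed the filtering step, node affinity with a weight feeds scoring:

```yaml
# Hypothetical pod spec showing the fields the scheduler evaluates:
# resource requests (filtering), a toleration (filtering), and
# weighted node affinity (scoring).
apiVersion: v1
kind: Pod
metadata:
  name: api-1                    # hypothetical name
spec:
  containers:
    - name: api
      image: example/api:1.0     # hypothetical image
      resources:
        requests:
          cpu: "500m"            # nodes without 500m free CPU are filtered out
          memory: "256Mi"
  tolerations:
    - key: "dedicated"           # lets the pod land on nodes tainted dedicated=api
      operator: "Equal"
      value: "api"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100            # scoring: prefer SSD nodes, but don't require them
          preference:
            matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd"]
```

Note the difference: `requiredDuringScheduling...` rules act in the filtering phase, `preferredDuringScheduling...` rules only add to a node's score.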
Interview Tip: etcd = database of cluster, API Server = single entry point, Scheduler = decides WHERE pods run, Controller Manager = ensures desired state matches actual state, kubelet = ensures pods RUN on each node, kube-proxy = handles NETWORKING between pods.
Complete CI/CD Pipeline
Code commit → build → test → scan → deploy · with tools at each stage · from developer to production
1
💾 Code Commit & PR
GitHub GitLab
Developer pushes feature branch → opens Pull Request → pipeline triggers automatically on PR creation. Branch protection rules prevent direct commits to main.
2
🔍 Static Code Analysis
SonarQube ESLint
Quality Gates: code coverage ≥ 80%, zero critical issues, duplication < 3%. If gate fails → pipeline stops, developer notified. No broken code proceeds.
3
🧪 Unit & Integration Tests
JUnit pytest
Run all unit tests and integration tests. Parallel test execution to reduce pipeline time. Test results published as artifacts for review.
4
🔒 Security Scanning (SAST + IaC)
OWASP ZAP Terrascan
SAST catches SQL injection, XSS and broken-auth patterns in source code; OWASP ZAP adds DAST against a running test instance. For Terraform code, Checkov/Terrascan check for open security groups, unencrypted storage, IAM misconfigs.
5
🐳 Build & Container Scan
Docker Trivy Aqua
Multi-stage Docker build (slim final image). Trivy/Aqua scans image for CVEs → --exit-code 1 fails pipeline on CRITICAL findings. Sign image for Binary Authorization.
6
📤 Push to Registry
ECR GCR
Tag: app:1.3.0-build-42-abc123f · Push verified + signed image to ECR. Update Helm chart values.yaml with new image tag. Commit back to GitOps repo.
7
🚀 Deploy via ArgoCD
ArgoCD Helm
ArgoCD detects new image tag in Git → triggers Helm upgrade on EKS → Rolling update / Canary strategy → readiness/liveness probes validate. Auto-rollback on failure.
8
✅ Manual Gate → Production
Approval
Pipeline pauses for senior engineer review → approves Terraform plan / deployment plan → ArgoCD syncs to production cluster. Full audit trail in Git.
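The CI half of the pipeline above (test → IaC scan → build → image scan → push) can be sketched as a GitHub Actions workflow. Step names, the `infra/` path, and the `ECR_REPO` secret are assumptions, not a definitive setup:

```yaml
# Hypothetical GitHub Actions workflow covering stages 3-6 above.
name: ci
on:
  pull_request:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Unit tests
        run: pytest --maxfail=1 -q           # stage 3: fail fast on any test error
      - name: IaC scan
        run: checkov -d infra/               # stage 4: Terraform misconfig checks
      - name: Build image
        run: docker build -t app:${{ github.sha }} .
      - name: Image scan
        run: trivy image --exit-code 1 --severity CRITICAL app:${{ github.sha }}
      - name: Push to registry               # stage 6: only a scanned image is pushed
        env:
          ECR_REPO: ${{ secrets.ECR_REPO }}  # assumed registry URL stored as a secret
        run: |
          docker tag app:${{ github.sha }} "$ECR_REPO:${{ github.sha }}"
          docker push "$ECR_REPO:${{ github.sha }}"
```

Ordering matters in interviews: the Trivy step runs with `--exit-code 1` before the push, so an image with CRITICAL CVEs never reaches the registry.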
Interview Tip: Mention BOTH CI (build/test/scan = GitHub Actions / GitLab CI / Jenkins) and CD (deploy = ArgoCD) as separate concerns. Key terms: Quality Gate, SAST, image signing, binary authorization, GitOps, rolling update, canary, manual approval gate.
Terraform Directory Structure
Multi-environment modular structure · best practices · remote state · workspace strategy
terraform/
modules/ # reusable components
vpc/
main.tf variables.tf outputs.tf
eks/
main.tf variables.tf outputs.tf
rds/
main.tf variables.tf outputs.tf
security/
main.tf variables.tf outputs.tf
 
environments/ # env-specific configs
dev/
main.tf backend.tf
terraform.tfvars # t3.medium
stage/
main.tf backend.tf
terraform.tfvars # t3.large
prod/
main.tf backend.tf
terraform.tfvars # m5.xlarge
 
landing-zone/ # multi-account
logging-account/
security-account/
network-account/
workload-accounts/
🔒 Remote Backend Config
terraform {
  backend "s3" {
    bucket         = "tf-state-prod"
    key            = "prod/eks/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-lock"
  }
}
🔄 Module Call Pattern
module "eks" {
  source       = "../../modules/eks"
  cluster_name = "prod-cluster"
  node_type    = var.node_instance_type
  min_nodes    = 3
  max_nodes    = 10
}
🛡️ Prevent Manual Deletion
1. Tag all TF resources: ManagedBy = "terraform"
2. IAM Deny policy with condition:
  → Effect: Deny
  → Action: ec2:TerminateInstances
  → Condition: StringEquals ec2:ResourceTag/ManagedBy = terraform
3. Attach Deny policy to all IAM users
4. Explicit Deny always overrides Allow
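The Deny statement above, sketched as an IAM policy document in CloudFormation-style YAML (the `Sid` and the exact attachment mechanism are illustrative):

```yaml
# YAML form of the JSON IAM policy: deny manual termination of any
# EC2 instance tagged ManagedBy=terraform. Explicit Deny overrides Allow.
PolicyDocument:
  Version: "2012-10-17"
  Statement:
    - Sid: DenyManualTerminationOfTerraformResources
      Effect: Deny
      Action: ec2:TerminateInstances
      Resource: "*"
      Condition:
        StringEquals:
          ec2:ResourceTag/ManagedBy: terraform   # only Terraform-tagged instances
```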
State Locking
S3 stores the state file. DynamoDB writes a lock entry when apply runs; a second engineer gets "Error acquiring the state lock" until the first finishes.
Drift Detection
terraform plan -detailed-exitcode detects drift (exit code 2 = changes present). terraform import brings manually created resources under Terraform management. terraform apply -refresh-only (the modern replacement for terraform refresh) syncs state without changing infrastructure.
Safe Rename
terraform state mv old_name new_name renames resource in state only. No infrastructure change. Prevents accidental delete+recreate.
Interview Tip: Always mention 5 Terraform best practices: (1) modular structure, (2) remote backend with versioning, (3) state locking with DynamoDB, (4) terraform plan before apply, (5) secrets in Vault/Secrets Manager never in .tf files.
DevOps Security Layers
Defense in depth · security at every layer from code to runtime · DevSecOps approach
💻
Layer 1 — Code Security
Security starts at the developer's machine. Secrets should never enter the codebase. SAST catches vulnerabilities before code is committed.
SonarQube git-secrets pre-commit hooks OWASP ZAP (DAST)
🐳
Layer 2 — Container Security
Scan images for CVEs before they reach any cluster. Only signed images allowed to run. Minimal base images reduce attack surface.
Trivy (--exit-code 1) Aqua Security Binary Authorization Cosign (image signing)
☸️
Layer 3 — Kubernetes Security
Control what runs in the cluster, who can access what, and how pods communicate with each other.
RBAC NetworkPolicy OPA Gatekeeper Pod Security Admission IRSA (no access keys) Istio mTLS
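Of the tools listed, NetworkPolicy is the one interviewers most often ask to see. A minimal sketch (labels and namespace are hypothetical): only frontend pods may reach backend pods on port 8080, all other ingress is denied.

```yaml
# Minimal NetworkPolicy: selects backend pods and allows ingress
# only from pods labeled app=frontend, only on TCP 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: prod                # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: backend               # policy applies to backend pods
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend      # only frontend pods are allowed in
      ports:
        - protocol: TCP
          port: 8080
```

Worth mentioning: once a pod is selected by any NetworkPolicy, all traffic not explicitly allowed is dropped — and enforcement requires a CNI that supports it (Calico, Cilium).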
☁️
Layer 4 — Cloud / Network Security
AWS-level protection. Control traffic at subnet level, protect against DDoS, manage secrets centrally.
Security Groups NACLs WAF AWS Shield Secrets Manager IAM least privilege VPC Flow Logs
🔍
Layer 5 — Audit & Compliance
Continuous monitoring for threats, configuration drift and compliance violations. Everything is logged and audited.
CloudTrail GuardDuty AWS Config Falco (runtime) Prisma Cloud Security Hub
Interview Tip: Always answer security questions in layers — code → container → Kubernetes → cloud → audit. Key phrase: "Defense in Depth". Never hardcode secrets. IRSA for EKS pod AWS access (no access keys). Explicit Deny in IAM always overrides Allow.
GitOps Flow with ArgoCD
Git as single source of truth · continuous reconciliation · drift detection · declarative deployments
👨‍💻
Developer
Pushes code to feature branch, opens PR
🔄
CI Pipeline
Build, test, scan, push image to ECR
📦
Git Repo
Update Helm values.yaml with new image tag
🔁
ArgoCD
Detects Git change, syncs to cluster
☸️
EKS Cluster
Rolling update deployed, health checks pass
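The ArgoCD step of the flow is declared as an Application resource. A hedged sketch — repo URL, path and names are assumptions:

```yaml
# Hypothetical ArgoCD Application: links a path in the GitOps repo
# to a target cluster/namespace and enables self-healing sync.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-repo.git  # assumed repo
    targetRevision: main
    path: helm/my-app              # directory holding the Helm chart/values
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated:
      prune: true                  # delete resources removed from Git
      selfHeal: true               # revert manual drift back to Git state
```

`selfHeal: true` is what turns drift detection into drift correction — a manual `kubectl edit` is reverted automatically.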
🔄 GitOps vs Traditional CI/CD
❌ Traditional Push
Pipeline runs kubectl apply
Manual drift undetected
No single source of truth
Rollback = re-run pipeline
No auto-reconciliation
✅ GitOps Pull
ArgoCD pulls from Git
Drift detected + corrected
Git = single source of truth
Rollback = git revert
Continuous reconciliation
🎯 ArgoCD Key Concepts
A
Application
ArgoCD resource that links a Git repo path to a cluster/namespace target
S
Sync
Process of making cluster match the desired state in Git. Auto or manual.
D
Drift
When actual cluster state differs from Git. ArgoCD shows OutOfSync status.
H
Health
ArgoCD checks if deployed resources are healthy — pods running, ingress responding.
Interview Tip: GitOps = Git is the source of truth. ArgoCD = the reconciler that keeps cluster in sync with Git. Helm = templating for K8s manifests. The key benefit: if cluster is destroyed, ArgoCD can recreate everything from Git in minutes. Rollback = git revert commit.
Disaster Recovery Strategies
RTO vs RPO tradeoffs · cold vs warm vs hot standby · cost vs availability spectrum
RTO = Recovery Time Objective (how fast you recover)  |  RPO = Recovery Point Objective (how much data you lose)
💰 Low Cost ────────────────→ 💰💰💰 High Cost
TIER 1
🧊 Cold Standby
RTO: Hours
RPO: Hours (last backup)
→ Backups stored in S3 + Glacier
→ AMIs / snapshots ready
→ Infrastructure defined in Terraform
→ Spin up ONLY when disaster occurs
→ Nothing running = zero idle cost
→ Acceptable for non-critical services
Cost: $ (lowest)
TIER 2
🌡️ Warm Standby
RTO: Minutes (5-30 min)
RPO: Seconds (near real-time)
→ Scaled-down infra running in DR region
→ RDS Multi-AZ + Read Replicas
→ DynamoDB Global Tables
→ S3 Cross-Region Replication
→ Route 53 health check failover
→ Scale up when primary fails
Cost: $$ (moderate)
TIER 3
🔥 Hot Standby
RTO: Seconds (near zero)
RPO: Zero (no data loss)
→ Full infra running in BOTH regions
→ Active-active or active-passive
→ Route 53 health checks + failover
→ Instant traffic rerouting
→ DynamoDB Global Tables (sync)
→ For payment / order / auth services
Cost: $$$ (highest)
💡 Hybrid Strategy (Cost-Sensitive E-Commerce)
Critical Services → Hot/Warm Standby:
→ Order creation, Payments, User auth
→ Cannot afford even 1 minute of downtime
→ Revenue directly impacted
Non-Critical Services → Cold Standby:
→ Reviews, Feedback, Complaints
→ Users can tolerate some downtime
→ Saves significant DR infrastructure cost
Interview Tip: Always ask "what is the RTO and RPO requirement?" before recommending a DR strategy. Then mention cost. The hybrid approach (hot standby for critical + cold for non-critical) is the best answer for cost-sensitive customers. Route 53 health checks are the routing mechanism.
Observability & Monitoring Stack
Metrics · logs · traces · alerts · the three pillars of observability with tools
📊
Prometheus
Metrics Collection
Pull-based scraping kube-state-metrics node-exporter ServiceMonitor CRD Alertmanager PromQL queries
📈
Grafana
Visualization
Custom dashboards Namespace-wise CPU/RAM API latency histogram Error rate panels Import: dashboard 315, 6417 Slack/email alerts
🔍
AWS X-Ray
Distributed Tracing
Request tracing Service map Latency breakdown API bottlenecks Cross-service traces Error root cause
🐕
Datadog
Full Observability
APM + tracing Log management Infrastructure map DaemonSet agent Custom metrics Anomaly detection
📋 3 Pillars of Observability
📊
Metrics
Numeric measurements over time. CPU %, memory usage, request rate, error rate, latency p99. → Prometheus + Grafana
📝
Logs
Timestamped text records of events. Application errors, access logs, audit logs. → Fluent Bit + Loki / Datadog / CloudWatch
🔍
Traces
End-to-end request flow across microservices. Shows WHERE latency comes from. → AWS X-Ray / OpenTelemetry / Jaeger
⚙️ ServiceMonitor Pattern
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s

→ Prometheus Operator auto-discovers
→ No manual target config needed
→ App just needs /metrics endpoint
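A natural companion to the ServiceMonitor is a PrometheusRule feeding Alertmanager. A sketch assuming the app exposes a standard `http_requests_total` counter (metric and label names are assumptions about what /metrics serves):

```yaml
# Hypothetical PrometheusRule: fire when my-app's 5xx error rate
# exceeds 5% of all requests for 10 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{app="my-app",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="my-app"}[5m])) > 0.05
          for: 10m                 # must hold for 10m before firing
          labels:
            severity: critical     # Alertmanager routes on this label
          annotations:
            summary: "my-app 5xx error rate above 5%"
```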
Interview Tip: The 3 pillars = Metrics + Logs + Traces. kube-state-metrics = K8s object states (pod running/pending). node-exporter = hardware metrics (CPU/RAM/disk). ServiceMonitor = how Prometheus auto-discovers new apps. Alertmanager routes to Slack/email. CloudWatch for AWS native, Prometheus+Grafana for K8s metrics.