3-Tier AWS Architecture
Production-ready VPC layout across 3 AZs · public + private subnets · Multi-AZ database · ALBs at the web and app tiers
🌐 Internet → Route 53 (DNS) → CloudFront (CDN) → WAF + Shield
↓
VPC · 10.0.0.0/16
⚖️
frontend-alb · Internet-Facing · Port 443 HTTPS · spans all 3 AZs
↓
☁️ Availability Zone A
public-web-subnet-a · 10.0.0.0/20
NAT Gateway
private-web-subnet-a · 10.0.48.0/20
frontend-server
private-app-subnet-a · 10.0.96.0/20
backend-server
private-db-subnet-a · 10.0.144.0/20
RDS Primary ★
☁️ Availability Zone B
public-web-subnet-b · 10.0.16.0/20
NAT Gateway
private-web-subnet-b · 10.0.64.0/20
frontend-server
private-app-subnet-b · 10.0.112.0/20
backend-server
private-db-subnet-b · 10.0.160.0/20
RDS Standby ↔
☁️ Availability Zone C
public-web-subnet-c · 10.0.32.0/20
private-web-subnet-c · 10.0.80.0/20
frontend-server
private-app-subnet-c · 10.0.128.0/20
backend-server
private-db-subnet-c · 10.0.176.0/20
RDS Read Replica
⚖️
backend-alb · Internal · Port 8080 · spans all 3 AZs · NOT internet-facing
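The NAT pattern in the diagram — NAT Gateway placed in a public subnet, private subnets routing 0.0.0.0/0 through it — can be sketched in CloudFormation-style YAML. A hedged sketch only; resource names, the AZ, and references like `VPC` and `PrivateRouteTableA` are illustrative:

```yaml
# Illustrative CloudFormation fragment for AZ-A. Assumes a VPC resource
# (10.0.0.0/16) and a route table associated with the private subnets.
PublicSubnetA:
  Type: AWS::EC2::Subnet
  Properties:
    VpcId: !Ref VPC                  # the 10.0.0.0/16 VPC from the diagram
    CidrBlock: 10.0.0.0/20
    AvailabilityZone: us-east-1a     # example AZ
    MapPublicIpOnLaunch: true

NatEipA:
  Type: AWS::EC2::EIP
  Properties:
    Domain: vpc

NatGatewayA:
  Type: AWS::EC2::NatGateway
  Properties:
    SubnetId: !Ref PublicSubnetA     # NAT lives in the PUBLIC subnet
    AllocationId: !GetAtt NatEipA.AllocationId

PrivateRouteA:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref PrivateRouteTableA   # route table of the private subnets
    DestinationCidrBlock: 0.0.0.0/0
    NatGatewayId: !Ref NatGatewayA          # outbound-only internet access
```

This is exactly interview point (1): private servers reach the internet outbound via the NAT, but nothing can initiate a connection inbound to them.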
Interview Tip: Key points to mention — (1) NAT Gateway in public subnet for private server outbound traffic, (2) ALBs in public subnets route to private servers, (3) backend-alb is internal-only, (4) RDS Multi-AZ = synchronous replication = automatic failover in 1-2 mins, (5) Read Replica in AZ-C for read scaling, (6) Security Groups at instance level + NACLs at subnet level.
Kubernetes Architecture
Control plane components · worker node components · how kubectl commands flow through the cluster
🧠 Control Plane (Master Node)
🌐 API Server
Front door of the cluster. ALL communication goes through here — kubectl, pipelines, internal components. Validates and processes REST requests.
🗄️ etcd
Distributed key-value store. Brain of the cluster. Stores ALL cluster state — configs, secrets, resource definitions, current status.
📅 Scheduler
Watches for unscheduled pods. Selects best node based on: resources, taints/tolerations, affinity rules, node conditions.
🔄 Controller Manager
Runs control loops. ReplicaSet controller maintains desired pod count. Node controller monitors node health. Job controller manages batch jobs.
☁️ Cloud Controller
Integrates with cloud provider APIs. Manages Load Balancers, Node lifecycle and Routes in AWS/GCP/Azure.
⚙️ Worker Nodes
Node 1 · m5.xlarge
🔗 kubelet
Agent on every node. Receives pod specs from API server. Ensures containers are running and healthy.
🔀 kube-proxy
Manages iptables/IPVS rules. Enables pod-to-pod and service-to-pod communication.
📦 Container Runtime
containerd / CRI-O. Actually runs the containers.
pod: api-1
pod: api-2
pod: worker-1
Node 2 · m5.xlarge
🔗 kubelet
Same as Node 1. Every node has a kubelet reporting to control plane.
🔀 kube-proxy
Each node has its own kube-proxy maintaining local network rules.
📦 Container Runtime
containerd / CRI-O. Pulls images, creates containers.
pod: frontend-1
pod: cache-1
📋 Scheduler Decision Flow
1. Filtering → remove nodes with insufficient CPU/Memory, taints without toleration, wrong affinity
2. Scoring → rank remaining nodes by available resources, spread policies, affinity score
3. Binding → pod assigned to highest scoring node
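The filter and score inputs above map directly to fields in the pod spec. A sketch, with illustrative names and an example placeholder image:

```yaml
# Illustrative pod spec showing the fields the scheduler reads.
apiVersion: v1
kind: Pod
metadata:
  name: api-1                         # illustrative name
spec:
  containers:
  - name: api
    image: example/api:1.0            # placeholder image
    resources:
      requests:                       # Filtering: nodes lacking this capacity are removed
        cpu: "500m"
        memory: 512Mi
  tolerations:                        # Filtering: permits scheduling onto matching tainted nodes
  - key: dedicated
    operator: Equal
    value: api
    effect: NoSchedule
  affinity:
    nodeAffinity:                     # Scoring: matching nodes rank higher
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values: ["m5.xlarge"]
```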
Interview Tip: etcd = database of cluster, API Server = single entry point, Scheduler = decides WHERE pods run, Controller Manager = ensures desired state matches actual state, kubelet = ensures pods RUN on each node, kube-proxy = handles NETWORKING between pods.
Complete CI/CD Pipeline
Code commit → build → test → scan → deploy · with tools at each stage · from developer to production
1
💾 Code Commit & PR
GitHub
GitLab
Developer pushes feature branch → opens Pull Request → pipeline triggers automatically on PR creation. Branch protection rules prevent direct commits to main.
2
🔍 Static Code Analysis
SonarQube
ESLint
Quality Gates: code coverage ≥ 80%, zero critical issues, duplication < 3%. If gate fails → pipeline stops, developer notified. No broken code proceeds.
3
🧪 Unit & Integration Tests
JUnit
pytest
Run all unit tests and integration tests. Parallel test execution to reduce pipeline time. Test results published as artifacts for review.
4
🔒 Security Scanning (SAST)
Checkov
Terrascan
SAST scan for SQL injection, XSS, broken auth. For Terraform code: Checkov/Terrascan checks for open security groups, unencrypted storage, IAM misconfigs.
5
🐳 Build & Container Scan
Docker
Trivy
Aqua
Multi-stage Docker build (slim final image). Trivy/Aqua scans image for CVEs → --exit-code 1 fails pipeline on CRITICAL findings. Sign image for Binary Authorization.
6
📤 Push to Registry
ECR
GCR
Tag: app:1.3.0-build-42-abc123f · Push verified + signed image to ECR. Update Helm chart values.yaml with new image tag. Commit back to GitOps repo.
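The GitOps commit in this step is typically a one-line image-tag bump in the chart's values.yaml. A sketch — the file path and ECR repository URL are illustrative:

```yaml
# charts/app/values.yaml (illustrative path) — CI rewrites only the tag
image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/app  # example ECR repo
  tag: 1.3.0-build-42-abc123f   # tag produced by the build stage above
```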
7
🚀 Deploy via ArgoCD
ArgoCD
Helm
ArgoCD detects new image tag in Git → triggers Helm upgrade on EKS → Rolling update / Canary strategy → readiness/liveness probes validate. Auto-rollback on failure.
8
✅ Manual Gate → Production
Approval
Pipeline pauses for senior engineer review → approves Terraform plan / deployment plan → ArgoCD syncs to production cluster. Full audit trail in Git.
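Stages 1–6 can be sketched as one GitHub Actions workflow. A hedged sketch under assumptions: `make test`, the image name, and action versions are illustrative, not a definitive pipeline:

```yaml
# Illustrative CI workflow: trigger on PR, test, build, scan.
name: ci
on:
  pull_request:
    branches: [main]
jobs:
  build-test-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Unit tests                             # stage 3
        run: make test                               # assumed build target
      - name: Build image                            # stage 5
        run: docker build -t app:${{ github.sha }} .
      - name: Scan image with Trivy                  # stage 5
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: app:${{ github.sha }}
          exit-code: "1"                             # fail pipeline on findings
          severity: CRITICAL
```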
Interview Tip: Mention BOTH CI (build/test/scan = GitHub Actions / GitLab CI / Jenkins) and CD (deploy = ArgoCD) as separate concerns. Key terms: Quality Gate, SAST, image signing, binary authorization, GitOps, rolling update, canary, manual approval gate.
Terraform Directory Structure
Multi-environment modular structure · best practices · remote state · workspace strategy
terraform/
  modules/                  # reusable components
    vpc/
      main.tf  variables.tf  outputs.tf
    eks/
      main.tf  variables.tf  outputs.tf
    rds/
      main.tf  variables.tf  outputs.tf
    security/
      main.tf  variables.tf  outputs.tf
  environments/             # env-specific configs
    dev/
      main.tf  backend.tf
      terraform.tfvars      # t3.medium
    stage/
      main.tf  backend.tf
      terraform.tfvars      # t3.large
    prod/
      main.tf  backend.tf
      terraform.tfvars      # m5.xlarge
  landing-zone/             # multi-account
    logging-account/
    security-account/
    network-account/
    workload-accounts/
🔒 Remote Backend Config
terraform {
  backend "s3" {
    bucket         = "tf-state-prod"
    key            = "prod/eks/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-lock"
  }
}
🔄 Module Call Pattern
module "eks" {
  source       = "../../modules/eks"
  cluster_name = "prod-cluster"
  node_type    = var.node_instance_type
  min_nodes    = 3
  max_nodes    = 10
}
🛡️ Prevent Manual Deletion
1. Tag all TF resources: ManagedBy = "terraform"
2. IAM Deny policy with condition:
→ Effect: Deny
→ Action: ec2:TerminateInstances
→ Condition: aws:ResourceTag/ManagedBy = "terraform"
3. Attach the Deny policy to human IAM users/roles (not the Terraform execution role, or Terraform itself can no longer destroy)
4. Explicit Deny always overrides Allow
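Step 2 can be sketched as an IAM managed policy in CloudFormation-style YAML. The logical name and policy name are illustrative:

```yaml
# Illustrative: deny manual termination of Terraform-managed instances.
DenyTerraformManagedTermination:
  Type: AWS::IAM::ManagedPolicy
  Properties:
    ManagedPolicyName: deny-tf-managed-termination   # example name
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Deny                    # explicit Deny overrides any Allow
          Action: ec2:TerminateInstances
          Resource: "*"
          Condition:
            StringEquals:
              aws:ResourceTag/ManagedBy: terraform   # matches the TF tag
```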
State Locking
S3 stores the state file. DynamoDB creates a lock entry when apply runs. A second engineer gets "Error acquiring the state lock" until the first finishes.
Drift Detection
terraform plan -detailed-exitcode detects drift (exit code 2 = changes present). terraform import pulls manually created resources into state. terraform apply -refresh-only (the modern replacement for terraform refresh) syncs state without changing infrastructure.
Safe Rename
terraform state mv old_name new_name renames resource in state only. No infrastructure change. Prevents accidental delete+recreate.
Interview Tip: Always mention 5 Terraform best practices: (1) modular structure, (2) remote backend with versioning, (3) state locking with DynamoDB, (4) terraform plan before apply, (5) secrets in Vault/Secrets Manager never in .tf files.
DevOps Security Layers
Defense in depth · security at every layer from code to runtime · DevSecOps approach
Layer 1 — Code Security
Security starts at the developer's machine. Secrets must never enter the codebase — git-secrets and pre-commit hooks block them before commit; SAST catches vulnerabilities before code reaches any environment.
SonarQube
git-secrets
pre-commit hooks
OWASP ZAP (DAST)
Layer 2 — Container Security
Scan images for CVEs before they reach any cluster. Only signed images allowed to run. Minimal base images reduce attack surface.
Trivy (--exit-code 1)
Aqua Security
Binary Authorization
Cosign (image signing)
Layer 3 — Kubernetes Security
Control what runs in the cluster, who can access what, and how pods communicate with each other.
RBAC
NetworkPolicy
OPA Gatekeeper
Pod Security Admission
IRSA (no access keys)
Istio mTLS
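As a concrete Layer-3 example, the standard starting point for NetworkPolicy is default-deny ingress. The namespace name is illustrative:

```yaml
# Deny all ingress to every pod in the namespace; traffic must then be
# explicitly allowed by additional NetworkPolicies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod            # illustrative namespace
spec:
  podSelector: {}            # empty selector = ALL pods in the namespace
  policyTypes:
  - Ingress                  # no ingress rules listed = nothing allowed in
```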
Layer 4 — Cloud / Network Security
AWS-level protection. Control traffic at subnet level, protect against DDoS, manage secrets centrally.
Security Groups
NACLs
WAF
AWS Shield
Secrets Manager
IAM least privilege
VPC Flow Logs
Layer 5 — Audit & Compliance
Continuous monitoring for threats, configuration drift and compliance violations. Everything is logged and audited.
CloudTrail
GuardDuty
AWS Config
Falco (runtime)
Prisma Cloud
Security Hub
Interview Tip: Always answer security questions in layers — code → container → Kubernetes → cloud → audit. Key phrase: "Defense in Depth". Never hardcode secrets. IRSA for EKS pod AWS access (no access keys). Explicit Deny in IAM always overrides Allow.
GitOps Flow with ArgoCD
Git as single source of truth · continuous reconciliation · drift detection · declarative deployments
Developer
Pushes code to feature branch, opens PR
→
CI Pipeline
Build, test, scan, push image to ECR
→
Git Repo
Update Helm values.yaml with new image tag
→
ArgoCD
Detects Git change, syncs to cluster
→
EKS Cluster
Rolling update deployed, health checks pass
🔄 GitOps vs Traditional CI/CD
Traditional: pipeline pushes changes to the cluster (kubectl apply from CI) · cluster can drift unnoticed · rollback = re-run an old pipeline.
GitOps: ArgoCD pulls desired state from Git · continuous reconciliation detects and corrects drift · rollback = git revert.
🎯 ArgoCD Key Concepts
A
Application
ArgoCD resource that links a Git repo path to a cluster/namespace target
S
Sync
Process of making cluster match the desired state in Git. Auto or manual.
D
Drift
When actual cluster state differs from Git. ArgoCD shows OutOfSync status.
H
Health
ArgoCD checks if deployed resources are healthy — pods running, ingress responding.
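The four concepts above come together in a single Application manifest. A sketch — repo URL, chart path, and namespaces are illustrative:

```yaml
# Illustrative ArgoCD Application linking a Git path to a cluster target.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-repo.git  # example repo
    targetRevision: main
    path: charts/app             # Helm chart directory watched for changes
  destination:
    server: https://kubernetes.default.svc   # in-cluster target
    namespace: prod
  syncPolicy:
    automated:
      prune: true                # delete resources removed from Git
      selfHeal: true             # revert manual drift (OutOfSync → Synced)
```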
Interview Tip: GitOps = Git is the source of truth. ArgoCD = the reconciler that keeps cluster in sync with Git. Helm = templating for K8s manifests. The key benefit: if cluster is destroyed, ArgoCD can recreate everything from Git in minutes. Rollback = git revert commit.
Disaster Recovery Strategies
RTO vs RPO tradeoffs · cold vs warm vs hot standby · cost vs availability spectrum
RTO = Recovery Time Objective (how fast you recover) | RPO = Recovery Point Objective (how much data you lose)
💰 Low Cost
────────────────→
💰💰💰 High Cost
TIER 1
🧊 Cold Standby
RTO: Hours
RPO: Hours (last backup)
→ Backups stored in S3 + Glacier
→ AMIs / snapshots ready
→ Infrastructure defined in Terraform
→ Spin up ONLY when disaster occurs
→ Nothing running = zero idle cost
→ Acceptable for non-critical services
Cost: $ (lowest)
TIER 2
🌡️ Warm Standby
RTO: Minutes (5-30 min)
RPO: Seconds (near real-time)
→ Scaled-down infra running in DR region
→ RDS Multi-AZ + Read Replicas
→ DynamoDB Global Tables
→ S3 Cross-Region Replication
→ Route 53 health check failover
→ Scale up when primary fails
Cost: $$ (moderate)
TIER 3
🔥 Hot Standby
RTO: Seconds (near zero)
RPO: Zero (no data loss)
→ Full infra running in BOTH regions
→ Active-active or active-passive
→ Route 53 health checks + failover
→ Instant traffic rerouting
→ DynamoDB Global Tables (sync)
→ For payment / order / auth services
Cost: $$$ (highest)
💡 Hybrid Strategy (Cost-Sensitive E-Commerce)
Critical Services → Hot/Warm Standby:
→ Order creation, Payments, User auth
→ Cannot afford even 1 minute of downtime
→ Revenue directly impacted
Non-Critical Services → Cold Standby:
→ Reviews, Feedback, Complaints
→ Users can tolerate some downtime
→ Saves significant DR infrastructure cost
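The Route 53 failover mechanism all three tiers rely on can be sketched in CloudFormation-style YAML. Domain names, the health-check path, and record targets are illustrative:

```yaml
# Primary record answers while its health check passes; Route 53
# automatically serves the secondary (DR region) record when it fails.
PrimaryHealthCheck:
  Type: AWS::Route53::HealthCheck
  Properties:
    HealthCheckConfig:
      Type: HTTPS
      FullyQualifiedDomainName: primary-alb.example.com  # example endpoint
      ResourcePath: /healthz                             # assumed health path
      FailureThreshold: 3

PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: app.example.com.
    Type: CNAME
    TTL: "60"
    SetIdentifier: primary
    Failover: PRIMARY
    HealthCheckId: !Ref PrimaryHealthCheck
    ResourceRecords: [primary-alb.example.com]

SecondaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: app.example.com.
    Type: CNAME
    TTL: "60"
    SetIdentifier: secondary
    Failover: SECONDARY              # used only when PRIMARY is unhealthy
    ResourceRecords: [dr-alb.example.com]
```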
Interview Tip: Always ask "what is the RTO and RPO requirement?" before recommending a DR strategy. Then mention cost. The hybrid approach (hot standby for critical + cold for non-critical) is the best answer for cost-sensitive customers. Route 53 health checks are the routing mechanism.
Observability & Monitoring Stack
Metrics · logs · traces · alerts · the three pillars of observability with tools
Prometheus
Metrics Collection
Pull-based scraping
kube-state-metrics
node-exporter
ServiceMonitor CRD
Alertmanager
PromQL queries
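Alertmanager routing starts from alert rules defined as PrometheusRule resources (Prometheus Operator). A minimal sketch — the metric name, threshold, and labels are illustrative:

```yaml
# Illustrative PrometheusRule: alert when the 5xx share of requests
# exceeds 5% for 10 minutes. Assumes an http_requests_total metric.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts               # illustrative name
spec:
  groups:
  - name: app.rules
    rules:
    - alert: HighErrorRate
      expr: sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
      for: 10m
      labels:
        severity: critical       # Alertmanager routes on this label
      annotations:
        summary: "Error rate above 5% for 10 minutes"
```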
Grafana
Visualization
Custom dashboards
Namespace-wise CPU/RAM
API latency histogram
Error rate panels
Import: dashboard 315, 6417
Slack/email alerts
AWS X-Ray
Distributed Tracing
Request tracing
Service map
Latency breakdown
API bottlenecks
Cross-service traces
Error root cause
Datadog
Full Observability
APM + tracing
Log management
Infrastructure map
DaemonSet agent
Custom metrics
Anomaly detection
📋 3 Pillars of Observability
📊
Metrics
Numeric measurements over time. CPU %, memory usage, request rate, error rate, latency p99. → Prometheus + Grafana
📝
Logs
Timestamped text records of events. Application errors, access logs, audit logs. → Fluent Bit + Loki / Datadog / CloudWatch
🔍
Traces
End-to-end request flow across microservices. Shows WHERE latency comes from. → AWS X-Ray / OpenTelemetry / Jaeger
⚙️ ServiceMonitor Pattern
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
→ Prometheus Operator auto-discovers
→ No manual target config needed
→ App just needs /metrics endpoint
Interview Tip: The 3 pillars = Metrics + Logs + Traces. kube-state-metrics = K8s object states (pod running/pending). node-exporter = hardware metrics (CPU/RAM/disk). ServiceMonitor = how Prometheus auto-discovers new apps. Alertmanager routes to Slack/email. CloudWatch for AWS native, Prometheus+Grafana for K8s metrics.