This section presents common enterprise scenarios formatted as Jira tickets. Each case includes the business requirement, technical constraints, and a senior-level implementation strategy.
Our current Node.js production image is 1.2GB. This slows down deployments and autoscaling. We need to reduce it to under 200MB.
- Final image size < 200MB.
- Must not contain source code (only build artifacts).
- Must not contain devDependencies.
👨‍💻 Senior DevOps Solution
Strategy: Use Multi-Stage Builds. Stage 1 installs all dependencies and builds the app. Stage 2 copies only the necessary artifacts to a lightweight Alpine image.
# Stage 1: Builder
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
# Install ALL dependencies (including dev) for building
RUN npm ci
COPY . .
RUN npm run build
# Stage 2: Runner
FROM node:18-alpine
WORKDIR /app
ENV NODE_ENV=production
# Copy only package files to install prod dependencies
COPY package*.json ./
RUN npm ci --omit=dev
# Copy built artifacts from builder
COPY --from=builder /app/dist ./dist
# Run as non-root user for security
USER node
CMD ["node", "dist/main.min.js"]
Key Takeaway: Multi-stage builds separate the build environment (heavy) from the runtime environment (light), drastically reducing attack surface and size.
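Since the builder stage runs `COPY . .`, a `.dockerignore` keeps local artifacts out of the build context and out of the image layers. A minimal sketch, assuming a typical Node.js layout:
# .dockerignore
node_modules
dist
.git
.env
*.md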
During high load, our API sometimes freezes but the container stays "Running", so K8s doesn't restart it. Also, during startup, traffic is sent before the DB connection is ready, causing 500 errors.
- Restart container if application freezes (Deadlock).
- Do not send traffic until DB connection is established.
👨‍💻 Senior DevOps Solution
Strategy: Configure `livenessProbe` to restart dead containers and `readinessProbe` to control traffic flow.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
spec:
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      containers:
        - name: api
          image: my-api:v1
          # Liveness: restart if this fails (app is dead)
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          # Readiness: remove from load balancer if this fails (app is busy/starting)
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
Key Takeaway: Liveness keeps the app running; readiness ensures it receives traffic only when it can actually serve it.
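For these probes to mean different things, the application must implement them with different semantics. A minimal Express sketch; the `./db` module and its `ping()` method are hypothetical placeholders for your real dependency check:
const express = require("express");
const db = require("./db"); // hypothetical module exposing ping()
const app = express();
// Liveness: answers as long as the event loop is responsive.
// If the process deadlocks, this stops responding and the kubelet restarts the container.
app.get("/healthz", (req, res) => res.sendStatus(200));
// Readiness: 200 only once dependencies are reachable.
app.get("/ready", async (req, res) => {
  try {
    await db.ping();
    res.sendStatus(200);
  } catch (err) {
    res.sendStatus(503); // pod is removed from Service endpoints, not restarted
  }
});
app.listen(8080);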
Developers are manually building and pushing Docker images, which is error-prone. We need a GitHub Actions pipeline that runs tests and pushes to the registry only on the `main` branch.
- Run unit tests on every PR.
- Build and Push Docker image ONLY on merge to `main`.
- Tag image with commit SHA.
👨‍💻 Senior DevOps Solution
Strategy: Use GitHub Actions with conditional jobs.
name: CI/CD Pipeline
on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Tests
        run: npm ci && npm test
  build-and-push:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKER_USER }}
          password: ${{ secrets.DOCKER_TOKEN }}
      - name: Build and Push
        run: |
          # Image must be namespaced with the Docker Hub user for the push to succeed
          docker build -t ${{ secrets.DOCKER_USER }}/myapp:${{ github.sha }} .
          docker push ${{ secrets.DOCKER_USER }}/myapp:${{ github.sha }}
Key Takeaway: `needs: test` ensures we never push a broken image. `if: github.ref == ...` ensures we only deploy from the stable branch.
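A common refinement is to replace the raw `docker build`/`docker push` step with the official `docker/build-push-action`, which brings Buildx and layer caching; the image name below follows the same assumption as above:
      - name: Build and Push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ${{ secrets.DOCKER_USER }}/myapp:${{ github.sha }}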
We need a reproducible network environment in AWS. Manually clicking in the console is not scalable. We need a Terraform module to provision a VPC with public and private subnets.
- Create 1 VPC (10.0.0.0/16).
- Create 2 Public Subnets (for Load Balancers).
- Create 2 Private Subnets (for App Servers).
- Configure Internet Gateway and NAT Gateway.
👨‍💻 Senior DevOps Solution
Strategy: Use the community-maintained terraform-aws-modules VPC module for best practices and minimal code.
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
name = "my-vpc"
cidr = "10.0.0.0/16"
azs = ["us-east-1a", "us-east-1b"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24"]
enable_nat_gateway = true
enable_vpn_gateway = false
tags = {
Terraform = "true"
Environment = "dev"
}
}
Key Takeaway: Don't reinvent the wheel. Community modules speed up development and encode established best practices.
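Downstream resources should consume the module's outputs rather than hard-coded IDs. A short sketch using the module's documented `vpc_id` and `private_subnets` outputs:
# Reference the VPC created by the module instead of hard-coding IDs
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = module.vpc.vpc_id
}

output "private_subnet_ids" {
  value = module.vpc.private_subnets
}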
We have 50 web servers. Updating `nginx.conf` manually is impossible. We need an Ansible playbook to push the new config and reload Nginx without downtime.
- Copy `nginx.conf` to all web servers.
- Validate the configuration before reloading.
- Reload Nginx service only if config changed.
👨‍💻 Senior DevOps Solution
Strategy: Use the Ansible `template` module and `handlers` for efficient updates.
---
- name: Update Nginx Configuration
  hosts: webservers
  become: yes
  tasks:
    - name: Copy Nginx config
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        validate: 'nginx -t -c %s'
      notify: Reload Nginx
  handlers:
    - name: Reload Nginx
      service:
        name: nginx
        state: reloaded
Key Takeaway: The `validate` parameter prevents breaking the server with a bad config. `handlers` ensure the service only reloads when necessary.
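Before touching all 50 servers, a dry run shows exactly what would change; the inventory and playbook file names here are illustrative:
# Dry run: report the rendered config diff without changing anything
ansible-playbook -i inventory.ini nginx.yml --check --diff
# Real run (add `serial: 5` to the play to roll out 5 hosts per batch)
ansible-playbook -i inventory.ini nginx.yml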
We had an outage because a server was stuck at 100% CPU for 2 hours without anyone noticing. We need a Prometheus alert rule that notifies us via Slack.
- Trigger alert if CPU usage > 80% for more than 5 minutes.
- Send notification to #ops-alerts channel.
👨‍💻 Senior DevOps Solution
Strategy: Define a Prometheus alerting rule and route it to Slack through Alertmanager.
groups:
  - name: host-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }})"
Key Takeaway: The `for: 5m` clause prevents flapping alerts caused by temporary spikes.
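The rule above only fires the alert; delivering it to #ops-alerts is Alertmanager's job. A minimal receiver sketch, with the Slack webhook URL left as a placeholder:
route:
  receiver: slack-ops
receivers:
  - name: slack-ops
    slack_configs:
      - channel: "#ops-alerts"
        api_url: "https://hooks.slack.com/services/..." # placeholder webhook URL
        send_resolved: true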