This section presents common enterprise scenarios formatted as Jira tickets. Each case includes the business requirement, technical constraints, and a senior-level implementation strategy.
Our current Node.js production image is 1.2GB. This slows down deployments and autoscaling. We need to reduce it to under 200MB.
- Final image size < 200MB.
- Must not contain source code (only build artifacts).
- Must not contain devDependencies.
👨‍💻 Senior DevOps Solution
Strategy: Use Multi-Stage Builds. Stage 1 installs all dependencies and builds the app. Stage 2 copies only the necessary artifacts to a lightweight Alpine image.
# Stage 1: Builder
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
# Install ALL dependencies (including dev) for building
RUN npm ci
COPY . .
RUN npm run build
# Stage 2: Runner
FROM node:18-alpine
WORKDIR /app
ENV NODE_ENV=production
# Copy only package files to install prod dependencies
COPY package*.json ./
RUN npm ci --omit=dev
# Copy built artifacts from builder
COPY --from=builder /app/dist ./dist
# Run as non-root user for security
USER node
CMD ["node", "dist/main.min.js"]
Key Takeaway: Multi-stage builds separate the build environment (heavy) from the runtime environment (light), drastically reducing attack surface and size.
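Since the builder stage runs `COPY . .`, a `.dockerignore` keeps local artifacts out of the build context and out of the image layers. A minimal sketch, assuming a typical Node.js layout:
# .dockerignore
node_modules
dist
.git
.env
*.md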
During high load, our API sometimes freezes but the container stays "Running", so K8s doesn't restart it. Also, during startup, traffic is sent before the DB connection is ready, causing 500 errors.
- Restart container if application freezes (Deadlock).
- Do not send traffic until DB connection is established.
👨‍💻 Senior DevOps Solution
Strategy: Configure `livenessProbe` to restart dead containers and `readinessProbe` to control traffic flow.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
spec:
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      containers:
        - name: api
          image: my-api:v1
          # Liveness: restart if this fails (app is dead)
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          # Readiness: remove from load balancer if this fails (app is busy/starting)
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
Key Takeaway: Liveness keeps the app running; readiness ensures it receives traffic only when it can actually serve it.
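For these probes to mean different things, the application must implement them with different semantics. A minimal Express sketch; the `./db` module and its `ping()` method are hypothetical placeholders for your real dependency check:
const express = require("express");
const db = require("./db"); // hypothetical module exposing ping()
const app = express();
// Liveness: answers as long as the event loop is responsive.
// If the process deadlocks, this stops responding and the kubelet restarts the container.
app.get("/healthz", (req, res) => res.sendStatus(200));
// Readiness: 200 only once dependencies are reachable.
app.get("/ready", async (req, res) => {
  try {
    await db.ping();
    res.sendStatus(200);
  } catch (err) {
    res.sendStatus(503); // pod is removed from Service endpoints, not restarted
  }
});
app.listen(8080);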
Developers are manually building and pushing Docker images, which is error-prone. We need a GitHub Actions pipeline that runs tests and pushes to the registry only on the `main` branch.
- Run unit tests on every PR.
- Build and Push Docker image ONLY on merge to `main`.
- Tag image with commit SHA.
👨‍💻 Senior DevOps Solution
Strategy: Use GitHub Actions with conditional jobs.
name: CI/CD Pipeline
on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Tests
        run: npm ci && npm test
  build-and-push:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKER_USER }}
          password: ${{ secrets.DOCKER_TOKEN }}
      - name: Build and Push
        run: |
          # Image must be namespaced with the Docker Hub user for the push to succeed
          docker build -t ${{ secrets.DOCKER_USER }}/myapp:${{ github.sha }} .
          docker push ${{ secrets.DOCKER_USER }}/myapp:${{ github.sha }}
Key Takeaway: `needs: test` ensures we never push a broken image. `if: github.ref == ...` ensures we only deploy from the stable branch.
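A common refinement is to replace the raw `docker build`/`docker push` step with the official `docker/build-push-action`, which brings Buildx and layer caching; the image name below follows the same assumption as above:
      - name: Build and Push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ${{ secrets.DOCKER_USER }}/myapp:${{ github.sha }}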
We need a reproducible network environment in AWS. Manually clicking in the console is not scalable. We need a Terraform module to provision a VPC with public and private subnets.
- Create 1 VPC (10.0.0.0/16).
- Create 2 Public Subnets (for Load Balancers).
- Create 2 Private Subnets (for App Servers).
- Configure Internet Gateway and NAT Gateway.
👨‍💻 Senior DevOps Solution
Strategy: Use the community-maintained terraform-aws-modules VPC module for best practices and minimal code.
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
name = "my-vpc"
cidr = "10.0.0.0/16"
azs = ["us-east-1a", "us-east-1b"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24"]
enable_nat_gateway = true
enable_vpn_gateway = false
tags = {
Terraform = "true"
Environment = "dev"
}
}
Key Takeaway: Don't reinvent the wheel. Community modules speed up development and encode established best practices.
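Downstream resources should consume the module's outputs rather than hard-coded IDs. A short sketch using the module's documented `vpc_id` and `private_subnets` outputs:
# Reference the VPC created by the module instead of hard-coding IDs
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = module.vpc.vpc_id
}

output "private_subnet_ids" {
  value = module.vpc.private_subnets
}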
We have 50 web servers. Updating `nginx.conf` manually is impossible. We need an Ansible playbook to push the new config and reload Nginx without downtime.
- Copy `nginx.conf` to all web servers.
- Validate the configuration before reloading.
- Reload Nginx service only if config changed.
👨‍💻 Senior DevOps Solution
Strategy: Use the Ansible `template` module and `handlers` for efficient updates.
---
- name: Update Nginx Configuration
  hosts: webservers
  become: yes
  tasks:
    - name: Copy Nginx config
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        validate: 'nginx -t -c %s'
      notify: Reload Nginx
  handlers:
    - name: Reload Nginx
      service:
        name: nginx
        state: reloaded
Key Takeaway: The `validate` parameter prevents breaking the server with a bad config. `handlers` ensure the service only reloads when necessary.
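Before touching all 50 servers, a dry run shows exactly what would change; the inventory and playbook file names here are illustrative:
# Dry run: report the rendered config diff without changing anything
ansible-playbook -i inventory.ini nginx.yml --check --diff
# Real run (add `serial: 5` to the play to roll out 5 hosts per batch)
ansible-playbook -i inventory.ini nginx.yml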
We had an outage because a server was stuck at 100% CPU for 2 hours without anyone noticing. We need a Prometheus alert rule that notifies us via Slack.
- Trigger alert if CPU usage > 80% for more than 5 minutes.
- Send notification to #ops-alerts channel.
👨‍💻 Senior DevOps Solution
Strategy: Define a Prometheus alerting rule and route it to Slack through Alertmanager.
groups:
  - name: host-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }})"
Key Takeaway: The `for: 5m` clause prevents flapping alerts caused by temporary spikes.
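The rule above only fires the alert; delivering it to #ops-alerts is Alertmanager's job. A minimal receiver sketch, with the Slack webhook URL left as a placeholder:
route:
  receiver: slack-ops
receivers:
  - name: slack-ops
    slack_configs:
      - channel: "#ops-alerts"
        api_url: "https://hooks.slack.com/services/..." # placeholder webhook URL
        send_resolved: true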