Chapter 9: Continuous Integration and Continuous Deployment
Learning Objectives
By the end of this chapter, you will be able to:
- Explain the principles and benefits of continuous integration and continuous deployment
- Distinguish between continuous integration, continuous delivery, and continuous deployment
- Design and implement CI/CD pipelines using GitHub Actions
- Configure automated builds, tests, and deployments
- Implement deployment strategies including blue-green, canary, and rolling deployments
- Manage environment configurations and secrets securely
- Monitor deployments and implement rollback procedures
- Apply infrastructure as code principles for reproducible environments
- Troubleshoot common CI/CD pipeline issues
9.1 The Evolution of Software Delivery
Software delivery has transformed dramatically over the past decades. What once took months or years now happens in minutes. Understanding this evolution helps you appreciate why CI/CD practices exist and why they matter.
9.1.1 The Old Way: Manual Releases
In traditional software development, releases were major events:
Traditional Release Process (weeks to months)
═══════════════════════════════════════════════════════════════
Development Phase (weeks)
│
▼
Code Freeze
│
▼
Integration Phase (days to weeks)
├── Merge all developer branches
├── Fix integration conflicts
└── Stabilize combined code
│
▼
Testing Phase (days to weeks)
├── QA team tests entire application
├── Bug fixes and retesting
└── Sign-off from stakeholders
│
▼
Release Preparation (days)
├── Create release branch
├── Build release artifacts
├── Write release notes
└── Prepare deployment scripts
│
▼
Deployment (hours to days)
├── Schedule maintenance window
├── Notify users of downtime
├── Manual server updates
├── Database migrations
├── Smoke testing
└── Prayer and hope
│
▼
Post-Release (days)
├── Monitor for issues
├── Hotfix critical bugs
└── Begin next development cycle
Problems with this approach:
- Integration hell: Merging weeks of isolated work caused massive conflicts
- Long feedback loops: Bugs weren’t discovered until late in the cycle
- Risky deployments: Large changes meant large risks
- Infrequent releases: Customers waited months for features and fixes
- Stressful releases: “Release weekends” became dreaded events
- Fear of change: Teams avoided changes to avoid risk
9.1.2 The CI/CD Revolution
Modern practices flip this model:
Modern CI/CD Process (minutes to hours)
═══════════════════════════════════════════════════════════════
Developer commits code
│
▼ (seconds)
Automated pipeline triggers
│
▼ (minutes)
┌─────────────────────────────────────────────────────────────┐
│ Build → Lint → Unit Tests → Integration Tests → Security │
└─────────────────────────────────────────────────────────────┘
│
▼ (minutes)
Deploy to staging environment
│
▼ (minutes)
Automated E2E tests on staging
│
▼ (automatic or one-click)
Deploy to production
│
▼ (continuous)
Monitoring and alerting
Benefits:
- Fast feedback: Know within minutes if changes break anything
- Small changes: Easier to review, test, and debug
- Reduced risk: Small, frequent deployments are safer than large, rare ones
- Faster delivery: Features reach users in hours, not months
- Happier teams: Routine deployments instead of stressful events
- Higher quality: Automated testing catches issues before users do
9.1.3 Key Terminology
Understanding the distinctions between related terms:
┌─────────────────────────────────────────────────────────────────────────┐
│ CI/CD TERMINOLOGY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CONTINUOUS INTEGRATION (CI) │
│ ───────────────────────── │
│ • Developers integrate code frequently (at least daily) │
│ • Each integration triggers automated build and tests │
│ • Problems detected early, when they're easy to fix │
│ • Main branch stays stable and deployable │
│ │
│ CONTINUOUS DELIVERY (CD) │
│ ──────────────────────── │
│ • Code is always in a deployable state │
│ • Automated pipeline prepares release artifacts │
│ • Deployment to production requires manual approval │
│ • "Push-button" releases whenever business decides │
│ │
│ CONTINUOUS DEPLOYMENT (CD) │
│ ───────────────────────── │
│ • Every change that passes tests deploys automatically │
│ • No manual intervention required │
│ • Highest level of automation │
│ • Requires mature testing and monitoring │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Visual comparison:
                    Continuous     Continuous      Continuous
                    Integration    Delivery        Deployment
Code Commit         ●              ●               ●
Build               ● Automated    ● Automated     ● Automated
Test                ● Automated    ● Automated     ● Automated
Deploy to Staging   ○ Optional     ● Automated     ● Automated
Deploy to Prod      ○ Manual       ◐ One-click     ● Automated
9.2 Continuous Integration Fundamentals
Continuous Integration (CI) is the practice of frequently integrating code changes into a shared repository, where each integration is verified by automated builds and tests.
9.2.1 Core CI Practices
1. Maintain a Single Source Repository
All code lives in version control. Everyone works from the same repository.
Repository Structure:
├── main branch (always deployable)
├── feature branches (short-lived)
└── All configuration in version control
├── Application code
├── Test code
├── Build scripts
├── Infrastructure definitions
└── CI/CD pipeline definitions
2. Automate the Build
Building software should require a single command:
# One command to build everything
npm run build
# or
./gradlew build
# or
make all

The build should:
- Compile all code
- Run static analysis
- Generate artifacts
- Be reproducible (same inputs → same outputs)
3. Make the Build Self-Testing
Every build runs automated tests:
# Build includes tests
npm run build # Compiles and runs tests
npm test # Just tests
# Build fails if tests fail
$ npm test
FAIL src/calculator.test.js
✕ adds numbers correctly (5ms)
npm ERR! Test failed.

4. Everyone Commits Frequently
Integrate at least daily—more often is better:
Good:
Monday: 3 commits
Tuesday: 4 commits
Wednesday: 2 commits
Thursday: 5 commits
Friday: 3 commits
Bad:
Monday-Thursday: Working locally...
Friday: 1 massive commit with a week's work
5. Every Commit Triggers a Build
Automated systems build and test every change:
Commit pushed
│
▼
CI server detects change
│
▼
Pipeline executes automatically
│
├── Success → Green checkmark ✓
│
└── Failure → Red X, team notified ✗
6. Keep the Build Fast
Fast feedback is essential. Target build times:
Build Stage Target Time
─────────────────────────────────
Lint < 30 seconds
Unit tests < 5 minutes
Integration tests < 10 minutes
Full pipeline < 15 minutes
If build takes > 15 minutes, consider:
• Parallelizing tests
• Optimizing slow tests
• Splitting pipeline stages
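If the suite itself is the bottleneck, parallelizing is often the quickest win. A sketch of test sharding with a GitHub Actions matrix and Jest's --shard option (Jest 28+); the script and shard count are assumptions to adapt to your project:

```yaml
# Sketch: split one slow test suite across 4 parallel runners.
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      # Each job runs only its slice of the suite
      - run: npx jest --shard=${{ matrix.shard }}/4
```

Wall-clock time drops to roughly the slowest shard, at the cost of four runners instead of one.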
7. Test in a Clone of Production
Test environments should mirror production:
Production Environment
├── Ubuntu 22.04
├── Node.js 20.x
├── PostgreSQL 15
├── Redis 7
└── nginx 1.24
CI Test Environment (should match!)
├── Ubuntu 22.04
├── Node.js 20.x
├── PostgreSQL 15
├── Redis 7
└── nginx 1.24
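In GitHub Actions this means pinning versions explicitly rather than relying on "latest" labels. A sketch mirroring the version list above:

```yaml
# Sketch: pin CI to the versions production actually runs.
jobs:
  test:
    runs-on: ubuntu-22.04    # pin the OS; ubuntu-latest can silently change
    services:
      postgres:
        image: postgres:15   # match production, never :latest
      redis:
        image: redis:7
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20' # match production Node.js
```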
8. Make It Easy to Get Latest Deliverables
Anyone should be able to get the latest working version:
# Get latest artifacts
aws s3 cp s3://builds/latest/app.zip .
# Or use package registry
npm install @company/app@latest
docker pull company/app:latest

9. Everyone Can See What's Happening
Build status is visible to all:
┌─────────────────────────────────────────────────────────────────────────┐
│ CI DASHBOARD │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ main branch: ✓ Build #1234 passed (3m ago) │
│ develop branch: ✓ Build #567 passed (15m ago) │
│ feature/auth: ✗ Build #89 failed - Test failure (1h ago) │
│ feature/api: ◐ Build #90 in progress... │
│ │
│ Recent Activity: │
│ ├── alice: Merged PR #142 into main │
│ ├── bob: Fixed failing test in feature/auth │
│ └── carol: Opened PR #143 for review │
│ │
└─────────────────────────────────────────────────────────────────────────┘
10. Automate Deployment
Deployment should be automated, not manual:
# Not this:
ssh production-server
cd /var/www/app
git pull
npm install
npm run build
pm2 restart all
# This:
git push origin main  # Triggers automated deployment

9.2.2 The CI Feedback Loop
CI creates a rapid feedback loop:
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ Write Code ──────► Commit ──────► CI Pipeline ──────► Feedback │
│ ▲ │ │
│ │ │ │
│ │ ┌──────────────────────┐ │ │
│ │ │ │ │ │
│ └──────────────┤ Fix if broken ◄──────────────┘ │
│ │ Continue if passing│ │
│ │ │ │
│ └──────────────────────┘ │
│ │
│ Feedback Time: Minutes, not days │
│ │
└─────────────────────────────────────────────────────────────────────────┘
When the build breaks:
- Stop what you’re doing
- Fix the build immediately
- Don’t commit more broken code on top
“The first rule of Continuous Integration is: when the build breaks, fixing it becomes the team’s top priority.”
9.2.3 CI Anti-Patterns
Ignoring Broken Builds:
❌ "The build's been red for a week, but we're too busy to fix it."
✓ Fix broken builds immediately. A red build is an emergency.
Infrequent Integration:
❌ Committing once a week with massive changes
✓ Commit multiple times daily with small changes
Skipping Tests:
❌ "I'll add tests later" or "Tests are too slow, skip them"
✓ Tests are non-negotiable. Optimize slow tests.
Not Running Pipeline Locally:
❌ "It works on my machine" → Push → CI fails
✓ Run the same checks locally before pushing
Long-Lived Feature Branches:
❌ Feature branch that diverges for months
✓ Short-lived branches, merged within days
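One way to avoid "works on my machine" surprises is a Git pre-push hook that runs the same checks CI runs. A minimal sketch — the two run_check lines are placeholders standing in for your project's real commands (e.g. npm run lint, npm test):

```shell
#!/bin/sh
# Sketch of a .git/hooks/pre-push hook (must be marked executable).
# Any failing check exits non-zero, which aborts the push.
set -e

run_check() {
  echo "pre-push: running '$*'"
  "$@"
}

run_check sh -c 'exit 0'   # placeholder for: npm run lint
run_check sh -c 'exit 0'   # placeholder for: npm test

echo "pre-push: all checks passed"
```

Because the hook exits non-zero on the first failing check, broken code never leaves the developer's machine.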
9.3 Building CI Pipelines with GitHub Actions
GitHub Actions is GitHub’s built-in CI/CD platform. It’s free for public repositories and offers a generous free tier for private ones.
9.3.1 GitHub Actions Concepts
┌─────────────────────────────────────────────────────────────────────────┐
│ GITHUB ACTIONS HIERARCHY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ WORKFLOW │
│ ├── Defined in .github/workflows/*.yml │
│ ├── Triggered by events (push, PR, schedule, etc.) │
│ └── Contains one or more jobs │
│ │
│ JOB │
│ ├── Runs on a specific runner (ubuntu, windows, macos) │
│ ├── Contains one or more steps │
│ ├── Jobs run in parallel by default │
│ └── Can depend on other jobs │
│ │
│ STEP │
│ ├── Individual task within a job │
│ ├── Either runs a command or uses an action │
│ └── Steps run sequentially │
│ │
│ ACTION │
│ ├── Reusable unit of code │
│ ├── Published in GitHub Marketplace │
│ └── Example: actions/checkout, actions/setup-node │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Visual representation:
Workflow: ci.yml
│
├── Job: lint
│ ├── Step: Checkout code
│ ├── Step: Setup Node.js
│ └── Step: Run linter
│
├── Job: test (depends on: lint)
│ ├── Step: Checkout code
│ ├── Step: Setup Node.js
│ ├── Step: Install dependencies
│ └── Step: Run tests
│
└── Job: build (depends on: test)
├── Step: Checkout code
├── Step: Setup Node.js
├── Step: Build application
└── Step: Upload artifacts
9.3.2 Basic Workflow Structure
# .github/workflows/ci.yml
# Workflow name (displayed in GitHub UI)
name: CI
# Triggers - when should this workflow run?
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
# Jobs to execute
jobs:
# Job identifier
build:
# Runner environment
runs-on: ubuntu-latest
# Job steps
steps:
# Use a pre-built action
- name: Checkout repository
uses: actions/checkout@v4
# Run a shell command
- name: Display Node version
run: node --version
# Multi-line command
- name: Install and test
run: |
npm ci
npm test

9.3.3 Complete CI Pipeline Example
Here’s a comprehensive CI pipeline for a Node.js application:
# .github/workflows/ci.yml
name: CI Pipeline
on:
push:
branches: [main, develop]
pull_request:
branches: [main, develop]
# Environment variables available to all jobs
env:
NODE_VERSION: '20'
jobs:
# ============================================
# JOB 1: Code Quality Checks
# ============================================
lint:
name: Lint & Format
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run ESLint
run: npm run lint
- name: Check Prettier formatting
run: npm run format:check
- name: Run TypeScript compiler
run: npm run type-check
# ============================================
# JOB 2: Unit Tests
# ============================================
unit-tests:
name: Unit Tests
runs-on: ubuntu-latest
needs: lint # Only run if lint passes
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run unit tests
run: npm test -- --coverage --reporters=default --reporters=jest-junit
env:
JEST_JUNIT_OUTPUT_DIR: ./reports
- name: Upload coverage report
uses: actions/upload-artifact@v4
with:
name: coverage-report
path: coverage/
- name: Upload test results
uses: actions/upload-artifact@v4
if: always() # Upload even if tests fail
with:
name: test-results
path: reports/junit.xml
# ============================================
# JOB 3: Integration Tests
# ============================================
integration-tests:
name: Integration Tests
runs-on: ubuntu-latest
needs: lint
# Service containers for integration tests
services:
postgres:
image: postgres:15
env:
POSTGRES_USER: testuser
POSTGRES_PASSWORD: testpass
POSTGRES_DB: testdb
ports:
- 5432:5432
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
redis:
image: redis:7
ports:
- 6379:6379
options: >-
--health-cmd "redis-cli ping"
--health-interval 10s
--health-timeout 5s
--health-retries 5
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run database migrations
run: npm run db:migrate
env:
DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
- name: Run integration tests
run: npm run test:integration
env:
DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
REDIS_URL: redis://localhost:6379
# ============================================
# JOB 4: Build
# ============================================
build:
name: Build Application
runs-on: ubuntu-latest
needs: [unit-tests, integration-tests]
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Build application
run: npm run build
env:
NODE_ENV: production
- name: Upload build artifacts
uses: actions/upload-artifact@v4
with:
name: build-output
path: dist/
retention-days: 7
# ============================================
# JOB 5: Security Scan
# ============================================
security:
name: Security Scan
runs-on: ubuntu-latest
needs: lint
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run npm audit
run: npm audit --audit-level=high
- name: Run Snyk security scan
uses: snyk/actions/node@master
continue-on-error: true # Don't fail build, just report
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
with:
args: --severity-threshold=high
# ============================================
# JOB 6: E2E Tests (only on main/develop)
# ============================================
e2e-tests:
name: E2E Tests
runs-on: ubuntu-latest
needs: build
if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/develop'
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Download build artifacts
uses: actions/download-artifact@v4
with:
name: build-output
path: dist/
- name: Run Cypress tests
uses: cypress-io/github-action@v6
with:
start: npm run start:test
wait-on: 'http://localhost:3000'
wait-on-timeout: 120
- name: Upload Cypress screenshots
uses: actions/upload-artifact@v4
if: failure()
with:
name: cypress-screenshots
path: cypress/screenshots/
- name: Upload Cypress videos
uses: actions/upload-artifact@v4
if: always()
with:
name: cypress-videos
path: cypress/videos/

9.3.4 Workflow Triggers
on:
# Push to specific branches
push:
branches:
- main
- 'release/**' # Wildcard pattern
paths:
- 'src/**' # Only when src/ changes
- '!**.md' # Ignore markdown files
# Pull request events
pull_request:
types: [opened, synchronize, reopened]
branches: [main]
# Scheduled runs (cron syntax)
schedule:
- cron: '0 2 * * *' # Daily at 2 AM UTC
# Manual trigger
workflow_dispatch:
inputs:
environment:
description: 'Environment to deploy to'
required: true
default: 'staging'
type: choice
options:
- staging
- production
# Triggered by another workflow
workflow_call:
inputs:
version:
required: true
type: string
# Repository events
release:
types: [published]
issues:
types: [opened, labeled]

9.3.5 Job Dependencies and Parallelization
jobs:
# These run in parallel (no dependencies)
lint:
runs-on: ubuntu-latest
steps: [...]
security:
runs-on: ubuntu-latest
steps: [...]
# This waits for lint to complete
test:
runs-on: ubuntu-latest
needs: lint
steps: [...]
# This waits for both lint AND security
build:
runs-on: ubuntu-latest
needs: [lint, security]
steps: [...]
# This waits for test AND build
deploy:
runs-on: ubuntu-latest
needs: [test, build]
steps: [...]

Execution flow:
┌──────┐      ┌──────────┐
│ lint │      │ security │
└─┬──┬─┘      └────┬─────┘
  │  │             │
  │  └──────┬──────┘
  ▼         ▼
┌──────┐ ┌───────┐
│ test │ │ build │
└───┬──┘ └───┬───┘
    │        │
    └───┬────┘
        ▼
   ┌────────┐
   │ deploy │
   └────────┘
9.3.6 Matrix Builds
Test across multiple versions and platforms:
jobs:
test:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false # Sits under strategy (not matrix); keep other jobs running if one fails
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
node-version: [18, 20, 22]
exclude:
# Don't test Node 18 on macOS
- os: macos-latest
node-version: 18
include:
# Add specific configuration
- os: ubuntu-latest
node-version: 20
coverage: true
steps:
- uses: actions/checkout@v4
- name: Setup Node.js ${{ matrix.node-version }}
uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
- run: npm ci
- run: npm test
- name: Upload coverage
if: matrix.coverage
run: npm run coverage:upload

This creates 8 parallel jobs (3 OS × 3 Node versions − 1 exclusion; the include entry only adds a variable to an existing combination, so it doesn't create a ninth job).
9.3.7 Caching Dependencies
Speed up pipelines by caching dependencies:
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
# Automatic caching with setup-node
- uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm' # Caches the npm download cache (~/.npm), not node_modules
- run: npm ci
- run: npm run build
# Manual caching for more control
build-manual-cache:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Cache node modules
id: cache-npm
uses: actions/cache@v4
with:
path: ~/.npm
key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
restore-keys: |
${{ runner.os }}-node-
- name: Cache build output
uses: actions/cache@v4
with:
path: dist
key: ${{ runner.os }}-build-${{ hashFiles('src/**') }}
- run: npm ci
run: npm run build

9.3.8 Secrets and Environment Variables
jobs:
deploy:
runs-on: ubuntu-latest
# Environment with protection rules
environment:
name: production
url: https://example.com
env:
# Available to all steps in this job
NODE_ENV: production
steps:
- uses: actions/checkout@v4
- name: Deploy to production
run: |
echo "Deploying to $DEPLOY_URL"
./deploy.sh
env:
# Secrets from repository settings
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
DEPLOY_URL: ${{ vars.PRODUCTION_URL }}
# GitHub-provided variables
GITHUB_SHA: ${{ github.sha }}
GITHUB_REF: ${{ github.ref }}

Setting up secrets:
Repository Settings → Secrets and variables → Actions
Repository secrets:
├── AWS_ACCESS_KEY_ID
├── AWS_SECRET_ACCESS_KEY
├── DATABASE_URL
└── API_KEY
Environment secrets (per environment):
├── production
│ ├── DATABASE_URL (production database)
│ └── API_KEY (production API key)
└── staging
├── DATABASE_URL (staging database)
└── API_KEY (staging API key)
9.4 Continuous Deployment Strategies
Deploying to production requires careful strategies to minimize risk and enable quick rollbacks.
9.4.1 Deployment Strategies Overview
┌─────────────────────────────────────────────────────────────────────────┐
│ DEPLOYMENT STRATEGIES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ RECREATE │
│ • Stop old version, start new version │
│ • Simple but causes downtime │
│ • Use for: Non-critical apps, major database migrations │
│ │
│ ROLLING │
│ • Gradually replace instances │
│ • Zero downtime │
│ • Use for: Most applications │
│ │
│ BLUE-GREEN │
│ • Two identical environments │
│ • Switch traffic instantly │
│ • Use for: Critical apps needing instant rollback │
│ │
│ CANARY │
│ • Deploy to small subset first │
│ • Gradually increase if healthy │
│ • Use for: Risk-averse deployments, A/B testing │
│ │
│ FEATURE FLAGS │
│ • Deploy code, enable features separately │
│ • Instant enable/disable without deployment │
│ • Use for: Trunk-based development, gradual rollouts │
│ │
└─────────────────────────────────────────────────────────────────────────┘
9.4.2 Recreate Deployment
The simplest strategy: stop everything, deploy, start everything.
Before:
┌─────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ │ │
│ ├──► Server 1: v1.0 ● │
│ ├──► Server 2: v1.0 ● │
│ └──► Server 3: v1.0 ● │
└─────────────────────────────────────────────────────────────────────────┘
During deployment (DOWNTIME):
┌─────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ │ │
│ ├──► Server 1: Updating... ○ │
│ ├──► Server 2: Updating... ○ │
│ └──► Server 3: Updating... ○ │
└─────────────────────────────────────────────────────────────────────────┘
After:
┌─────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ │ │
│ ├──► Server 1: v2.0 ● │
│ ├──► Server 2: v2.0 ● │
│ └──► Server 3: v2.0 ● │
└─────────────────────────────────────────────────────────────────────────┘
Pros:
- Simple to implement
- Clean state—no version mixing
Cons:
- Causes downtime
- All-or-nothing risk
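For completeness, Kubernetes can declare this strategy explicitly. A minimal sketch (deployment name and image are placeholders):

```yaml
# Sketch: all v1.0 pods are terminated before any v2.0 pods start,
# which is exactly the downtime window shown above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  strategy:
    type: Recreate
  template:
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0
```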
9.4.3 Rolling Deployment
Update instances one at a time, maintaining availability:
Step 1: Update Server 1
┌─────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ │ │
│ ├──► Server 1: v2.0 ● (updated) │
│ ├──► Server 2: v1.0 ● │
│ └──► Server 3: v1.0 ● │
└─────────────────────────────────────────────────────────────────────────┘
Step 2: Update Server 2
┌─────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ │ │
│ ├──► Server 1: v2.0 ● │
│ ├──► Server 2: v2.0 ● (updated) │
│ └──► Server 3: v1.0 ● │
└─────────────────────────────────────────────────────────────────────────┘
Step 3: Update Server 3
┌─────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ │ │
│ ├──► Server 1: v2.0 ● │
│ ├──► Server 2: v2.0 ● │
│ └──► Server 3: v2.0 ● (updated) │
└─────────────────────────────────────────────────────────────────────────┘
Implementation (Kubernetes):
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Max extra pods during update
maxUnavailable: 0 # Never reduce below desired count
template:
spec:
containers:
- name: myapp
image: myapp:v2.0

Pros:
- Zero downtime
- Gradual rollout
- Easy to implement
Cons:
- Multiple versions running simultaneously
- Slower than recreate
- Rollback requires another rolling update
9.4.4 Blue-Green Deployment
Maintain two identical environments. Switch traffic instantly.
BLUE Environment (current production):
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ Server 1: v1.0 ● │
│ Server 2: v1.0 ● ◄──── 100% Traffic │
│ Server 3: v1.0 ● │
│ │
└─────────────────────────────────────────────────────────────────────────┘
GREEN Environment (staging new version):
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ Server 1: v2.0 ● │
│ Server 2: v2.0 ● ◄──── 0% Traffic (testing) │
│ Server 3: v2.0 ● │
│ │
└─────────────────────────────────────────────────────────────────────────┘
After switch:
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ BLUE: v1.0 ●●● ◄──── 0% Traffic (standby for rollback) │
│ │
│ GREEN: v2.0 ●●● ◄──── 100% Traffic │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Implementation with AWS/Route 53:
# GitHub Actions blue-green deployment
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- name: Deploy to green environment
run: |
aws ecs update-service \
--cluster production \
--service myapp-green \
--task-definition myapp:${{ github.sha }}
- name: Wait for green to be healthy
run: |
aws ecs wait services-stable \
--cluster production \
--services myapp-green
- name: Run smoke tests on green
run: |
curl -f https://green.example.com/health
npm run test:smoke -- --url=https://green.example.com
- name: Switch traffic to green
run: |
aws route53 change-resource-record-sets \
--hosted-zone-id ${{ secrets.HOSTED_ZONE_ID }} \
--change-batch file://switch-to-green.json
- name: Keep blue as rollback
run: |
echo "Blue environment available for rollback"
echo "To rollback, switch DNS back to blue"

Pros:
- Instant switch and rollback
- Full testing before going live
- Zero downtime
Cons:
- Requires double infrastructure
- More expensive
- Database migrations are tricky
9.4.5 Canary Deployment
Deploy to a small percentage of users first, then gradually increase:
Step 1: Deploy to 5% (canary)
┌─────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ │ │
│ ├──► 95% ──► Production (v1.0): ●●●●●●●●●● │
│ │ │
│ └──► 5% ──► Canary (v2.0): ● │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Step 2: Monitor metrics. If healthy, increase to 25%
┌─────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ │ │
│ ├──► 75% ──► Production (v1.0): ●●●●●●●● │
│ │ │
│ └──► 25% ──► Canary (v2.0): ●●● │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Step 3: Continue to 50%, 75%, 100%
┌─────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ │ │
│ └──► 100% ──► New Production (v2.0): ●●●●●●●●●● │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Canary with Kubernetes and Istio:
# VirtualService for traffic splitting
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: myapp
spec:
hosts:
- myapp.example.com
http:
- route:
- destination:
host: myapp-stable
port:
number: 80
weight: 95
- destination:
host: myapp-canary
port:
number: 80
weight: 5

Automated Canary Analysis:
# GitHub Actions canary deployment
jobs:
canary:
runs-on: ubuntu-latest
steps:
- name: Deploy canary (5%)
run: kubectl apply -f canary-5-percent.yaml
- name: Wait and analyze metrics
run: |
sleep 300 # Wait 5 minutes
# Check error rate (extract the scalar from the Prometheus JSON response)
ERROR_RATE=$(curl -s "prometheus/api/v1/query?query=error_rate{version='canary'}" | jq -r '.data.result[0].value[1]')
# Shell [ -gt ] only compares integers; use bc for the floating-point check
if [ "$(echo "$ERROR_RATE > 0.01" | bc -l)" -eq 1 ]; then
echo "Error rate too high, rolling back"
kubectl apply -f rollback.yaml
exit 1
fi
# Check p99 latency (milliseconds)
LATENCY=$(curl -s "prometheus/api/v1/query?query=p99_latency{version='canary'}" | jq -r '.data.result[0].value[1]')
if [ "$(echo "$LATENCY > 500" | bc -l)" -eq 1 ]; then
echo "Latency too high, rolling back"
kubectl apply -f rollback.yaml
exit 1
fi
- name: Increase to 25%
run: kubectl apply -f canary-25-percent.yaml
# ... continue pattern ...
- name: Full rollout
run: kubectl apply -f full-rollout.yaml

Pros:
- Minimal blast radius if issues
- Real production testing
- Data-driven promotion decisions
Cons:
- Complex to implement
- Requires good monitoring
- Multiple versions in production
9.4.6 Feature Flags
Deploy code to everyone but enable features selectively:
// Feature flag implementation
const LaunchDarkly = require('launchdarkly-node-server-sdk');
const client = LaunchDarkly.init(process.env.LD_SDK_KEY);
app.get('/checkout', async (req, res) => {
const user = { key: req.user.id, email: req.user.email };
// Check if new checkout is enabled for this user
const newCheckoutEnabled = await client.variation(
'new-checkout-flow',
user,
false // Default value
);
if (newCheckoutEnabled) {
return res.render('checkout-v2');
} else {
return res.render('checkout-v1');
}
});

Feature flag strategies:
┌─────────────────────────────────────────────────────────────────────────┐
│ FEATURE FLAG STRATEGIES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ BOOLEAN FLAG │
│ • Simple on/off │
│ • Example: dark_mode_enabled: true/false │
│ │
│ PERCENTAGE ROLLOUT │
│ • Gradually enable for more users │
│ • Example: new_feature: 25% of users │
│ │
│ USER TARGETING │
│ • Enable for specific users/groups │
│ • Example: beta_feature: [user_ids: 1, 2, 3] │
│ │
│ ENVIRONMENT-BASED │
│ • Different values per environment │
│ • Example: debug_mode: true (dev), false (prod) │
│ │
│ A/B TESTING │
│ • Different variants for different users │
│ • Example: checkout_button: "Buy Now" vs "Purchase" │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Pros:
- Decouple deployment from release
- Instant enable/disable
- Enables A/B testing
Cons:
- Code complexity (if/else everywhere)
- Technical debt (old flags)
- Testing combinations is hard
9.5 Environment Management
Managing multiple environments (development, staging, production) is crucial for safe deployments.
9.5.1 Environment Hierarchy
┌─────────────────────────────────────────────────────────────────────────┐
│ ENVIRONMENT HIERARCHY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LOCAL DEVELOPMENT │
│ • Developer's machine │
│ • Local database, mock services │
│ • Fast iteration │
│ │ │
│ ▼ │
│ CI ENVIRONMENT │
│ • Automated builds and tests │
│ • Ephemeral (created/destroyed per build) │
│ • Isolated from other builds │
│ │ │
│ ▼ │
│ DEVELOPMENT/DEV │
│ • Shared development environment │
│ • Latest code from develop branch │
│ • May be unstable │
│ │ │
│ ▼ │
│ STAGING/QA │
│ • Production-like environment │
│ • Pre-production testing │
│ • Same infrastructure as production │
│ │ │
│ ▼ │
│ PRODUCTION │
│ • Live environment with real users │
│ • Highest security and monitoring │
│ • Changes require approval │
│ │
└─────────────────────────────────────────────────────────────────────────┘
9.5.2 Environment Configuration
Environment Variables:
# .env.development
NODE_ENV=development
DATABASE_URL=postgresql://localhost:5432/app_dev
REDIS_URL=redis://localhost:6379
API_URL=http://localhost:3000
LOG_LEVEL=debug
DEBUG=true
# .env.staging
NODE_ENV=staging
DATABASE_URL=postgresql://staging-db.example.com:5432/app
REDIS_URL=redis://staging-redis.example.com:6379
API_URL=https://staging-api.example.com
LOG_LEVEL=info
DEBUG=false
# .env.production
NODE_ENV=production
DATABASE_URL=postgresql://prod-db.example.com:5432/app
REDIS_URL=redis://prod-redis.example.com:6379
API_URL=https://api.example.com
LOG_LEVEL=warn
DEBUG=false

Configuration Management:
// config/index.js
const configs = {
development: {
database: {
host: 'localhost',
port: 5432,
name: 'app_dev',
pool: { min: 2, max: 10 }
},
cache: {
ttl: 60, // Short TTL for dev
enabled: false
},
features: {
newCheckout: true, // Enable all features in dev
darkMode: true
}
},
staging: {
database: {
host: process.env.DB_HOST,
port: 5432,
name: 'app_staging',
pool: { min: 5, max: 20 }
},
cache: {
ttl: 300,
enabled: true
},
features: {
newCheckout: true,
darkMode: true
}
},
production: {
database: {
host: process.env.DB_HOST,
port: 5432,
name: 'app_prod',
pool: { min: 10, max: 50 }
},
cache: {
ttl: 3600,
enabled: true
},
features: {
newCheckout: false, // Gradually enable via feature flags
darkMode: true
}
}
};
const env = process.env.NODE_ENV || 'development';
module.exports = configs[env];
12.6.3 9.5.3 GitHub Actions Environments
# .github/workflows/deploy.yml
name: Deploy
on:
push:
branches: [main]
jobs:
deploy-staging:
runs-on: ubuntu-latest
environment:
name: staging
url: https://staging.example.com
steps:
- uses: actions/checkout@v4
- name: Deploy to staging
run: ./deploy.sh staging
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
deploy-production:
runs-on: ubuntu-latest
needs: deploy-staging
environment:
name: production
url: https://example.com
steps:
- uses: actions/checkout@v4
- name: Deploy to production
run: ./deploy.sh production
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
Environment Protection Rules:
Repository Settings → Environments → production
Protection Rules:
☑ Required reviewers
• @team-leads
☑ Wait timer
• 30 minutes after staging deploy
☑ Restrict branches
• Only main branch can deploy
☑ Custom deployment branch policy
• Selected branches: main, release/*
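However environment-specific configuration and secrets reach the application, a common safeguard is to validate them at startup and fail fast with a clear error, instead of failing mysteriously at the first database call. A hedged sketch (the `requireEnv` helper and the variable list are illustrative, not part of any framework):

```javascript
// Fail-fast configuration check: crash at startup with an explicit
// message when a required environment variable is missing.
function requireEnv(names, env = process.env) {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(', ')}`
    );
  }
  // Return only the validated values, not the whole environment.
  return names.reduce((acc, name) => ({ ...acc, [name]: env[name] }), {});
}

// At application startup, before anything else runs:
//   const config = requireEnv(['NODE_ENV', 'DATABASE_URL', 'REDIS_URL']);
// Demonstrated here with an explicit env object:
const demo = requireEnv(['DATABASE_URL'], {
  DATABASE_URL: 'postgresql://localhost:5432/app_dev'
});
console.log(demo.DATABASE_URL);
```

A check like this turns "deployment succeeds but app broken" (a failure mode discussed later in this chapter) into an immediate, diagnosable crash during the deploy.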
12.6.4 9.5.4 Secrets Management
Never commit secrets:
# .gitignore
.env
.env.*
!.env.example
*.pem
*.key
secrets/
Use environment-specific secrets:
# GitHub Actions
steps:
- name: Deploy
env:
# Different secrets per environment
DB_PASSWORD: ${{ secrets.DB_PASSWORD }} # Set per environment
API_KEY: ${{ secrets.API_KEY }}
Secret rotation:
# Scheduled secret rotation check
name: Secret Rotation Check
on:
schedule:
- cron: '0 9 * * 1' # Every Monday at 9 AM
jobs:
check-secrets:
runs-on: ubuntu-latest
steps:
- name: Check secret age
run: |
# Check if secrets are older than 90 days
# Alert if rotation needed
./scripts/check-secret-age.sh
- name: Send alert
if: failure()
uses: actions/github-script@v7
with:
script: |
github.rest.issues.create({
owner: context.repo.owner,
repo: context.repo.repo,
title: 'Secret rotation required',
body: 'Secrets are older than 90 days and should be rotated.'
})
12.7 9.6 Infrastructure as Code
Infrastructure as Code (IaC) treats infrastructure configuration as software—versioned, reviewed, and automated.
12.7.1 9.6.1 Why Infrastructure as Code?
┌─────────────────────────────────────────────────────────────────────────┐
│ MANUAL vs. INFRASTRUCTURE AS CODE │
├────────────────────────────────────┬────────────────────────────────────┤
│ MANUAL │ INFRASTRUCTURE AS CODE │
├────────────────────────────────────┼────────────────────────────────────┤
│ Click through AWS console │ Define in code files │
│ Document steps in wiki │ Code IS the documentation │
│ "Works on my AWS account" │ Reproducible anywhere │
│ Drift from documented state │ Version controlled │
│ Slow to recreate │ Fast to provision │
│ Hard to review changes │ Pull request review │
│ Inconsistent environments │ Identical environments │
│ Scary to modify │ Confident changes │
└────────────────────────────────────┴────────────────────────────────────┘
12.7.2 9.6.2 Docker for Application Infrastructure
Dockerfile:
# Build stage
FROM node:20-alpine AS builder
WORKDIR /app
# Copy package files first (better caching)
COPY package*.json ./
RUN npm ci
# Copy source and build
COPY . .
RUN npm run build
# Production stage
FROM node:20-alpine AS production
WORKDIR /app
# Create non-root user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
# Copy only production dependencies and build output
COPY --from=builder /app/package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
# Use non-root user
USER appuser
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
EXPOSE 3000
CMD ["node", "dist/server.js"]
Docker Compose for local development:
# docker-compose.yml
version: '3.8'
services:
app:
build:
context: .
dockerfile: Dockerfile
target: builder # Use builder stage for dev
ports:
- "3000:3000"
environment:
- NODE_ENV=development
- DATABASE_URL=postgresql://postgres:postgres@db:5432/app_dev
- REDIS_URL=redis://redis:6379
volumes:
- .:/app
- /app/node_modules # Don't override node_modules
depends_on:
db:
condition: service_healthy
redis:
condition: service_started
command: npm run dev
db:
image: postgres:15
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: app_dev
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
timeout: 5s
retries: 5
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis_data:/data
# Test database for integration tests
db-test:
image: postgres:15
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: app_test
ports:
- "5433:5432"
volumes:
postgres_data:
redis_data:
12.7.3 9.6.3 Terraform for Cloud Infrastructure
Basic Terraform structure:
# main.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "my-terraform-state"
key = "prod/terraform.tfstate"
region = "us-east-1"
}
}
provider "aws" {
region = var.aws_region
}
# VPC
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
tags = {
Name = "${var.project_name}-vpc"
Environment = var.environment
}
}
# Subnets
resource "aws_subnet" "public" {
count = 2
vpc_id = aws_vpc.main.id
cidr_block = "10.0.${count.index + 1}.0/24"
availability_zone = data.aws_availability_zones.available.names[count.index]
map_public_ip_on_launch = true
tags = {
Name = "${var.project_name}-public-${count.index + 1}"
}
}
# RDS Database
resource "aws_db_instance" "main" {
identifier = "${var.project_name}-db"
engine = "postgres"
engine_version = "15"
instance_class = var.db_instance_class
allocated_storage = 20
db_name = var.db_name
username = var.db_username
password = var.db_password
vpc_security_group_ids = [aws_security_group.db.id]
db_subnet_group_name = aws_db_subnet_group.main.name
backup_retention_period = 7
skip_final_snapshot = var.environment != "production"
tags = {
Name = "${var.project_name}-db"
Environment = var.environment
}
}
# ECS Cluster
resource "aws_ecs_cluster" "main" {
name = "${var.project_name}-cluster"
setting {
name = "containerInsights"
value = "enabled"
}
}
# ECS Service
resource "aws_ecs_service" "app" {
name = "${var.project_name}-service"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.app.arn
desired_count = var.app_count
launch_type = "FARGATE"
network_configuration {
subnets = aws_subnet.public[*].id
security_groups = [aws_security_group.app.id]
}
load_balancer {
target_group_arn = aws_lb_target_group.app.arn
container_name = "app"
container_port = 3000
}
}
Variables:
# variables.tf
variable "project_name" {
description = "Name of the project"
type = string
default = "taskflow"
}
variable "environment" {
description = "Deployment environment"
type = string
}
variable "aws_region" {
description = "AWS region"
type = string
default = "us-east-1"
}
variable "db_instance_class" {
description = "RDS instance class"
type = string
default = "db.t3.micro"
}
variable "app_count" {
description = "Number of app instances"
type = number
default = 2
}
Environments with workspaces:
# Create workspaces for each environment
terraform workspace new staging
terraform workspace new production
# Select workspace
terraform workspace select staging
# Apply with environment-specific variables
terraform apply -var-file="environments/staging.tfvars"
12.7.4 9.6.4 CI/CD for Infrastructure
# .github/workflows/infrastructure.yml
name: Infrastructure
on:
push:
branches: [main]
paths:
- 'terraform/**'
pull_request:
branches: [main]
paths:
- 'terraform/**'
jobs:
terraform-plan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: 1.6.0
- name: Terraform Init
run: terraform init
working-directory: terraform
- name: Terraform Format Check
run: terraform fmt -check
working-directory: terraform
- name: Terraform Validate
run: terraform validate
working-directory: terraform
- name: Terraform Plan
run: terraform plan -out=tfplan
working-directory: terraform
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
- name: Save plan
uses: actions/upload-artifact@v4
with:
name: tfplan
path: terraform/tfplan
terraform-apply:
runs-on: ubuntu-latest
needs: terraform-plan
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
environment: production
steps:
- uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
- name: Download plan
uses: actions/download-artifact@v4
with:
name: tfplan
path: terraform
- name: Terraform Init
run: terraform init
working-directory: terraform
- name: Terraform Apply
run: terraform apply -auto-approve tfplan
working-directory: terraform
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
12.8 9.7 Deployment Automation
12.8.1 9.7.1 Complete Deployment Pipeline
# .github/workflows/deploy.yml
name: Build and Deploy
on:
push:
branches: [main]
workflow_dispatch:
inputs:
environment:
description: 'Target environment'
required: true
default: 'staging'
type: choice
options:
- staging
- production
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
# ============================================
# Build and test
# ============================================
build:
runs-on: ubuntu-latest
outputs:
image_tag: ${{ steps.meta.outputs.tags }}
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run tests
run: npm test
- name: Build application
run: npm run build
- name: Log in to Container Registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=sha,prefix=
type=ref,event=branch
- name: Build and push Docker image
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
# ============================================
# Deploy to staging
# ============================================
deploy-staging:
needs: build
runs-on: ubuntu-latest
environment:
name: staging
url: https://staging.example.com
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: us-east-1
- name: Deploy to ECS
run: |
aws ecs update-service \
--cluster staging-cluster \
--service app-service \
--force-new-deployment
- name: Wait for deployment
run: |
aws ecs wait services-stable \
--cluster staging-cluster \
--services app-service
- name: Run smoke tests
run: |
npm run test:smoke -- --url=https://staging.example.com
# ============================================
# Deploy to production (requires approval)
# ============================================
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
environment:
name: production
url: https://example.com
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: us-east-1
- name: Deploy to production (Blue-Green)
run: |
# Deploy to green environment
aws ecs update-service \
--cluster production-cluster \
--service app-green \
--force-new-deployment
# Wait for green to be stable
aws ecs wait services-stable \
--cluster production-cluster \
--services app-green
- name: Run production smoke tests
run: |
npm run test:smoke -- --url=https://green.example.com
- name: Switch traffic to green
run: |
aws elbv2 modify-listener \
--listener-arn ${{ secrets.LISTENER_ARN }} \
--default-actions Type=forward,TargetGroupArn=${{ secrets.GREEN_TG_ARN }}
- name: Verify production
run: |
sleep 30
npm run test:smoke -- --url=https://example.com
- name: Create release
uses: actions/github-script@v7
with:
script: |
github.rest.repos.createRelease({
owner: context.repo.owner,
repo: context.repo.repo,
tag_name: `v${new Date().toISOString().split('T')[0]}-${context.sha.substring(0, 7)}`,
name: `Release ${new Date().toISOString().split('T')[0]}`,
body: `Deployed commit ${context.sha}`,
draft: false,
prerelease: false
})
12.8.2 9.7.2 Database Migrations in CI/CD
# Database migration job
migrate:
runs-on: ubuntu-latest
needs: build
environment: ${{ github.event.inputs.environment || 'staging' }}
steps:
- uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install dependencies
run: npm ci
- name: Run migrations
run: npm run db:migrate
env:
DATABASE_URL: ${{ secrets.DATABASE_URL }}
- name: Verify migration
run: npm run db:verify
env:
DATABASE_URL: ${{ secrets.DATABASE_URL }}
Safe migration practices:
// migrations/20241209_add_user_role.js
// ✓ Safe: Adding a column with default
exports.up = async (knex) => {
await knex.schema.alterTable('users', (table) => {
table.string('role').defaultTo('user');
});
};
exports.down = async (knex) => {
await knex.schema.alterTable('users', (table) => {
table.dropColumn('role');
});
};
// ✗ Dangerous: Renaming column (breaks running code)
// Instead, do it in phases:
// Phase 1: Add new column
exports.up_phase1 = async (knex) => {
await knex.schema.alterTable('users', (table) => {
table.string('full_name');
});
// Copy data
await knex.raw('UPDATE users SET full_name = name');
};
// Phase 2: Deploy code that uses both columns
// Phase 3: Remove old column (after all code updated)
exports.up_phase3 = async (knex) => {
await knex.schema.alterTable('users', (table) => {
table.dropColumn('name');
});
};
12.8.3 9.7.3 Rollback Procedures
# .github/workflows/rollback.yml
name: Rollback
on:
workflow_dispatch:
inputs:
environment:
description: 'Environment to rollback'
required: true
type: choice
options:
- staging
- production
version:
description: 'Version to rollback to (leave empty for previous)'
required: false
jobs:
rollback:
runs-on: ubuntu-latest
environment: ${{ github.event.inputs.environment }}
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: us-east-1
- name: Get previous task definition
id: previous
run: |
if [ -n "${{ github.event.inputs.version }}" ]; then
TASK_DEF="${{ github.event.inputs.version }}"
else
# Get second most recent task definition
TASK_DEF=$(aws ecs list-task-definitions \
--family-prefix myapp \
--sort DESC \
--max-items 2 \
--query 'taskDefinitionArns[1]' \
--output text)
fi
echo "task_def=$TASK_DEF" >> $GITHUB_OUTPUT
- name: Rollback ECS service
run: |
aws ecs update-service \
--cluster ${{ github.event.inputs.environment }}-cluster \
--service app-service \
--task-definition ${{ steps.previous.outputs.task_def }}
- name: Wait for rollback
run: |
aws ecs wait services-stable \
--cluster ${{ github.event.inputs.environment }}-cluster \
--services app-service
- name: Verify rollback
run: |
URL="https://${{ github.event.inputs.environment }}.example.com"
if [ "${{ github.event.inputs.environment }}" = "production" ]; then
URL="https://example.com"
fi
curl -f "$URL/health"
- name: Notify team
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "🔄 Rollback completed for ${{ github.event.inputs.environment }}",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Rollback completed*\n• Environment: ${{ github.event.inputs.environment }}\n• Version: ${{ steps.previous.outputs.task_def }}\n• Triggered by: ${{ github.actor }}"
}
}
]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
12.9 9.8 Monitoring and Observability
Deployment doesn’t end when code reaches production. Monitoring ensures the deployment is healthy.
12.9.1 9.8.1 The Three Pillars of Observability
┌─────────────────────────────────────────────────────────────────────────┐
│ THREE PILLARS OF OBSERVABILITY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LOGS │
│ ──── │
│ • Discrete events with context │
│ • Debug information │
│ • Audit trail │
│ • Example: "User 123 logged in at 2024-12-09T10:30:00Z" │
│ │
│ METRICS │
│ ─────── │
│ • Numeric measurements over time │
│ • Aggregatable and comparable │
│ • Alerts and dashboards │
│ • Example: request_duration_seconds{endpoint="/api/users"} = 0.125 │
│ │
│ TRACES │
│ ────── │
│ • Request flow across services │
│ • Latency breakdown │
│ • Dependency mapping │
│ • Example: Request -> API -> Database -> Cache -> Response │
│ │
└─────────────────────────────────────────────────────────────────────────┘
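The "logs" pillar is most useful when events are structured: one JSON object per line with consistent fields, which log aggregators can index and filter instead of grepping free text. A hedged sketch (the `createLogger` helper and field names are illustrative conventions, not a specific library's API):

```javascript
// Structured logging sketch: every event is a single JSON line, so
// tools like ELK, Loki, or CloudWatch can query on fields directly.
function createLogger(baseFields = {}) {
  const emit = (level, message, fields = {}) => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      ...baseFields, // e.g. service name, deployed version
      ...fields      // per-event context
    };
    console.log(JSON.stringify(entry));
    return entry; // returned only to make the sketch easy to test
  };
  return {
    info: (msg, f) => emit('info', msg, f),
    warn: (msg, f) => emit('warn', msg, f),
    error: (msg, f) => emit('error', msg, f)
  };
}

const log = createLogger({ service: 'api', version: '1.4.2' });
log.info('user logged in', { userId: 123, ip: '203.0.113.7' });
log.error('db query failed', { durationMs: 5021 });
```

Tagging every entry with the deployed version (as `baseFields` does here) makes it possible to tell at a glance whether an error spike started with a particular deployment.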
12.9.2 9.8.2 Key Metrics to Monitor
┌─────────────────────────────────────────────────────────────────────────┐
│ KEY DEPLOYMENT METRICS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE FOUR GOLDEN SIGNALS (Google SRE) │
│ │
│ 1. LATENCY │
│ • Request duration │
│ • p50, p95, p99 percentiles │
│ • Alert: p99 > 500ms │
│ │
│ 2. TRAFFIC │
│ • Requests per second │
│ • Concurrent users │
│ • Alert: Unusual spike or drop │
│ │
│ 3. ERRORS │
│ • Error rate (5xx responses) │
│ • Failed requests │
│ • Alert: Error rate > 1% │
│ │
│ 4. SATURATION │
│ • CPU utilization │
│ • Memory usage │
│ • Queue depth │
│ • Alert: CPU > 80% for 5 minutes │
│ │
└─────────────────────────────────────────────────────────────────────────┘
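Latency, traffic, and errors can be recorded in-process with a small accumulator; saturation usually comes from the host or runtime instead. The sketch below is an assumption-laden illustration (class and method names are invented), not a production metrics client — real services use prom-client, StatsD, or a vendor SDK.

```javascript
// Records three of the four golden signals per request.
class GoldenSignals {
  constructor() {
    this.durations = []; // ms, one entry per request (latency)
    this.requests = 0;   // traffic
    this.errors = 0;     // 5xx responses (errors)
  }

  record(durationMs, statusCode) {
    this.requests += 1;
    this.durations.push(durationMs);
    if (statusCode >= 500) this.errors += 1;
  }

  errorRate() {
    return this.requests === 0 ? 0 : this.errors / this.requests;
  }

  // Nearest-rank percentile over all recorded durations.
  percentile(p) {
    if (this.durations.length === 0) return 0;
    const sorted = [...this.durations].sort((a, b) => a - b);
    const idx = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
  }
}

const signals = new GoldenSignals();
[12, 18, 25, 40, 900].forEach((ms, i) =>
  signals.record(ms, i === 4 ? 500 : 200)
);
console.log(signals.errorRate());    // 0.2 -> far above a 1% threshold
console.log(signals.percentile(99)); // 900 -> p99 latency in ms
```

The slow failing request dominates p99 while barely moving p50, which is why the alerting examples above target p99 rather than averages.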
12.9.3 9.8.3 Health Checks
// healthcheck.js
const express = require('express');
const db = require('./db');
const redis = require('./redis');
const router = express.Router();
// Basic liveness check (is the process running?)
router.get('/health/live', (req, res) => {
res.status(200).json({ status: 'ok' });
});
// Readiness check (is the app ready to serve traffic?)
router.get('/health/ready', async (req, res) => {
const checks = {
database: false,
redis: false,
memory: false
};
try {
// Database check
await db.raw('SELECT 1');
checks.database = true;
} catch (error) {
console.error('Database health check failed:', error);
}
try {
// Redis check
await redis.ping();
checks.redis = true;
} catch (error) {
console.error('Redis health check failed:', error);
}
// Memory check (under 90% usage)
const memUsage = process.memoryUsage();
const heapUsedPercent = memUsage.heapUsed / memUsage.heapTotal;
checks.memory = heapUsedPercent < 0.9;
const allHealthy = Object.values(checks).every(v => v);
res.status(allHealthy ? 200 : 503).json({
status: allHealthy ? 'healthy' : 'unhealthy',
checks,
timestamp: new Date().toISOString()
});
});
// Detailed health for debugging
router.get('/health/details', async (req, res) => {
res.json({
version: process.env.APP_VERSION || 'unknown',
commit: process.env.GIT_COMMIT || 'unknown',
uptime: process.uptime(),
memory: process.memoryUsage(),
env: process.env.NODE_ENV,
timestamp: new Date().toISOString()
});
});
module.exports = router;
12.9.4 9.8.4 Post-Deployment Verification
# Post-deployment verification in CI/CD
verify-deployment:
runs-on: ubuntu-latest
needs: deploy
steps:
- name: Wait for deployment to stabilize
run: sleep 60
- name: Check health endpoint
run: |
for i in {1..5}; do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://example.com/health/ready)
if [ "$STATUS" = "200" ]; then
echo "Health check passed"
exit 0
fi
echo "Health check failed (attempt $i), waiting..."
sleep 10
done
echo "Health check failed after 5 attempts"
exit 1
- name: Check error rate
run: |
# Query Prometheus/Datadog for error rate
ERROR_RATE=$(curl -s "$PROMETHEUS_URL/api/v1/query?query=rate(http_requests_total{status=~'5..'}[5m])")
# Parse and check error rate
# Alert if > 1%
- name: Check response times
run: |
# Run quick performance check
npm run test:performance -- --url=https://example.com --threshold=500ms
- name: Rollback if unhealthy
if: failure()
run: |
echo "Deployment verification failed, initiating rollback"
gh workflow run rollback.yml -f environment=production
12.9.5 9.8.5 Alerting
# Example Prometheus alerting rules
groups:
- name: deployment-alerts
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
for: 2m
labels:
severity: critical
annotations:
summary: High error rate detected
description: Error rate is {{ $value | humanizePercentage }} over the last 5 minutes
- alert: HighLatency
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: High latency detected
description: p99 latency is {{ $value }}s
- alert: DeploymentFailed
expr: kube_deployment_status_replicas_ready / kube_deployment_spec_replicas < 1
for: 5m
labels:
severity: critical
annotations:
summary: Deployment not fully ready
description: Only {{ $value | humanizePercentage }} of pods are ready
12.10 9.9 Troubleshooting CI/CD
12.10.1 9.9.1 Common Issues and Solutions
┌─────────────────────────────────────────────────────────────────────────┐
│ COMMON CI/CD ISSUES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ISSUE: Build fails but works locally │
│ ───────────────────────────────────── │
│ Causes: │
│ • Different Node/Python version │
│ • Missing environment variables │
│ • Cached dependencies out of sync │
│ • OS differences (Windows vs Linux) │
│ │
│ Solutions: │
│ • Match CI versions to local versions │
│ • Use .nvmrc or .python-version │
│ • Clear CI cache │
│ • Use Docker for consistency │
│ │
│ ───────────────────────────────────── │
│ │
│ ISSUE: Flaky tests │
│ ───────────────── │
│ Causes: │
│ • Race conditions │
│ • Time-dependent tests │
│ • Shared test state │
│ • External dependencies │
│ │
│ Solutions: │
│ • Use proper async/await │
│ • Mock time-dependent code │
│ • Isolate test data │
│ • Mock external services │
│ │
│ ───────────────────────────────────── │
│ │
│ ISSUE: Slow pipelines │
│ ─────────────────── │
│ Causes: │
│ • No caching │
│ • Sequential jobs that could parallel │
│ • Large Docker images │
│ • Too many dependencies │
│ │
│ Solutions: │
│ • Cache dependencies │
│ • Parallelize jobs │
│ • Use multi-stage Docker builds │
│ • Split test suites │
│ │
│ ───────────────────────────────────── │
│ │
│ ISSUE: Deployment succeeds but app broken │
│ ────────────────────────────────────── │
│ Causes: │
│ • Missing environment variables │
│ • Database migration issues │
│ • Incompatible dependencies │
│ • Configuration drift │
│ │
│ Solutions: │
│ • Comprehensive smoke tests │
│ • Health check endpoints │
│ • Staging environment that mirrors prod │
│ • Infrastructure as Code │
│ │
└─────────────────────────────────────────────────────────────────────────┘
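One of the flaky-test fixes above — "mock time-dependent code" — needs no library at all if the code accepts an injectable clock. A hedged sketch (the `clock` parameter is an assumed convention, not a framework feature):

```javascript
// Flaky version: depends on the real clock, so the same test can pass
// or fail depending on when it runs.
//   const isExpired = (token) => Date.now() > token.expiresAt;

// Deterministic version: the clock is injected, defaulting to real
// time in production but fully controlled in tests.
function isExpired(token, clock = () => Date.now()) {
  return clock() > token.expiresAt;
}

// In tests, pass a fixed clock instead of sleeping or racing real time.
const token = { expiresAt: 1_000_000 };
console.log(isExpired(token, () => 999_999));   // false: not yet expired
console.log(isExpired(token, () => 1_000_001)); // true: past expiry
```

The same injection idea applies to random number generators and UUIDs — anything nondeterministic that a CI run cannot reproduce.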
12.10.2 9.9.2 Debugging Techniques
Debugging GitHub Actions:
jobs:
debug:
runs-on: ubuntu-latest
steps:
# Print all environment variables
- name: Debug environment
run: env | sort
# Print GitHub context
- name: Debug GitHub context
run: echo '${{ toJson(github) }}'
# Enable debug logging by setting the repository secret or variable
# ACTIONS_STEP_DEBUG to true (it has no effect as a step-level env var)
- name: Debug step
run: echo "Debug info"
# SSH into runner for debugging
- name: Setup tmate session
if: failure()
uses: mxschmitt/action-tmate@v3
timeout-minutes: 15
Debugging Docker builds:
# Build with verbose output
docker build --progress=plain -t myapp .
# Build specific stage
docker build --target builder -t myapp:builder .
# Run intermediate layer
docker run -it myapp:builder sh
# Check image layers
docker history myapp
# Inspect image
docker inspect myapp
12.10.3 9.9.3 CI/CD Best Practices Checklist
CI/CD BEST PRACTICES CHECKLIST
═══════════════════════════════════════════════════════════════
CONTINUOUS INTEGRATION
☐ Single source repository
☐ Automated builds on every commit
☐ Fast feedback (< 15 minutes)
☐ Self-testing builds
☐ Fix broken builds immediately
☐ Keep the build green
TESTING
☐ Unit tests with high coverage
☐ Integration tests for critical paths
☐ E2E tests for user journeys
☐ Tests run in CI
☐ No flaky tests
DEPLOYMENT
☐ Automated deployments
☐ Multiple environments (dev, staging, prod)
☐ Production-like staging
☐ Deployment approval for production
☐ Rollback procedure documented and tested
SECURITY
☐ Secrets in secret manager (not in code)
☐ Dependency scanning
☐ Security scanning in CI
☐ Least privilege for CI credentials
☐ Audit logging
MONITORING
☐ Health check endpoints
☐ Key metrics monitored
☐ Alerts configured
☐ Post-deployment verification
☐ Logging and tracing
DOCUMENTATION
☐ Pipeline documented
☐ Runbook for common issues
☐ Rollback procedure documented
☐ Environment configuration documented
12.11 9.10 Chapter Summary
Continuous Integration and Continuous Deployment transform how teams deliver software. By automating builds, tests, and deployments, teams can ship faster with higher quality and lower risk.
Key takeaways from this chapter:
Continuous Integration means integrating code frequently, with automated builds and tests verifying each change. Problems are caught early when they’re easiest to fix.
Continuous Delivery ensures code is always in a deployable state, with push-button releases to production.
Continuous Deployment goes further—every change that passes tests deploys automatically to production.
GitHub Actions provides powerful CI/CD capabilities with workflows defined in YAML, jobs that run in parallel or sequence, and matrix builds for testing across configurations.
Deployment strategies like rolling, blue-green, and canary deployments minimize risk and enable quick rollbacks.
Environment management requires careful configuration of development, staging, and production environments with proper secrets management.
Infrastructure as Code treats infrastructure like software—versioned, reviewed, and automated.
Monitoring and observability are essential for knowing whether deployments are healthy. The three pillars—logs, metrics, and traces—provide visibility into system behavior.
Troubleshooting CI/CD requires understanding common issues like environment differences, flaky tests, and slow pipelines.
12.12 9.11 Key Terms
| Term | Definition |
|---|---|
| Continuous Integration (CI) | Practice of frequently integrating code with automated verification |
| Continuous Delivery | Keeping code always deployable with push-button releases |
| Continuous Deployment | Automatically deploying every change that passes tests |
| Pipeline | Automated sequence of build, test, and deploy stages |
| Workflow | GitHub Actions term for an automated process |
| Job | Unit of work in a CI/CD pipeline |
| Runner | Machine that executes CI/CD jobs |
| Artifact | File or package produced by a build |
| Blue-Green Deployment | Strategy using two identical environments for instant switching |
| Canary Deployment | Gradual rollout to small percentage of users |
| Rolling Deployment | Updating instances one at a time |
| Feature Flag | Toggle to enable/disable features without deployment |
| Infrastructure as Code | Managing infrastructure through version-controlled files |
| Health Check | Endpoint that reports application health |
| Rollback | Reverting to a previous version after failed deployment |
12.13 9.12 Review Questions
Explain the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment.
What are the core practices of Continuous Integration? Why is each important?
Describe the structure of a GitHub Actions workflow. What are workflows, jobs, and steps?
Compare blue-green, canary, and rolling deployment strategies. When would you use each?
Why is “fix broken builds immediately” a critical CI principle?
How do you securely manage secrets in CI/CD pipelines?
What is Infrastructure as Code? What problems does it solve?
Explain the purpose of staging environments. How should they relate to production?
What are the four golden signals of monitoring? Why are they important for deployments?
A deployment succeeds but users report errors. What steps would you take to diagnose and resolve the issue?
12.14 9.13 Hands-On Exercises
12.14.1 Exercise 9.1: Basic CI Pipeline
Create a CI pipeline for your project:
- Create .github/workflows/ci.yml
- Configure triggers for push and pull requests
- Add jobs for:
- Linting
- Unit tests
- Build
- Verify the pipeline runs on a pull request
- Add a README badge showing build status
12.14.2 Exercise 9.2: Matrix Testing
Extend your CI pipeline with matrix builds:
- Test across multiple Node.js versions (18, 20, 22)
- Test on multiple operating systems (ubuntu, windows)
- Add a coverage job that only runs on one combination
- Verify all combinations pass
12.14.3 Exercise 9.3: Automated Deployment
Set up automated deployment to a hosting platform:
- Choose a platform (Vercel, Netlify, Render, or similar)
- Create deployment workflow triggered by main branch
- Add staging environment (deploy on all branches)
- Add production environment with approval requirement
- Document the deployment process
12.14.4 Exercise 9.4: Docker and CI
Containerize your application:
- Create a Dockerfile for your application
- Create docker-compose.yml for local development
- Add Docker build and push to CI pipeline
- Configure caching for faster builds
- Test the container locally and in CI
12.14.5 Exercise 9.5: Health Checks and Monitoring
Implement health checks:
- Add /health/live endpoint (basic liveness)
- Add /health/ready endpoint (checks dependencies)
- Add health check to Dockerfile
- Configure CI to verify health after deployment
- Document health check responses
12.14.6 Exercise 9.6: Rollback Procedure
Create and test a rollback procedure:
- Create rollback.yml workflow
- Accept environment and version as inputs
- Implement rollback logic (revert to previous version)
- Test rollback in staging environment
- Document the rollback procedure
12.14.7 Exercise 9.7: Complete CI/CD Pipeline
Build a complete pipeline integrating all concepts:
- Lint, test, and build on every commit
- Deploy to staging on develop branch
- Deploy to production on main with approval
- Include security scanning
- Post-deployment health verification
- Slack/Discord notification on deployment
- Document the entire pipeline
12.15 9.14 Further Reading
Books:
- Humble, J., & Farley, D. (2010). Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley.
- Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook. IT Revolution Press.
- Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. IT Revolution Press.
Online Resources:
- GitHub Actions Documentation: https://docs.github.com/en/actions
- Docker Documentation: https://docs.docker.com/
- Terraform Documentation: https://www.terraform.io/docs
- Martin Fowler’s CI/CD Articles: https://martinfowler.com/articles/continuousIntegration.html
- Google SRE Book: https://sre.google/sre-book/table-of-contents/
Tools:
- GitHub Actions: https://github.com/features/actions
- Docker: https://www.docker.com/
- Terraform: https://www.terraform.io/
- Kubernetes: https://kubernetes.io/
- ArgoCD: https://argoproj.github.io/cd/
12.16 References
Fowler, M. (2006). Continuous Integration. Retrieved from https://martinfowler.com/articles/continuousIntegration.html
Humble, J., & Farley, D. (2010). Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley.
Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations. IT Revolution Press.
Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media.
GitHub. (2024). GitHub Actions Documentation. Retrieved from https://docs.github.com/en/actions