Debugging a Broken CI Pipeline in GitHub Actions
“It Works on My Machine”
“What 5 CI/CD Failures Taught Us About Shipping Software”
In modern development, CI/CD pipelines are the backbone of shipping fast, safe, and consistent software. We trust them to be reliable, reproducible, and immune to the quirks of individual developer environments.
And yet…
Sometimes your perfectly functional local code hits the CI wall and crumbles. Other times, it sails through the CI — only to break in production. Occasionally, it works in both places — and still does the wrong thing.
This article shares five real-world debugging cases — drawn from production systems — where GitHub Actions CI/CD pipelines broke in surprising ways:
- A Node.js monorepo where environment paths caused broken builds in CI.
- A Python app where a test failed only in the cloud.
- A PHP microservice where CI skipped a critical script step.
- A Go project where a race condition was invisible locally.
- A sneaky Docker cache issue that served outdated code with a smile.
Each case was painful to debug. Each taught us something valuable. And each reinforces the same truth:
CI/CD is software too — and it needs testing, observability, and care like everything else.
Let’s dive in.
🧵 A Node.js Monorepo Where Environment Paths Caused Broken Builds in CI
…and Why “It Works on My Machine” Wasn’t Enough
The Modern Monorepo Setup
We had what looked like a clean, scalable architecture: a Node.js monorepo using Yarn Workspaces to manage multiple internal packages under one repository.
The structure:
/repo
  /packages
    /core          # shared logic & utilities
    /web           # web frontend using React
    /cli           # internal CLI tool
  package.json     # declares "workspaces": ["packages/*"]
Each subdirectory (/web, /cli, etc.) was its own package, with its own dependencies and scripts. The root managed dependencies across the entire workspace. Shared binaries like eslint, tsc, and our custom CLI tool lived in devDependencies at the root.
Locally, things were smooth:
- Developers used tools like tsc, eslint, and cli directly from any workspace. yarn install hoisted dependencies and created symlinks under node_modules/.bin/.
- Running tests and builds from any subproject Just Worked™.
But then we shipped the first CI setup using GitHub Actions. And things broke — hard.
The CI Workflow: A Basic Start
Here’s the initial GitHub Actions workflow for CI:
name: Build & Test

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 18
      - name: Install dependencies
        run: yarn install --frozen-lockfile
      - name: Run tests
        run: yarn workspace web test
We assumed this would mimic local development.
Instead, the CI blew up:
error Command failed with exit code 127.
sh: tsc: command not found
This error appeared for various binaries (tsc, eslint, cli, etc.) — all of which worked locally.
The Investigation Begins
At first, we suspected something wrong with yarn install. But digging deeper, we realized:
- The error wasn’t about missing packages.
- It was about missing binaries — those normally exposed in node_modules/.bin.
To confirm, we added a debug step to inspect the $PATH:
- run: echo $PATH
The result?
/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
No node_modules/.bin in sight.
Meanwhile, tsc, eslint, etc., were symlinked correctly under:
./node_modules/.bin/
So why were they invisible?
Root Cause: PATH Assumptions and Subshells
There were two main problems, both related to assumptions developers often make when working locally.
1. GitHub Actions Runs Commands in Fresh Subshells
Each run: block in GitHub Actions executes in its own non-interactive shell. That means:
- Your PATH isn’t persistent across steps unless you persist it explicitly (via $GITHUB_PATH or $GITHUB_ENV).
- Changes to the environment in one step don’t survive into the next.
Unlike your terminal, which remembers when you export PATH, CI is stateless unless told otherwise.
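Here’s a minimal illustration (a sketch, not our real pipeline) of the difference between exporting PATH inside a step and persisting it with GitHub Actions’ $GITHUB_PATH file:
steps:
  - name: Export PATH the "local" way
    # Only affects this step's shell; the change is gone when the step ends
    run: export PATH="$PWD/node_modules/.bin:$PATH"
  - name: The next step does not see it
    run: which tsc || echo "tsc is not on PATH here"
  - name: Persist it the Actions way
    # Appending a directory to $GITHUB_PATH prepends it to PATH for all later steps
    run: echo "$PWD/node_modules/.bin" >> "$GITHUB_PATH"
  - name: Later steps now see it
    run: which tsc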
2. Monorepo Layout Obscures Context
Within a monorepo, binaries often exist only at the root. If you cd into a workspace folder like packages/web, node_modules/.bin is not there — unless the workspace has its own install (which ours didn’t, by design).
So when CI moved into packages/web to run tests, it no longer had access to shared binaries unless they were explicitly referenced.
Testing Hypotheses
We tried changing the working directory in the run block:
- name: Debug tsc
  working-directory: packages/web
  run: |
    ls ../../node_modules/.bin
    which tsc
Still:
which: no tsc in (...)
tsc was present at ../../node_modules/.bin/tsc, but it wasn’t in the $PATH.
We even tried calling the binary directly:
- run: ../../node_modules/.bin/tsc
This worked — but hardcoding paths wasn’t sustainable across multiple workspaces and environments.
The Fix: Explicitly Extend PATH
The most portable fix was to add the workspace’s node_modules/.bin directory to PATH using GitHub Actions’ environment file mechanism ($GITHUB_PATH):
- name: Add node_modules/.bin to PATH
  run: echo "${{ github.workspace }}/node_modules/.bin" >> $GITHUB_PATH
Now, every shell in the job has access to binaries like tsc, eslint, and cli — no matter which directory we run from.
We also simplified test calls by running them from the root, not per workspace:
- run: yarn workspace web test
This ensured the scripts inherited the correct environment and binary access.
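The same trick works for one-off binaries, not just package scripts. A sketch, assuming Yarn classic workspaces: Yarn puts the repo root’s node_modules/.bin on PATH whenever it runs a workspace script or binary, so hoisted tools resolve without any path juggling.
- name: Type-check the web workspace via Yarn
  # Yarn resolves tsc from the hoisted root node_modules/.bin,
  # even though packages/web has no local copy of it
  run: yarn workspace web run tsc --noEmit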
What We Learned
- Your local terminal is lying to you. Tools like VS Code, direnv, and your shell profile (.zshrc, .bashrc) help you without you realizing it. CI is a clean room — it only runs what you tell it.
- Monorepos introduce binary scope issues. Binaries installed at the root may not be visible in subfolders unless the path is adjusted.
- GitHub Actions doesn’t persist shell environments across steps. To make your pipeline reproducible, always treat each run: step as a new shell.
- Explicit is better than implicit. Add node_modules/.bin to the PATH. Prefer fully qualified script calls. Avoid relying on magical behavior from yarn.
Final Working CI Snippet
Here’s the corrected and robust version of the CI workflow:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v4
        with:
          node-version: 18
      - run: yarn install --frozen-lockfile
      - name: Add node_modules/.bin to PATH
        run: echo "${{ github.workspace }}/node_modules/.bin" >> $GITHUB_PATH
      - name: Run tests
        run: yarn workspace web test
TL;DR
If your monorepo CI pipeline fails with “command not found” for a binary you know is installed — check your PATH. CI isn’t ignoring your code. It’s just following exactly what you told it to do.
🐍 A Python App Where a Test Failed Only in the Cloud
…and How We Discovered the OS Knew More Than We Did
The Setup
We had a small but growing Python web app that served as an internal API for reporting and metrics. The stack was lean:
- Flask for HTTP routing
- SQLAlchemy for the ORM
- pytest for tests
- GitHub Actions for CI
Developers ran the full test suite locally before pushing to main. CI validated every pull request and ensured quality before merging.
Everything worked.
Until one day, a contributor pushed a simple patch — a few lines touching a timestamp formatting function — and CI failed:
=================================== FAILURES ===================================
___________ test_format_timestamp_with_tz[test_input_1-expected_output_1] ______
> assert format_timestamp("2023-10-15T13:45:30Z") == "2023-10-15 13:45 UTC"
E AssertionError: assert '2023-10-15 15:45 UTC' == '2023-10-15 13:45 UTC'
Locally? All green.
“It Works on My Machine”
The developer confirmed:
pytest tests/test_time_utils.py
Passed on:
- macOS
- Windows WSL
- Ubuntu 22.04 (Docker)
But in CI — ubuntu-latest on GitHub Actions — the test failed. Every time.
The Test in Question
The failing test checked a simple utility:
from datetime import datetime

def format_timestamp(iso_str):
    dt = datetime.fromisoformat(iso_str.replace("Z", "+00:00"))
    return dt.strftime("%Y-%m-%d %H:%M UTC")
And the test:
@pytest.mark.parametrize("test_input,expected", [
    ("2023-10-15T13:45:30Z", "2023-10-15 13:45 UTC"),
])
def test_format_timestamp_with_tz(test_input, expected):
    assert format_timestamp(test_input) == expected
So… why was CI formatting the time as 15:45, not 13:45?
The Deep Dive
We added a debug print:
print(repr(dt)) # datetime object
In CI:
datetime.datetime(2023, 10, 15, 15, 45, 30, tzinfo=datetime.timezone.utc)
Wait — 15:45?
Locally, the exact same line produced:
datetime.datetime(2023, 10, 15, 13, 45, 30, tzinfo=datetime.timezone.utc)
Same code. Same input string. Different result.
We added more debug output in CI:
import os, time
print("TZ ENV:", os.environ.get("TZ"))
print("System TZ:", time.tzname)
Output:
TZ ENV: None
System TZ: ('UTC', 'UTC')
Which… made it weirder. CI claimed it was in UTC. But the timestamp result was 2 hours ahead.
The Root Cause: A Library’s Dirty Secret
We eventually isolated the issue:
datetime.fromisoformat was behaving inconsistently across platforms.
We confirmed:
dt = datetime.fromisoformat("2023-10-15T13:45:30+00:00")
Was interpreted correctly in all environments, but:
dt = datetime.fromisoformat("2023-10-15T13:45:30Z")
Was silently misparsed on some OS/Python combinations, including:
- Ubuntu 22.04
- Python 3.10.12 (used by ubuntu-latest)
- CI environments without explicit locale/timezone setups
Why?
Because fromisoformat() does not support the “Z” suffix in Python <3.11. It’s a known quirk:
- Z is short for “Zulu time” (UTC)
- Python’s fromisoformat() only gained support for it in Python 3.11
In our function, we manually replaced "Z" with "+00:00", but in CI the string must have already been altered by a dependency, or parsed inconsistently. Possibly it was even influenced by a lower-level locale or dateutil conflict.
Bottom line: you can’t trust fromisoformat("...Z") before 3.11.
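If you want to see the version dependence in isolation, here is a minimal check (a sketch; the exact error text varies by interpreter version):
import sys
from datetime import datetime

iso = "2023-10-15T13:45:30Z"
try:
    # Accepted on Python 3.11+, where fromisoformat() learned the "Z" suffix
    print(sys.version_info[:2], "->", repr(datetime.fromisoformat(iso)))
except ValueError as exc:
    # Rejected outright on Python 3.10 and earlier
    print(sys.version_info[:2], "->", exc)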
The Fix
We swapped out fromisoformat() entirely and used dateutil.parser, which handles Z and other ISO formats properly:
from dateutil import parser

def format_timestamp(iso_str):
    dt = parser.isoparse(iso_str)
    return dt.strftime("%Y-%m-%d %H:%M UTC")
And pinned it in requirements.txt:
python-dateutil>=2.8.2
Now the result was consistent across:
- Local environments
- Docker
- GitHub Actions CI
- macOS vs Linux
Tests passed. CI green.
What We Learned
- Python standard library support for date formats is fragmented. datetime.fromisoformat() seems robust — until it quietly fails on “Z” time. You won’t notice until your app hits CI or production.
- Dependencies can alter behavior in invisible ways. An innocent helper may mutate or sanitize strings before you parse them, changing how you interpret timezone offsets.
- CI catches edge cases local machines hide. Your laptop has locale settings, system libraries, and Python patches that may not exist in the cloud. Let CI test the real world.
- Reproducibility needs isolation. Once this failed, we moved all test runs to Docker with fixed versions and timezone configs — even locally — to prevent silent mismatches.
Final CI Fix (Plus Docker)
We added a reproducible test stage with timezone lock:
jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: python:3.10-slim
      env:
        TZ: UTC
    steps:
      - uses: actions/checkout@v3
      - run: pip install -r requirements.txt
      - run: pytest
This locked the timezone to UTC at the container level — and ensured the Python runtime acted predictably.
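For runs outside that container, a small autouse fixture in conftest.py makes a decent safety net (a sketch of one option; the fixture name is ours, and time.tzset() is Unix-only):
import time
import pytest

@pytest.fixture(autouse=True)
def force_utc(monkeypatch):
    # Pin the process timezone to UTC, mirroring the TZ=UTC env on the CI container,
    # so strftime()/localtime() behave the same on every developer machine
    monkeypatch.setenv("TZ", "UTC")
    time.tzset()  # make the C library re-read TZ
    yield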
TL;DR
Your local environment is lying to you — again. Timezones, locales, and Python quirks make CI an essential gate for catching subtle platform bugs.
If your Python test passes on your machine but fails in CI, look at:
- System timezones
- Python versions
- Locale / encoding configs
- And especially: how you parse dates
🐘 A PHP Microservice Where CI Skipped a Critical Script Step
…and the Deployment That Quietly Broke Everything
The Background
We had a lightweight PHP microservice responsible for validating user-submitted data before forwarding it to our backend system. It didn’t do anything too fancy — just some symfony/validator rules, logging, and response shaping.
CI/CD was managed through GitHub Actions, with a workflow file that handled:
- Linting via phpcs
- Running PHPUnit tests
- Packaging the service into a Docker container
- Pushing the image to AWS ECR
We’d deployed this microservice dozens of times without issue. But then, one Friday afternoon…
A Silent Breakage
An engineer added a new validation rule and updated the unit tests. Locally:
composer test
✅ All tests passed.
✅ CI. All green.
✅ Deployment to staging. No errors in the pipeline.
But QA flagged an issue:
“The validation step isn’t triggering anymore. We can submit obviously invalid data, and it goes through just fine.”
Our validation microservice was accepting nonsense — empty strings, negative values, invalid email formats — all passed through like water through a sieve.
Reproducing Locally
On local machines:
composer install
composer test
php -S localhost:8000 -t public
Everything behaved as expected.
The new validation rule rejected bad input. It wasn’t a logic bug.
But the deployed version was doing… nothing.
The Mystery
We SSH’d into the ECS container.
Checked the PHP version: ✅
Checked the container’s environment: ✅
Pulled up logs: ✅
No errors.
So we added logging inside the validator service:
file_put_contents('/tmp/validate.log', json_encode($input));
Nothing was being logged.
We confirmed: the validator class was never being invoked. It was being autoloaded, but its constructor wasn’t running.
That led us back to one of the most innocuous files in PHP apps: composer.json.
The Culprit
We noticed this in composer.json:
"scripts": {
"post-install-cmd": [
"@php bin/setup-validation-metadata"
]
}
That script populated metadata used by the validator — without it, the system had no rules to run.
Locally, this worked because developers always ran:
composer install
Which triggered post-install-cmd.
But in CI, our Dockerfile looked like this:
COPY . /var/www/
RUN composer install --no-dev --no-scripts
💥 --no-scripts was skipping the critical setup step — silently.
There were no warnings. No errors. Just… skipped behavior. And since the validator didn’t throw errors when misconfigured, we didn’t notice until QA found it by hand.
The Fix
We adjusted the Dockerfile:
RUN composer install --no-dev && \
composer run post-install-cmd
Alternatively, you can keep --no-scripts for the install itself and invoke the hook explicitly right after:
composer install --no-dev --no-scripts
composer run-script post-install-cmd
This ensured our validation config was rebuilt every time — just like in local dev.
The Follow-up
To prevent this type of bug from going unnoticed again, we added:
- A smoke test to assert that validation rules were loaded:
  $this->assertNotEmpty($validator->getMetadataFor(MyRequestDto::class)->getConstraints());
- An end-to-end test that actually hit the service over HTTP in CI with bad data.
- A linting rule in CI to warn if composer.json has post-install scripts and CI uses --no-scripts (a sketch of such a guard follows this list).
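That last guard is nothing fancy, roughly a grep in a workflow step. A minimal sketch (the wording and file paths are ours):
- name: Guard against skipped Composer scripts
  run: |
    # Fail if composer.json declares install hooks but the Dockerfile installs
    # with --no-scripts and never runs the hook explicitly afterwards
    if grep -q '"post-install-cmd"' composer.json \
       && grep -q -- '--no-scripts' Dockerfile \
       && ! grep -q 'composer run' Dockerfile; then
      echo "::error::composer post-install-cmd hooks are skipped in the Docker build"
      exit 1
    fi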
Lessons Learned
- CI scripts may silently skip critical project logic. Flags like --no-scripts are great for speed — but dangerous if you rely on Composer hooks.
- What CI skips, it also skips silently. Unlike failed tests or missing files, Composer won’t yell if your scripts are skipped.
- Validate your assumptions, not just your code. If your code relies on setup steps, test that they ran — not just that "no errors" occurred.
- Production != dev. Especially in Docker-based CI, every flag or layer-cache decision can affect runtime behavior.
Final GitHub Actions Fix
In our ci.yml, we updated the build stage:
- name: Install dependencies
  run: composer install --no-dev
- name: Run validation metadata script
  run: composer run post-install-cmd
And added a CI test:
- name: Smoke test validator
  run: php tests/ValidateMetadataSmokeTest.php
TL;DR
Your CI pipeline might run green while skipping half the setup — and PHP won’t raise its voice about it.
Always check:
- Do you rely on composer scripts?
- Are they skipped in --no-scripts mode?
- Do your tests actually hit runtime behavior, or just isolated logic?
CI can pass while your app does nothing at all.
🦫 A Go Project Where a Race Condition Was Invisible Locally
…Until CI Flushed It Out Under Pressure
The Setup
We had a backend Go microservice responsible for aggregating metrics from various upstream APIs. The service polled data every few seconds, cached responses, and exposed a JSON endpoint for downstream consumers.
Architecture-wise:
- Scheduled fetch via time.Ticker
- Caching in a shared in-memory map
- Served via net/http
- Tested with Go’s built-in testing package
- Built and deployed via GitHub Actions
It was fast. It was reliable. It was… not as thread-safe as we thought.
The CI Bug That Shouldn’t Exist
A team member submitted a pull request that reorganized how we fetched and cached data:
var dataCache = make(map[string]MetricData)

func updateMetrics() {
    resp := fetchRemoteMetrics()
    for key, value := range resp {
        dataCache[key] = value
    }
}
Tests passed locally. Coverage was high. CI ran all tests — and intermittently failed with errors like:
fatal error: concurrent map writes
Sometimes CI was green. Sometimes it crashed. Even rerunning the same commit would randomly fail.
Locally, it was solid as a rock.
The Confusion
Why was CI catching errors that never happened locally?
We tried:
go test -race ./...
Locally, all tests passed — even with -race.
CI? Still threw the concurrent map writes panic occasionally, but not every run.
That’s when we noticed something subtle:
- CI machines used 4+ vCPUs
- Devs typically ran on 1-2 cores
The concurrency window was simply wider in CI. Race conditions that were possible locally were just too rare to manifest.
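You can widen that window on a laptop too. A quick sketch using standard go test flags:
# Re-run the suite many times with the race detector and more OS threads,
# giving rare interleavings a chance to show up locally
GOMAXPROCS=8 go test -race -count=20 ./...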
The Offending Code
Here’s the simplified version of what we had:
var dataCache = map[string]MetricData{}

func fetchAndUpdate() {
    newData := fetchRemoteMetrics()
    for k, v := range newData {
        dataCache[k] = v
    }
}

func metricsHandler(w http.ResponseWriter, r *http.Request) {
    json.NewEncoder(w).Encode(dataCache)
}
At the same time:
- A background goroutine called fetchAndUpdate() every 5 seconds.
- HTTP handlers read from dataCache.
The result? Concurrent reads and writes to a plain map.
Go maps are not thread-safe. Ever.
But since most machines don’t hit both read and write in the same nanosecond, we got away with it — until CI ran things fast enough to expose it.
The Fix
We switched to sync.Map — Go’s concurrency-safe map type.
var dataCache sync.Map

func fetchAndUpdate() {
    newData := fetchRemoteMetrics()
    for k, v := range newData {
        dataCache.Store(k, v)
    }
}

func metricsHandler(w http.ResponseWriter, r *http.Request) {
    cache := make(map[string]MetricData)
    dataCache.Range(func(key, value any) bool {
        cache[key.(string)] = value.(MetricData)
        return true
    })
    json.NewEncoder(w).Encode(cache)
}
Now all concurrent access was safe, and we no longer relied on race-prone native maps.
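An equally valid alternative (a sketch built on the same MetricData and fetchRemoteMetrics as above, not the code we shipped) is to keep a plain map and guard it with a sync.RWMutex:
import (
    "encoding/json"
    "net/http"
    "sync"
)

var (
    cacheMu   sync.RWMutex
    dataCache = map[string]MetricData{}
)

func fetchAndUpdate() {
    newData := fetchRemoteMetrics()
    cacheMu.Lock() // exclusive lock while writing
    for k, v := range newData {
        dataCache[k] = v
    }
    cacheMu.Unlock()
}

func metricsHandler(w http.ResponseWriter, r *http.Request) {
    cacheMu.RLock() // shared lock while reading
    snapshot := make(map[string]MetricData, len(dataCache))
    for k, v := range dataCache {
        snapshot[k] = v
    }
    cacheMu.RUnlock()
    json.NewEncoder(w).Encode(snapshot)
}
The trade-off: a little locking boilerplate in exchange for a strongly typed map, with no interface assertions in the handler.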
CI Defense
We added a dedicated CI step:
- name: Run tests with race detector
  run: go test -race ./...
We also updated local tooling:
alias gotest='go test -race ./...'
And added this to the Makefile to enforce dev use:
test:
	go test -race ./...
What We Learned
- Race conditions may exist locally but only trigger under CI conditions. More cores = more parallelism = more chaos = more bugs.
- Go’s map is a time bomb under concurrent access. Even if you’re only writing once every few seconds, you must guard map access when reads and writes happen in parallel.
- The -race flag is essential — but not sufficient alone. Use it with parallel tests, stress tests, and in environments that actually mimic real deployment concurrency.
- Don’t just rely on unit tests — test usage under load (a stress-test sketch follows this list). In Go, even a tiny web service needs concurrency safety from day one.
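That stress test is cheap to write. A sketch (requires net/http, net/http/httptest, sync, and testing; in the real test, fetchRemoteMetrics is stubbed so no network is involved):
func TestCacheConcurrentAccess(t *testing.T) {
    var wg sync.WaitGroup
    for i := 0; i < 50; i++ {
        wg.Add(2)
        go func() { // writer: refresh the cache
            defer wg.Done()
            fetchAndUpdate()
        }()
        go func() { // reader: hit the handler like a real client
            defer wg.Done()
            req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
            metricsHandler(httptest.NewRecorder(), req)
        }()
    }
    wg.Wait() // go test -race flags any unsynchronized access triggered above
}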
TL;DR
If your Go code uses a map, and goroutines can possibly read/write it at the same time, you’re living dangerously.
CI exposed a race condition that our local machines didn’t. Why?
- Local: 1-2 cores, low parallelism
- CI: multiple cores, full CPU usage, triggers bugs faster
Fix:
- Replace map with sync.Map (or use sync.RWMutex)
- Always test with go test -race
- Add concurrent access tests in CI to simulate real usage
🐳 A Sneaky Docker Cache Issue That Served Outdated Code with a Smile
…and How CI Deployed an Old Binary While Saying “Everything’s Fine”
The Background
We had a simple Go service containerized with Docker. The app built a binary, packaged it into a small Alpine image, and exposed /health and /version endpoints.
Here’s our Dockerfile:
FROM golang:1.20 AS builder
WORKDIR /app
COPY . .
RUN go build -o server .
FROM alpine:3.18
WORKDIR /app
COPY --from=builder /app/server .
CMD ["./server"]
Our GitHub Actions workflow looked like:
- name: Build Docker image
  run: docker build -t my-app:${{ github.sha }} .
- name: Push Docker image
  run: docker push my-app:${{ github.sha }}
- name: Deploy to staging
  run: some-deployment-script.sh
Each push triggered a new CI run, built a Docker image, and deployed to staging.
And for months — it worked flawlessly.
Until one week, it didn’t.
The Incident
A small refactor added a new /metrics endpoint.
The PR was reviewed. Merged. CI was green.
Staging was updated. QA pinged the new endpoint:
curl https://staging.example.com/metrics
And got:
404 page not found
They checked the /version endpoint. It responded with the previous commit hash.
The deployment had gone through — but the new binary hadn’t made it in.
CI logs claimed success. Docker image pushed. Deployment complete.
But the container was serving yesterday’s binary.
The Investigation
We pulled the image directly from the registry:
docker run --rm my-app:<latest-sha> /app/server --version
Still showed the old SHA.
So we re-ran the build job with --no-cache:
docker build --no-cache -t my-app:debug .
Now it showed the correct version.
Lightbulb moment.
We’d been using Docker layer cache — and it had silently re-used an outdated build step.
The Root Cause
In these lines:
COPY . .
RUN go build -o server .
The COPY . . step didn’t trigger a cache bust unless file timestamps or content changed — and in our case, Docker thought nothing had changed.
Here’s what happened:
- Developer made changes
- Git added the changes
- But the Docker layer cache, running on CI runners, still had an old COPY layer cached due to persistent volume caching
- go build ran on unchanged files
- Result: old binary copied into the final image, and no one noticed
CI passed. Tests passed. Even image push succeeded.
Just… with the wrong code inside.
The Fix
We modified the Dockerfile to ensure cache busting on the actual source code:
Solution A: Use checksums as cache key
COPY go.mod ./
COPY go.sum ./
RUN go mod download
COPY . .
RUN go build -o server .
This at least makes changes in go.mod invalidate the cache before the COPY . . step.
But more robust:
Solution B: Force cache busting with --no-cache
In CI:
- name: Build Docker image (no cache)
  run: docker build --no-cache -t my-app:${{ github.sha }} .
Or:
Solution C: Use content hashing as build arg
ARG CACHEBUST=1
# The build arg must actually be used for a new value to invalidate the cache from here on
RUN echo "cache bust: ${CACHEBUST}"
COPY . .
RUN go build -o server .
Then in CI:
- run: docker build --build-arg CACHEBUST=$(date +%s) -t my-app:${{ github.sha }} .
This ensures a fresh build every time, even if Docker’s cache thinks otherwise.
CI/CD Improvements
To prevent this type of issue again:
- We added a /buildinfo endpoint that exposed:
  - Git SHA
  - Build time
  - Docker image tag
- We diff the deployed hash against the latest commit hash in post-deploy validation (a sketch of that check follows this list)
- We disabled persistent layer caching in CI runners unless explicitly enabled
- We lint for COPY . . as an anti-pattern in large multi-stage builds
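The post-deploy check from the second bullet boils down to one step. A sketch, assuming /buildinfo returns JSON with a git_sha field and that jq is available on the runner:
- name: Verify staging serves this commit
  run: |
    deployed_sha=$(curl -fsS https://staging.example.com/buildinfo | jq -r '.git_sha')
    if [ "$deployed_sha" != "$GITHUB_SHA" ]; then
      echo "::error::staging is serving $deployed_sha, expected $GITHUB_SHA"
      exit 1
    fi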
Lessons Learned
- Docker is fast because it assumes your files didn’t change. If its cache logic says “same inputs”, it skips steps — even if you’re relying on side effects like fresh binaries.
- CI/CD isn’t just about green lights. It’s about verifying that what you built is what you deployed.
- Reproducibility ≠ correctness. You might be consistently building the wrong image — and not even know it.
- Don’t trust the Docker layer cache in critical production builds. Use --no-cache or proper hashing to bust stale layers.
TL;DR
Docker silently served an old binary because we reused the build layer. CI passed. QA failed. Debugging took hours.
Fixes:
- Use --no-cache in CI builds
- Add /version or /buildinfo to your app
- Validate that the deployed hash matches the latest commit hash
- Avoid COPY . . without an explicit cache-busting strategy
🧭 CI/CD as a First-Class Citizen in Your Debugging Mindset
These stories highlight a simple but often neglected truth: your pipeline is part of your product.
It builds your code. It decides what gets tested. It decides what gets shipped. And it’s subject to all the same bugs, assumptions, and regressions as the code it delivers.
So the next time you debug a failure, ask not only:
- “What changed in the code?”
But also:
- “What changed in the pipeline?”
- “What assumptions am I making about CI behavior?”
- “What’s different between CI and local?”
Good CI pipelines are invisible when they work. Great engineers investigate them when they don’t.
#CICD #Troubleshooting #30DaysOfDevOps