Profiling
Node.js --prof, Go pprof, Linux perf, and flamegraphs — finding where the time actually goes before guessing at optimizations.
Real-World Analogy
A doctor ordering tests before prescribing: you don’t prescribe antibiotics before knowing whether the infection is bacterial. You don’t optimize a function before knowing it’s the bottleneck. Profiling is the test. “I think the problem is in the database layer” is a hypothesis; a flamegraph is evidence.
The Rule: Profile Before Optimizing
Guessing is expensive. The bottleneck is almost never where you expect:
- The function you think is slow is often called once; the real bottleneck is called 10,000 times
- The slow path is often in a library you didn’t write
- The problem is often memory pressure causing GC pauses, not the actual computation
Profile first. Always.
Node.js CPU Profiling
# Built-in V8 profiler — runs in production safely
node --prof server.js
# After collecting (run load: ab -n 10000 -c 100 http://localhost:3000/)
# Ctrl-C to stop, generates isolate-*.log
# Process the profile
node --prof-process isolate-*.log > profile.txt
cat profile.txt Output:
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 1.0% are not shown.
ticks parent name
12943 47.2% node:internal/buffer
8234 30.0% /app/src/serialization.js:45:serialize
3421 12.5% node:crypto 48% of CPU time in buffer operations → look at serialization code.
Flamegraph from V8
# 0x — better than raw --prof-process
npm install -g 0x
# Run with profiling
0x -o flamegraph.html -- node server.js
# Or collect against running process
0x --collect-only -o profile/ -- node server.js
# ... run load test ...
# Ctrl-C → generates flamegraph.html
open flamegraph.html The flamegraph shows call stacks. Wide bars = more CPU time. Tall stacks = deep call chains. You want to find the wide bars at the top — those are the actual CPU consumers.
Node.js Memory Profiling
// Heap snapshot — good for memory leaks
import v8 from 'v8';
import fs from 'fs';
// In production, expose via a protected endpoint
app.get('/debug/heap-snapshot', (req, res) => {
const filename = `/tmp/heap-${Date.now()}.heapsnapshot`;
const snapshot = v8.writeHeapSnapshot(filename);
res.download(filename);
}); Load in Chrome DevTools → Memory → Load snapshot. Look for objects that shouldn’t be alive, or that accumulate over time.
Detecting memory leaks in production:
import { setInterval } from 'timers';
// Log heap usage every 30 seconds
setInterval(() => {
const mem = process.memoryUsage();
log.info({
heapUsed: Math.round(mem.heapUsed / 1024 / 1024),
heapTotal: Math.round(mem.heapTotal / 1024 / 1024),
rss: Math.round(mem.rss / 1024 / 1024),
external: Math.round(mem.external / 1024 / 1024),
}, 'Memory usage');
}, 30_000); If heapUsed grows monotonically over hours without plateauing: memory leak.
Go pprof
Go has profiling built into the standard library:
import (
"net/http"
_ "net/http/pprof" // registers /debug/pprof/* handlers
"runtime"
)
func main() {
// Enable block and mutex profiling (disabled by default)
runtime.SetBlockProfileRate(1) // profile all blocking events
runtime.SetMutexProfileFraction(1)
// pprof HTTP server (separate from your app port — protect this!)
go func() {
log.Println(http.ListenAndServe("localhost:6060", nil))
}()
// ... your app
} # Collect 30s CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
# Inside pprof:
(pprof) top10 # top 10 functions by CPU time
(pprof) web # open flamegraph in browser (requires graphviz)
(pprof) list myFunc # annotated source for a function
# Heap profile
go tool pprof http://localhost:6060/debug/pprof/heap
(pprof) top10 -cum # cumulative allocations
# Goroutine profile — detect goroutine leaks
go tool pprof http://localhost:6060/debug/pprof/goroutine
(pprof) top # goroutines by count
# Block profile — where goroutines block waiting
go tool pprof http://localhost:6060/debug/pprof/block Go Flamegraph
# Using pprof's built-in HTTP server
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30
# Opens browser with flamegraph, call tree, top functions Benchmarks
Benchmarks in Go are first-class:
// order_test.go
func BenchmarkProcessOrder(b *testing.B) {
order := generateTestOrder()
b.ResetTimer() // don't count setup time
b.RunParallel(func(pb *testing.PB) { // parallel benchmark
for pb.Next() {
if err := processOrder(order); err != nil {
b.Fatal(err)
}
}
})
} # Run benchmark with CPU and memory profiling
go test -bench=BenchmarkProcessOrder -benchmem \
-cpuprofile=cpu.prof \
-memprofile=mem.prof \
-benchtime=10s
# Output:
# BenchmarkProcessOrder-8 234156 5124 ns/op 1024 B/op 12 allocs/op
# Analyze
go tool pprof cpu.prof allocs/op is critical — allocations trigger GC. Reduce allocations to reduce GC pressure.
Linux perf
For system-level profiling (C extensions, JVM internals, kernel calls):
# CPU profile for 30 seconds
perf record -g -F 99 -p $(pgrep node) -- sleep 30
perf report # TUI report
perf report --stdio # text output
# Flamegraph from perf
perf script | \
stackcollapse-perf.pl | \
flamegraph.pl > flamegraph.svg -g enables call graph (stack traces). -F 99 = 99 samples/sec (avoids interference with 100Hz timer).
Useful perf commands:
# What system calls is the process making?
strace -p <pid> -c # count system calls
# Is the process blocked on I/O?
perf stat -e 'block:*' -p <pid>
# Cache misses (memory bottleneck)
perf stat -e cache-misses,cache-references,instructions,cycles -p <pid>
# System-wide top (like top but with CPU cycles)
perf top Async Performance in Node.js
The event loop is single-threaded. Blocking the event loop blocks all requests.
# Measure event loop lag (blocked = slow I/O or CPU)
npm install -g @nicolo-ribaudo/clinic
clinic doctor -- node server.js // Detect event loop lag in code
import { monitorEventLoopDelay } from 'perf_hooks';
const h = monitorEventLoopDelay({ resolution: 20 });
h.enable();
setInterval(() => {
log.info({
p50: h.percentile(50) / 1e6, // convert nanoseconds to ms
p99: h.percentile(99) / 1e6,
max: h.max / 1e6,
}, 'Event loop delay');
h.reset();
}, 10_000); P99 event loop delay > 100ms = something is blocking the loop. Common culprits:
- JSON.parse on large payloads (synchronous, blocking)
- Crypto operations (use
crypto.subtleasync or worker threads) - Large array sorts or regex on big strings
- Synchronous file system calls (
fs.readFileSync)
Offload CPU work:
import { Worker, isMainThread, parentPort, workerData } from 'worker_threads';
// Worker thread for CPU-intensive work
if (!isMainThread) {
const result = heavyCpuWork(workerData.input);
parentPort!.postMessage(result);
process.exit(0);
}
function runInWorker(input: any): Promise<any> {
return new Promise((resolve, reject) => {
const worker = new Worker(__filename, { workerData: { input } });
worker.on('message', resolve);
worker.on('error', reject);
});
} Profiling in Production
Profiling in production is different from dev — you need low overhead:
# Node.js: continuous profiling with 0% overhead using V8's sampling profiler
# (sampling at 1ms intervals — negligible overhead)
node --prof server.js &
kill -USR2 $(cat server.pid) # dump profile without stopping process Pyroscope — continuous profiling service (open source):
import Pyroscope from '@pyroscope/nodejs';
Pyroscope.init({
serverAddress: 'http://pyroscope:4040',
appName: 'order-service',
});
Pyroscope.start(); Pyroscope samples CPU at 100Hz continuously, aggregates, and lets you query “what was the CPU doing between 14:00 and 14:05 yesterday?” — invaluable for post-incident analysis.