← Containers · advanced · 10 min · 05 / 06 বাংলা

Container Security

Attack surface, capabilities, seccomp, read-only filesystems, and what actually matters for production hardening.

container securitycapabilitiesseccompread-onlyrootlesssupply chain

Real-World Analogy

Defense in depth for a museum: the front door has guards, each wing has alarms, display cases are locked, and the most valuable items are in a vault. No single measure is sufficient — you layer them so that breaching one layer doesn’t give access to everything. Container security is the same: non-root + read-only + capability drop + seccomp, together.

The Threat Model

What you’re defending against in a containerized environment:

Compromised application code — an attacker exploits a bug in your app and gets RCE inside the container
Compromised base image — a malicious or vulnerable upstream image
Container escape — exploiting a kernel or runtime vulnerability to escape the container and reach the host
Privilege escalation — a process inside the container gains capabilities it shouldn’t have
Supply chain attack — a dependency or base image is compromised after you build

Defense in depth: make each of these harder, accept you can’t make any of them impossible.

Non-Root by Default

Running as root inside a container is the most common mistake. If your app is compromised and running as root, the attacker has root — which makes container escapes and lateral movement much easier.

# Most official images have a non-root user — use it
FROM node:20-alpine
# 'node' user (UID 1000) already exists

WORKDIR /app
COPY --chown=node:node . .
RUN npm ci --omit=dev

USER node   # switch before CMD
CMD ["node", "server.js"]

# Verify
docker run --rm myapp id
# uid=1000(node) gid=1000(node) groups=1000(node)

# If a user doesn't exist in the base image, create one
RUN addgroup -g 1001 -S appgroup && \
    adduser -u 1001 -S appuser -G appgroup
USER appuser

Enforce non-root in Kubernetes:

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
  containers:
    - name: api
      securityContext:
        allowPrivilegeEscalation: false

Read-Only Root Filesystem

A read-only filesystem prevents an attacker from writing backdoors, modifying binaries, or installing tools:

# Run with read-only root filesystem
docker run --read-only myapp

# If your app needs to write (temp files, logs):
docker run --read-only \
  --tmpfs /tmp:rw,size=100m,noexec \    # RAM-backed writable /tmp
  --tmpfs /app/logs:rw,size=50m \       # writable log dir
  myapp

In Kubernetes:

containers:
  - name: api
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: tmp
        mountPath: /tmp
volumes:
  - name: tmp
    emptyDir:
      medium: Memory # RAM-backed
      sizeLimit: 100Mi

Most apps need to write somewhere. Audit what your app writes and make it explicit — readOnlyRootFilesystem forces you to enumerate writable paths rather than having the entire filesystem available.

Linux Capabilities

The root user has ~40 distinct capabilities (ability to bind port <1024, kill any process, load kernel modules, etc.). Running as root gives all of them. You can run as root but drop all non-essential capabilities:

# Drop all capabilities, add back only what's needed
docker run \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \    # allow binding port <1024
  myapp

# Common capabilities and when you need them:
# NET_BIND_SERVICE: bind ports < 1024 (prefer using port 3000+ instead)
# CHOWN: change file ownership (needed if app chowns files at startup)
# DAC_OVERRIDE: bypass file permission checks (rarely legitimate)
# SYS_PTRACE: debug other processes (debugging only, never in prod)

The right default: --cap-drop=ALL and add back only what testing proves is needed. Use port 3000+ to avoid needing NET_BIND_SERVICE.

In Kubernetes:

securityContext:
  capabilities:
    drop: [ALL]
    add: [] # empty — your app should run without any special capabilities

seccomp Profiles

seccomp (Secure Computing Mode) filters which syscalls a container can make. The default Docker seccomp profile blocks ~44 dangerous syscalls including ptrace, kexec_load, and mount.

# Default seccomp profile is already applied
docker run --security-opt seccomp=/etc/docker/seccomp.json myapp

# Disable seccomp (don't do this in production)
docker run --security-opt seccomp=unconfined myapp

# Custom profile: only allow syscalls your app actually uses
# Use strace to identify what your app calls, then write a minimal profile
strace -e trace=all -f node server.js 2>&1 | awk -F'(' '{print $1}' | sort -u

For most applications, the default Docker seccomp profile is sufficient. Creating a custom minimal profile for high-security workloads requires significant testing but dramatically reduces attack surface.

Image Supply Chain Security

Use specific digests, not tags:

# WRONG — 'latest' can change to anything
FROM node:20-alpine

# BETTER — specific tag (can still be overwritten)
FROM node:20.11.1-alpine3.19

# BEST — digest is immutable
FROM node:20.11.1-alpine3.19@sha256:bf77dc26e48ea95fca9d1aceb5acfa69d2e546b765ec2abfb502975f1a2d4def

Scan dependencies before build:

# Scan npm dependencies
npm audit --audit-level=high

# Scan Python
pip-audit

# Scan Go
govulncheck ./...

# In CI: fail on high severity
npm audit --audit-level=high --exit-code 1

SBOM (Software Bill of Materials):

# Generate SBOM for your image
syft myapp:latest -o spdx-json > sbom.json

# Verify image signature (if using cosign)
cosign verify ghcr.io/org/myapp:v1.0.0 --certificate-identity=...

Sign your images:

# Sign with cosign (keyless, using OIDC)
cosign sign ghcr.io/org/myapp:v1.0.0

# Verify before deploying
cosign verify ghcr.io/org/myapp:v1.0.0

Secrets Management

Never bake secrets into images:

# WRONG — secret in image layer forever
RUN curl -H "Authorization: Bearer $API_KEY" https://api.service/setup

# WRONG — build arg visible in image history
ARG API_KEY
RUN curl -H "Authorization: Bearer ${API_KEY}" ...

# RIGHT — BuildKit secret mounts (never stored in image)
# --secret id=mykey,src=./secret.txt
RUN --mount=type=secret,id=mykey \
    API_KEY=$(cat /run/secrets/mykey) \
    curl -H "Authorization: Bearer $API_KEY" ...

At runtime: inject secrets via environment variables from a secrets manager, not from .env files in the container:

# Kubernetes: secret from Vault or AWS Secrets Manager
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: password

Security Checklist

□ Run as non-root user (USER instruction + runAsNonRoot: true)
□ Read-only root filesystem + explicit writable tmpfs mounts
□ Drop all Linux capabilities (cap-drop: ALL)
□ Default seccomp profile enabled (Docker default, or custom)
□ No privileged mode (privileged: false)
□ No host network or host PID namespace sharing
□ Images pinned to digest in production
□ Vulnerability scanning in CI (Trivy, Grype)
□ npm audit / equivalent for dependency scanning
□ No secrets in Dockerfile, image layers, or build args
□ Secrets injected at runtime from a secrets manager
□ Network policies limit container-to-container traffic