1. Background overview
Building container images is the first step in moving an application onto containers. This section summarizes why image optimization matters.
As applications migrate to containerized deployment at scale and release cycles speed up, optimizing Docker images serves the following main purposes:
- Reduce image download time during deployment
- Improve security by reducing the attack surface
- Reduce recovery time
- Save storage overhead
2. Why are images so large?
A few typical repositories are briefly analyzed here, and the main reasons their Docker images are so large are summarized.
2.1 The base image is too large
Example: repository A, resulting image size 9.67 GB
Base image used: 8.72 GB
This raises the question of why the base image itself is so large. No further information was available.
2.2 The base image is too large, and missing
Example: repository B, resulting image size 22.7 GB
Base image used: 404 not found. Yes, the base image could no longer be found.
2.3 The .git directory (and other unnecessary directories)
More on this issue can be found in my previous article, Why is the Git directory so large
Example: repository C, code size 795 MB
Of this, the .git directory accounts for 225 MB. The Dockerfile contains the following instruction, which adds everything to the image:
ADD . /app/startapp/
The repository also contains a directory d of about 300 MB. Whether it is needed is unknown, but at a glance it is not: it appears to hold only test data.
d
├── [ 503] test_421.json
├── [ 483] test_havalB9.json
...
├── [ 484] test_144.json
├── [ 104] .gitmodules
├── [ 122] .idea
├── [   0] __init__.py
├── [ 11M] 164103.zip
├── [108M] test_180753.csv
├── [ 68M] test_180753.txt
...
└── [ 335] README.md
None of the above actually needs to go into the image.
2.4 The Dockerfile itself has other problems
Dockerfiles written by people without container expertise usually leave room for optimization; those details are not examined here.
For example, if the developers of each repository write their own Dockerfile without a common standard, it may not matter early on, but problems gradually emerge later.
As the saying goes, "as long as it works"~
3. How to optimize a Dockerfile
3.1 Where to start
Optimizing a Docker image should start from the concept of image layers
3.1.1 An example
A practical example
nginx:alpine image, 23.2MB
# docker history nginx:alpine
IMAGE          CREATED       CREATED BY                                        SIZE      COMMENT
b46db85084b8   9 days ago    /bin/sh -c #(nop) CMD ["nginx" "-g" "daemon...    0B
<missing>      9 days ago    /bin/sh -c #(nop) STOPSIGNAL SIGQUIT              0B
<missing>      9 days ago    /bin/sh -c #(nop) EXPOSE 80                       0B
<missing>      9 days ago    /bin/sh -c #(nop) ENTRYPOINT ["/docker-entr...    0B
<missing>      9 days ago    /bin/sh -c #(nop) COPY file:09a214a3e07c919a...   4.61kB
<missing>      9 days ago    /bin/sh -c #(nop) COPY file:0fd5fca330dcd6a7...   1.04kB
<missing>      9 days ago    /bin/sh -c #(nop) COPY file:0b866ff3fc1ef5b0...   1.96kB
<missing>      9 days ago    /bin/sh -c #(nop) COPY file:65504f71f5855ca0...   1.2kB
<missing>      9 days ago    /bin/sh -c set -x && addgroup -g 101 -S ...       17.6MB
<missing>      9 days ago    /bin/sh -c #(nop) ENV PKG_RELEASE=1               0B
<missing>      9 days ago    /bin/sh -c #(nop) ENV NJS_VERSION=0.7.0           0B
<missing>      9 days ago    /bin/sh -c #(nop) ENV NGINX_VERSION=1.21.4        0B
<missing>      9 days ago    /bin/sh -c #(nop) LABEL maintainer=NGINX Do...    0B
<missing>      10 days ago   /bin/sh -c #(nop) CMD ["/bin/sh"]                 0B
<missing>      10 days ago   /bin/sh -c #(nop) ADD file:762c899ec0505d1a3...   5.61MB
python:alpine image, 45.5MB
# docker history python:alpine
IMAGE          CREATED       CREATED BY                                        SIZE      COMMENT
382a63bb2f25   10 days ago   /bin/sh -c #(nop) CMD ["python3"]                 0B
<missing>      10 days ago   /bin/sh -c set -ex; wget -O get-pip.py "$P...     8.31MB
<missing>      10 days ago   /bin/sh -c #(nop) ENV PYTHON_GET_PIP_SHA256...    0B
<missing>      10 days ago   /bin/sh -c #(nop) ENV PYTHON_GET_PIP_URL=ht...    0B
<missing>      10 days ago   /bin/sh -c #(nop) ENV PYTHON_SETUPTOOLS_VER...    0B
<missing>      10 days ago   /bin/sh -c #(nop) ENV PYTHON_PIP_VERSION=21...    0B
<missing>      10 days ago   /bin/sh -c cd /usr/local/bin && ln -s idle3...    32B
<missing>      10 days ago   /bin/sh -c set -ex && apk add --no-cache --...    29.8MB
<missing>      10 days ago   /bin/sh -c #(nop) ENV PYTHON_VERSION=3.10.0       0B
<missing>      10 days ago   /bin/sh -c #(nop) ENV GPG_KEY=A035C8C19219B...    0B
<missing>      10 days ago   /bin/sh -c set -eux; apk add --no-cache c...      1.82MB
<missing>      10 days ago   /bin/sh -c #(nop) ENV LANG=C.UTF-8                0B
<missing>      10 days ago   /bin/sh -c #(nop) ENV PATH=/usr/local/bin:/...    0B
<missing>      10 days ago   /bin/sh -c #(nop) CMD ["/bin/sh"]                 0B
<missing>      10 days ago   /bin/sh -c #(nop) ADD file:762c899ec0505d1a3...   5.61MB
Actual Storage
# docker inspect nginx:alpine | jq '.[0]|{GraphDriver}'
{
  "GraphDriver": {
    "Data": {
      "LowerDir": "/data/docker-overlay2/overlay2/3d.../diff:/data/docker-overlay2/overlay2/ae.../diff:/data/docker-overlay2/overlay2/ea.../diff:/data/docker-overlay2/overlay2/29.../diff:/data/docker-overlay2/overlay2/5e.../diff",
      "MergedDir": "/data/docker-overlay2/overlay2/b7.../merged",
      "UpperDir": "/data/docker-overlay2/overlay2/b7.../diff",
      "WorkDir": "/data/docker-overlay2/overlay2/b7.../work"
    },
    "Name": "overlay2"
  }
}
The layer concept
An image solves the problem of packaging an application together with its environment. In practice, many applications are packaged and iterated on top of the same rootfs, so it would be wasteful for every image to carry its own complete copy. Docker implements layering through a storage driver such as AUFS, devicemapper, overlay, or overlay2
For example, inspecting the image above shows these directories
- LowerDir: the read-only image layers
- MergedDir: the unified view that merges the lower and upper layers
- UpperDir: the read-write layer
- WorkDir: an intermediate working directory; a write to the upper layer is staged in WorkDir first and then moved to UpperDir
3.1.2 Copy on write
When Docker first starts a container, the initial read-write layer is empty, and every change to the file system is applied to that layer. For example, to modify a file, Docker first copies it from the read-only layer underneath into the read-write layer. The read-only version of the file still exists in the read-only layer but is hidden by the copy in the read-write layer. This is called copy-on-write.
3.1.3 UnionFS
A union file system mounts the contents of multiple directories (also called branches) onto a single directory, while the contents remain physically separate
An intuitive effect: after pulling nginx:1.15, a subsequent pull of nginx:1.16 is much faster, because any layers the two images share are already present locally
3.2 Approaches
Once you understand what makes up image size, it is easy to see where to start reducing it
3.2.1 Reduce the number of image layers
The number of image layers grows mainly with the number of RUN instructions in the Dockerfile, so merging RUN instructions can significantly reduce the layer count
An example:
Before merging: three layers
RUN apk add tzdata
RUN cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
RUN echo "Asia/Shanghai" > /etc/timezone
After merging: one layer
RUN apk add tzdata \
 && cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
 && echo "Asia/Shanghai" > /etc/timezone
3.2.2 Reduce the size of each image layer
3.2.2.1 Choose a smaller base image
- scratch: an empty image, the ancestor of all images. Every image needs a base image, which raises a chicken-and-egg question: what is the "ancestor" of base images, and can you build without any base image at all? The answer is yes: use scratch. It is not covered in detail here. For reference: baseimages; the pause image is an example of an image built on scratch
- busybox: compared with scratch, adds commonly used Linux tools
- alpine: additionally provides package management through apk
3.2.2.2 Multi-stage builds
Multi-stage builds are well suited to compiled languages. Put simply, they allow multiple FROM instructions in one Dockerfile. Only the base image named in the last FROM instruction becomes the base of the final image; the other stages are just intermediate steps.
Combine FROM ... AS ... with COPY --from
For example, a Java image with an image size of 812MB
FROM centos AS jdk
COPY jdk-8u231-linux-x64.tar.gz /usr/local/src
RUN cd /usr/local/src && \
    tar -xzvf jdk-8u231-linux-x64.tar.gz -C /usr/local
Using a multi-stage build, the image size is 618MB
FROM centos AS jdk
COPY jdk-8u231-linux-x64.tar.gz /usr/local/src
RUN cd /usr/local/src && \
    tar -xzvf jdk-8u231-linux-x64.tar.gz -C /usr/local

FROM centos
# COPY copies the contents of a source directory, so name the target directory
# explicitly to keep the JDK under /usr/local/jdk1.8.0_231
COPY --from=jdk /usr/local/jdk1.8.0_231 /usr/local/jdk1.8.0_231
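The same pattern pairs naturally with the scratch base image for statically linked Go programs. The following is a minimal sketch, not taken from any of the repositories discussed here: it assumes a Go module whose main package sits at the root of the build context, and the stage name build and the paths /src and /out/app are illustrative choices.

```dockerfile
# Stage 1: compile with the full Go toolchain (hypothetical project layout).
FROM golang:1.17-alpine AS build
WORKDIR /src
# Copy the module files first so the dependency download layer stays cached
# until go.mod or go.sum change.
COPY go.* ./
RUN go mod download
COPY . .
# CGO_ENABLED=0 produces a statically linked binary that does not need libc.
RUN CGO_ENABLED=0 go build -o /out/app .

# Stage 2: ship only the binary; scratch contributes no files of its own.
FROM scratch
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
```

Because scratch contains no shell, CA certificates, or time zone data, anything of that kind the program needs has to be copied in explicitly.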
3.2.2.3 Ignore files
The build context is the set of files associated with the build you are running
It is the directory passed to docker build, usually the current working directory. By default, all files and directories in the context are sent to the Docker daemon as the build context, whether or not the build actually uses them.
When docker build starts, the console prints Sending build context to Docker daemon xxxMB, which shows that the files and directories in that directory are being sent as the build context
Separately, --no-cache can be passed to a package manager inside a RUN instruction (for example apk add --no-cache) so that no package cache is kept, or to docker build itself (docker build --no-cache) to build the image without using the layer cache
Within a build context, a .dockerignore file keeps local modules, debug logs, and similar files from being copied into the Docker image at build time, much like .gitignore in Git.
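As an illustration, a .dockerignore for a repository laid out like the third example in section 2.3 might look as follows. The entries are assumptions based on that listing (the d test-data directory, IDE settings, archives), so adjust them to what the build really needs.

```dockerignore
# .dockerignore (illustrative; entries modeled on the listing in section 2.3)
# Version-control metadata is not needed inside the image.
.git
.gitmodules
# IDE settings
.idea
# Test data directory, assumed not to be read by the application at run time
d
# Archives and logs anywhere in the tree
**/*.zip
**/*.log
```

With this file in place, an instruction such as ADD . /app/startapp/ no longer picks up the excluded paths, and the Sending build context line reports a correspondingly smaller size.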
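3.2.2.4 Remote download
Download and unpack the archive in a single RUN instruction instead of adding it with ADD and extracting it in a separate step, so the compressed archive is never stored in an image layer
RUN curl -fsSL http://192.168.1.1/repository/tools/jdk-8u241-linux-x64.tar.gz | tar -xz -C /opt/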
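3.2.2.5 Split COPY
For example, a COPY instruction copies directory A, which has four subdirectories AA, BB, CC, and DD, but only BB changes frequently
Splitting the COPY makes rebuilds faster in this case, because the layers for the unchanged directories come from the build cache. Every instruction after the first changed layer is rebuilt, so the frequently changing directory goes last
COPY A/AA /app/A/AA
COPY A/CC /app/A/CC
COPY A/DD /app/A/DD
COPY A/BB /app/A/BB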
3.2.2.6 Mount at build time
Build-time mounts (an extended feature)
Configuration
- Add --experimental to the Docker daemon startup parameters
- Add # syntax=docker/dockerfile:1.1.1-experimental as the first line of the Dockerfile
Usage
- Mount a local Go build cache
# syntax = docker/dockerfile:experimental
FROM golang
...
RUN --mount=type=cache,target=/root/.cache/go-build go build ...
- Mount the apt cache directories
# syntax = docker/dockerfile:experimental
FROM ubuntu
RUN rm -f /etc/apt/apt.conf.d/docker-clean; echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/keep-cache
RUN --mount=type=cache,target=/var/cache/apt --mount=type=cache,target=/var/lib/apt \
    apt update && apt install -y gcc
- Mount credentials as a secret
# syntax = docker/dockerfile:experimental
FROM python:3
RUN pip install awscli
RUN --mount=type=secret,id=aws,target=/root/.aws/credentials aws s3 cp s3://... ...
And so on.
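On more recent Docker releases with BuildKit enabled, RUN --mount is available in the stable Dockerfile syntax, so the experimental daemon flag should not be needed. A minimal sketch under that assumption, caching pip downloads across builds (requirements.txt is a hypothetical file in the build context):

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3
WORKDIR /usr/src/app
COPY requirements.txt ./
# The cache mount keeps pip's download cache between builds without
# storing it in any image layer.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
```

On versions where BuildKit is not the default builder, it can be enabled for a single build with DOCKER_BUILDKIT=1 docker build .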
3.2.2.7 Post-build cleanup
- Delete downloaded archives
- Clean up package-manager caches
  - apk: --no-cache
  - apt: rm -rf /var/lib/apt/lists/*
  - yum: rm -rf /var/cache/yum/*
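Cleanup only shrinks the image when it happens in the same RUN instruction that created the files, because a file deleted in a later layer still exists in the earlier one. A minimal sketch for a Debian-based image (the base tag and the curl package are only examples):

```dockerfile
FROM debian:bullseye-slim
# Install and clean up in one layer: the apt package lists never reach the image.
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl ca-certificates \
 && rm -rf /var/lib/apt/lists/*

# Anti-pattern, shown for contrast: the lists are already stored in the layer
# created by the first RUN, so the second RUN does not reduce the image size.
# RUN apt-get update && apt-get install -y curl
# RUN rm -rf /var/lib/apt/lists/*
```

On Alpine the equivalent is apk add --no-cache curl, which avoids writing the index cache in the first place.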
3.2.2.8 Image compression
Combining docker export and docker import flattens the image into a single layer (the size reduction is small)
The drawback of this method is that part of the image metadata is lost
# docker run -d --name nginx nginx:alpine
# docker export nginx | docker import - nginx:alpine2
sha256:dd6a3cf822ac3c3ad3e7f7b31675cd8cd99a6f80e360996e04da6fc2f3b98cb5
# docker history nginx:alpine
IMAGE          CREATED       CREATED BY                                        SIZE      COMMENT
b46db85084b8   10 days ago   /bin/sh -c #(nop) CMD ["nginx" "-g" "daemon...    0B
<missing>      10 days ago   /bin/sh -c #(nop) STOPSIGNAL SIGQUIT              0B
<missing>      10 days ago   /bin/sh -c #(nop) EXPOSE 80                       0B
<missing>      10 days ago   /bin/sh -c #(nop) ENTRYPOINT ["/docker-entr...    0B
<missing>      10 days ago   /bin/sh -c #(nop) COPY file:09a214a3e07c919a...   4.61kB
<missing>      10 days ago   /bin/sh -c #(nop) COPY file:0fd5fca330dcd6a7...   1.04kB
<missing>      10 days ago   /bin/sh -c #(nop) COPY file:0b866ff3fc1ef5b0...   1.96kB
<missing>      10 days ago   /bin/sh -c #(nop) COPY file:65504f71f5855ca0...   1.2kB
<missing>      10 days ago   /bin/sh -c set -x && addgroup -g 101 -S ...       17.6MB
<missing>      10 days ago   /bin/sh -c #(nop) ENV PKG_RELEASE=1               0B
<missing>      10 days ago   /bin/sh -c #(nop) ENV NJS_VERSION=0.7.0           0B
<missing>      10 days ago   /bin/sh -c #(nop) ENV NGINX_VERSION=1.21.4        0B
<missing>      10 days ago   /bin/sh -c #(nop) LABEL maintainer=NGINX Do...    0B
<missing>      10 days ago   /bin/sh -c #(nop) CMD ["/bin/sh"]                 0B
<missing>      10 days ago   /bin/sh -c #(nop) ADD file:762c899ec0505d1a3...   5.61MB
# docker history nginx:alpine2
IMAGE          CREATED          CREATED BY   SIZE   COMMENT
dd6a3cf822ac   40 seconds ago                23MB   Imported from -
# docker images | grep nginx
nginx   alpine2   dd6a3cf822ac   54 seconds ago   23MB
nginx   alpine    b46db85084b8   10 days ago      23.2MB
3.3 Samples
3.3.1 Go samples
Example 1
In a Kubernetes cluster installed with kubeadm, the kube-apiserver image is built with the Bazel build tool. Its build steps look like this
bazel build ...
LABEL maintainers=Kubernetes Authors
LABEL description=go based runner for distroless scenarios
WORKDIR /
COPY /workspace/go-runner . # buildkit
ENTRYPOINT ["/go-runner"]
COPY file:2e904ea733ba0ded2a99947847de31414a19d83f8495dd8c1fbed3c70bf67a22 in /usr/local/bin/kube-apiserver
Code directory: 28M (of which the .git directory is 20.5M)
Image size: 122 MB
Example 2
Dockerfile of Cadence, an open-source workflow orchestration engine
ARG TARGET=server

# Can be used in case a proxy is necessary
ARG GOPROXY

# Build tcheck binary
FROM golang:1.17-alpine3.13 AS tcheck
WORKDIR /go/src/github.com/uber/tcheck
COPY go.* ./
RUN go build -mod=readonly -o /go/bin/tcheck github.com/uber/tcheck

# Build Cadence binaries
FROM golang:1.17-alpine3.13 AS builder
ARG RELEASE_VERSION
RUN apk add --update --no-cache ca-certificates make git curl mercurial unzip
WORKDIR /cadence

# Making sure that dependency is not touched
ENV GOFLAGS="-mod=readonly"

# Copy go mod dependencies and build cache
COPY go.* ./
RUN go mod download
COPY . .
RUN rm -fr .bin .build
ENV CADENCE_RELEASE_VERSION=$RELEASE_VERSION

# bypass codegen, use committed files. must be run separately, before building things.
RUN make .fake-codegen
RUN CGO_ENABLED=0 make copyright cadence-cassandra-tool cadence-sql-tool cadence cadence-server cadence-bench cadence-canary

# Download dockerize
FROM alpine:3.11 AS dockerize
RUN apk add --no-cache openssl
ENV DOCKERIZE_VERSION v0.6.1
RUN wget https://github.com/jwilder/dockerize/releases/download/$DOCKERIZE_VERSION/dockerize-alpine-linux-amd64-$DOCKERIZE_VERSION.tar.gz \
    && tar -C /usr/local/bin -xzvf dockerize-alpine-linux-amd64-$DOCKERIZE_VERSION.tar.gz \
    && rm dockerize-alpine-linux-amd64-$DOCKERIZE_VERSION.tar.gz \
    && echo "**** fix for host id mapping error ****" \
    && chown root:root /usr/local/bin/dockerize

# Alpine base image
FROM alpine:3.11 AS alpine
RUN apk add --update --no-cache ca-certificates tzdata bash curl

# set up nsswitch.conf for Go's "netgo" implementation
# https://github.com/gliderlabs/docker-alpine/issues/367#issuecomment-424546457
RUN test ! -e /etc/nsswitch.conf && echo 'hosts: files dns' > /etc/nsswitch.conf
SHELL ["/bin/bash", "-c"]

# Cadence server
FROM alpine AS cadence-server
ENV CADENCE_HOME /etc/cadence
RUN mkdir -p /etc/cadence
COPY --from=tcheck /go/bin/tcheck /usr/local/bin
COPY --from=dockerize /usr/local/bin/dockerize /usr/local/bin
COPY --from=builder /cadence/cadence-cassandra-tool /usr/local/bin
COPY --from=builder /cadence/cadence-sql-tool /usr/local/bin
COPY --from=builder /cadence/cadence /usr/local/bin
COPY --from=builder /cadence/cadence-server /usr/local/bin
COPY --from=builder /cadence/schema /etc/cadence/schema
COPY docker/entrypoint.sh /docker-entrypoint.sh
COPY config/dynamicconfig /etc/cadence/config/dynamicconfig
COPY config/credentials /etc/cadence/config/credentials
COPY docker/config_template.yaml /etc/cadence/config
COPY docker/start-cadence.sh /start-cadence.sh
WORKDIR /etc/cadence
ENV SERVICES="history,matching,frontend,worker"
EXPOSE 7933 7934 7935 7939
ENTRYPOINT ["/docker-entrypoint.sh"]
CMD /start-cadence.sh

# All-in-one Cadence server
FROM cadence-server AS cadence-auto-setup
RUN apk add --update --no-cache ca-certificates py-pip mysql-client
RUN pip install cqlsh
COPY docker/start.sh /start.sh
CMD /start.sh

# Cadence CLI
FROM alpine AS cadence-cli
COPY --from=tcheck /go/bin/tcheck /usr/local/bin
COPY --from=builder /cadence/cadence /usr/local/bin
ENTRYPOINT ["cadence"]

# Cadence Canary
FROM alpine AS cadence-canary
COPY --from=builder /cadence/cadence-canary /usr/local/bin
COPY --from=builder /cadence/cadence /usr/local/bin
CMD ["/usr/local/bin/cadence-canary", "--root", "/etc/cadence-canary", "start"]

# Cadence Bench
FROM alpine AS cadence-bench
COPY --from=builder /cadence/cadence-bench /usr/local/bin
COPY --from=builder /cadence/cadence /usr/local/bin
CMD ["/usr/local/bin/cadence-bench", "--root", "/etc/cadence-bench", "start"]

# Final image
FROM cadence-${TARGET}
Code directory: 85.4M (of which the .git directory is 57.7M)
Image size: 135.69MB
3.3.2 Python sample
FROM python:3.4
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        postgresql-client \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]
Code directory: 275M (of which the .git directory is 222M)
Image size: 436MB
4. What else to do beyond these optimizations
4.1 Set the character set
Set a common locale in the Dockerfile
# Set lang
ENV LANG "en_US.UTF-8"
4.2 Time zone correction
More on this issue can be found in my previous article, Multiple Postures for Container Time Problem in k8s Environment
Set a common time zone in the Dockerfile
# Set timezone
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
 && echo "Asia/Shanghai" > /etc/timezone
4.3 Process management
When a Docker container runs, the ENTRYPOINT or CMD from the Dockerfile becomes the main process with PID 1. This process keeps the container alive: once it exits, the container exits.
Another important role of this main process is to manage zombie processes
A more formal definition of a zombie process: a process that has finished executing (through an exit system call, a fatal run-time error, or a termination signal) but still has its process control block in the operating system's process table and is in the "terminated" state.
The main ways to clean up zombie processes are
- Set the SIGCHLD handler in the parent process to SIG_IGN (ignore the signal);
- fork twice and kill the first-level child, so that the second-level child becomes an orphan that is "adopted" and reaped by init
Open-source solutions currently available
- Tini
Tini is a minimal init system that runs inside the container. It starts a child process, waits for it to exit, reaps zombies, and forwards signals
Advantages
- Tini prevents the application from leaving zombie processes behind
- Tini handles signals for the program running in the container: with Tini, SIGTERM terminates the process without requiring you to install an explicit signal handler
Example
# Add Tini
ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini
ENTRYPOINT ["/tini", "--"]

# Run your program under Tini
CMD ["/your/program", "-and", "-its", "arguments"]
# or docker run your-image /your/program ...
- dumb-init
dumb-init forwards the signals it receives to the process group of its child. This matters because bash, for example, does not pass the signals it receives on to its child processes
Setting the environment variable DUMB_INIT_SETSID=0 makes dumb-init signal only its direct child
In addition, dumb-init adopts processes that have lost their parent to make sure they exit cleanly
Example
FROM alpine:3.11.5
RUN sed -i "s/dl-cdn.alpinelinux.org/mirrors.aliyun.com/g" /etc/apk/repositories \
    && apk add --no-cache dumb-init

# Runs "/usr/bin/dumb-init -- /my/script --with --args"
ENTRYPOINT ["dumb-init", "--"]

# or if you use --rewrite or other cli flags
# ENTRYPOINT ["dumb-init", "--rewrite", "2:3", "--"]

CMD ["/my/script", "--with", "--args"]
4.4 Starting with reduced privileges
In many cases the process in a container should be started with reduced privileges for security, just as an nginx service on a VM is best run as a dedicated unprivileged user
Example: the tomcat image
...
USER tomcat
WORKDIR /usr/local/tomcat
EXPOSE 8080
ENTRYPOINT ["catalina.sh","run"]
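For an image that does not already ship a suitable account, a dedicated user can be created in the Dockerfile. A minimal sketch for an Alpine-based image; the user and group name app and the binary ./server are hypothetical:

```dockerfile
FROM alpine:3.11
# Create an unprivileged system group and user (BusyBox addgroup/adduser).
RUN addgroup -S app && adduser -S -G app app
# Copy the application owned by that user (./server is a placeholder).
COPY --chown=app:app ./server /usr/local/bin/server
# Everything from here on, including the container's main process, runs as app.
USER app
ENTRYPOINT ["/usr/local/bin/server"]
```

Placing USER after the installation steps keeps package installation running as root while the service itself does not.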
If root privileges are needed in some cases, the official Docker guidance is to avoid installing or using sudo, because it has unpredictable TTY and signal-forwarding behavior that can cause problems. If you must have something similar, for example to initialize the daemon as root but run it as a non-root user, gosu is recommended
For example, the official Postgres image uses the following script as its ENTRYPOINT
#!/bin/bash
set -e

if [ "$1" = 'postgres' ]; then
    chown -R postgres "$PGDATA"

    if [ -z "$(ls -A "$PGDATA")" ]; then
        gosu postgres initdb
    fi

    exec gosu postgres "$@"
fi

exec "$@"
4.5 Underlying library dependencies
Services often depend on underlying libraries. Take building a Java image on an alpine base image as an example
To stay small, Alpine leaves out much commonly used software. Running this JDK/JRE requires glibc, and ca-certificates has to be installed first as a prerequisite for installing glibc
Running a JDK 8 image built on Alpine shows that the JDK cannot be executed. The reason is that this Java build is linked against the GNU C Library (glibc), while Alpine is based on musl libc, so the glibc libraries have to be installed on Alpine
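One way to sidestep the glibc dependency, shown here as an alternative rather than the approach described above, is to install the OpenJDK build that Alpine packages against musl. A minimal sketch, assuming the openjdk8-jre package from the Alpine community repository suits the application; the JAVA_HOME path is the location that package is expected to install to:

```dockerfile
FROM alpine:3.11
# openjdk8-jre is built against musl, so no glibc compatibility layer is needed.
RUN apk add --no-cache openjdk8-jre
# The package installs under /usr/lib/jvm; add its bin directories to PATH.
ENV JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk
ENV PATH="$PATH:$JAVA_HOME/jre/bin:$JAVA_HOME/bin"
CMD ["java", "-version"]
```

If the application requires a vendor JDK built for glibc, the glibc route described on the Alpine wiki page listed in the references still applies.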
5. Summary
This article briefly analyzes the main reasons Docker images become so large. Based on production experience, it lists measures for reducing image size along with other commonly used practices. The technical material is fairly scattered and not everything is covered~
References
https://github.com/docker-library/official-images#init
https://wiki.alpinelinux.org/wiki/Running_glibc_programs