From ebff62f56bf4e34346b790c5d8571383e7d000e8 Mon Sep 17 00:00:00 2001
From: Evan Lezar
Date: Mon, 23 Oct 2023 13:35:19 +0200
Subject: [PATCH] Update nvidia-container-runtime README

Signed-off-by: Evan Lezar
---
 cmd/nvidia-container-runtime/README.md | 123 +++++++++++++++++++++++++
 1 file changed, 123 insertions(+)

diff --git a/cmd/nvidia-container-runtime/README.md b/cmd/nvidia-container-runtime/README.md
index 885eec12..374e391d 100644
--- a/cmd/nvidia-container-runtime/README.md
+++ b/cmd/nvidia-container-runtime/README.md
@@ -85,3 +85,126 @@ Alternatively the NVIDIA Container Runtime can be set as the default runtime for
     }
 }
 ```
+
+## Environment variables (OCI spec)
+
+Each environment variable maps to a command-line argument for `nvidia-container-cli` from [libnvidia-container](https://github.com/NVIDIA/libnvidia-container).
+These variables are already set in our [official CUDA images](https://hub.docker.com/r/nvidia/cuda/).
+
+### `NVIDIA_VISIBLE_DEVICES`
+This variable controls which GPUs will be made accessible inside the container.
+
+#### Possible values
+* `0,1,2`, `GPU-fef8089b` …: a comma-separated list of GPU UUID(s) or index(es).
+* `all`: all GPUs will be accessible; this is the default value in our container images.
+* `none`: no GPU will be accessible, but driver capabilities will be enabled.
+* `void` or *empty* or *unset*: `nvidia-container-runtime` will have the same behavior as `runc`.
+
+**Note**: When running on a MIG-capable device, the following values will also be available:
+* `0:0,0:1,1:0`, `MIG-GPU-fef8089b/0/1` …: a comma-separated list of MIG Device UUID(s) or index(es).
+
+Where the MIG device indices have the form `<GPU Device Index>:<MIG Device Index>` as seen in the example output:
+```
+$ nvidia-smi -L
+GPU 0: Graphics Device (UUID: GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5)
+  MIG Device 0: (UUID: MIG-GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5/1/0)
+  MIG Device 1: (UUID: MIG-GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5/1/1)
+  MIG Device 2: (UUID: MIG-GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5/11/0)
+```
+
+### `NVIDIA_MIG_CONFIG_DEVICES`
+This variable controls which of the visible GPUs can have their MIG
+configuration managed from within the container. This includes enabling and
+disabling MIG mode, creating and destroying GPU Instances and Compute
+Instances, etc.
+
+#### Possible values
+* `all`: Allow all MIG-capable GPUs in the visible device list to have their
+  MIG configurations managed.
+
+**Note**:
+* This feature is only available on MIG-capable devices (e.g. the A100).
+* To use this feature, the container must be started with `CAP_SYS_ADMIN` privileges.
+* When not running as `root`, the container user must have read access to the
+  `/proc/driver/nvidia/capabilities/mig/config` file on the host.
+
+### `NVIDIA_MIG_MONITOR_DEVICES`
+This variable controls which of the visible GPUs can have aggregate information
+about all of their MIG devices monitored from within the container. This
+includes inspecting the aggregate memory usage, listing the aggregate running
+processes, etc.
+
+#### Possible values
+* `all`: Allow all MIG-capable GPUs in the visible device list to have their
+  MIG devices monitored.
+
+**Note**:
+* This feature is only available on MIG-capable devices (e.g. the A100).
+* To use this feature, the container must be started with `CAP_SYS_ADMIN` privileges.
+* When not running as `root`, the container user must have read access to the
+  `/proc/driver/nvidia/capabilities/mig/monitor` file on the host.
+
+### `NVIDIA_DRIVER_CAPABILITIES`
+This variable controls which driver libraries/binaries will be mounted inside the container.
+
+#### Possible values
+* `compute,video`, `graphics,utility` …: a comma-separated list of driver features the container needs.
+* `all`: enable all available driver capabilities.
+* *empty* or *unset*: use the default driver capabilities: `utility,compute`.
+
+#### Supported driver capabilities
+* `compute`: required for CUDA and OpenCL applications.
+* `compat32`: required for running 32-bit applications.
+* `graphics`: required for running OpenGL and Vulkan applications.
+* `utility`: required for using `nvidia-smi` and NVML.
+* `video`: required for using the Video Codec SDK.
+* `display`: required for leveraging X11 display.
+
+### `NVIDIA_REQUIRE_*`
+A logical expression to define constraints on the configurations supported by the container.
+
+#### Supported constraints
+* `cuda`: constraint on the CUDA driver version.
+* `driver`: constraint on the driver version.
+* `arch`: constraint on the compute architectures of the selected GPUs.
+* `brand`: constraint on the brand of the selected GPUs (e.g. GeForce, Tesla, GRID).
+
+#### Expressions
+Multiple constraints can be expressed in a single environment variable: space-separated constraints are ORed, comma-separated constraints are ANDed.
+Multiple environment variables of the form `NVIDIA_REQUIRE_*` are ANDed together.
+
+### `NVIDIA_DISABLE_REQUIRE`
+Single switch to disable all the constraints of the form `NVIDIA_REQUIRE_*`.
+
+### `NVIDIA_REQUIRE_CUDA`
+
+The version of the CUDA toolkit used by the container. It is an instance of the generic `NVIDIA_REQUIRE_*` case and it is set by official CUDA images.
+If the version of the NVIDIA driver is insufficient to run this version of CUDA, the container will not be started.
+
+#### Possible values
+* `cuda>=7.5`, `cuda>=8.0`, `cuda>=9.0` …: any valid CUDA version in the form `major.minor`.
+
+### `CUDA_VERSION`
+Similar to `NVIDIA_REQUIRE_CUDA`, for legacy CUDA images.
+In addition, if `NVIDIA_REQUIRE_CUDA` is not set, `NVIDIA_VISIBLE_DEVICES` and `NVIDIA_DRIVER_CAPABILITIES` will default to `all`.
+
+## Usage example
+
+**NOTE:** The use of the `nvidia-container-runtime` as a CLI replacement for `runc` is uncommon and is only provided for completeness.
+
+Although the `nvidia-container-runtime` is typically configured as a replacement for `runc` or `crun` in various container engines, it can also be
+invoked from the command line as `runc` would be. For example:
+
+```sh
+# Set up a rootfs based on Ubuntu 16.04
+cd $(mktemp -d) && mkdir rootfs
+curl -sS http://cdimage.ubuntu.com/ubuntu-base/releases/16.04/release/ubuntu-base-16.04-core-amd64.tar.gz | tar --exclude 'dev/*' -C rootfs -xz
+
+# Create an OCI runtime spec
+nvidia-container-runtime spec
+sed -i 's;"sh";"nvidia-smi";' config.json
+sed -i 's;\("TERM=xterm"\);\1, "NVIDIA_VISIBLE_DEVICES=0";' config.json
+
+# Run the container
+sudo nvidia-container-runtime run nvidia_smi
+```
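
To see what the two `sed` edits in the usage example do, the sketch below applies them to a trimmed, hand-written stand-in for `config.json`. It assumes the generated spec contains `"sh"` in `process.args` and `"TERM=xterm"` in `process.env`, as `runc spec` output typically does. The stand-in file is hypothetical and is not the full output of `nvidia-container-runtime spec`. GNU `sed` is assumed for the `-i` flag.

```sh
# Work in a scratch directory so no real bundle is touched.
cd "$(mktemp -d)"

# Hypothetical, trimmed stand-in for the config.json written by `spec`.
cat > config.json <<'EOF'
{
    "process": {
        "args": [
            "sh"
        ],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "TERM=xterm"
        ]
    }
}
EOF

# Same edits as in the usage example (GNU sed assumed for -i).
sed -i 's;"sh";"nvidia-smi";' config.json
sed -i 's;\("TERM=xterm"\);\1, "NVIDIA_VISIBLE_DEVICES=0";' config.json

# Show the rewritten entrypoint and environment.
cat config.json
```

After the edits, the entrypoint should be `nvidia-smi` and `NVIDIA_VISIBLE_DEVICES=0` should appear in the environment list, which is what lets the runtime select the GPU with index 0 when the container starts.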