# The Experimental NVIDIA Container Runtime

## Introduction

The experimental NVIDIA Container Runtime is a proof-of-concept runtime that
approaches the problem of making GPUs (or other NVIDIA devices) available in
containerized environments in a different manner to the existing
[NVIDIA Container Runtime](../nvidia-container-runtime). Whereas the current
runtime relies on the [NVIDIA Container Library](https://github.com/NVIDIA/libnvidia-container)
to perform the modifications to a container, the experimental runtime aims to
express the required modifications in terms of changes to a container's [OCI
runtime specification](https://github.com/opencontainers/runtime-spec). This
also aligns with open initiatives such as the [Container Device Interface (CDI)](https://github.com/container-orchestrated-devices/container-device-interface).

## Known Limitations

* The paths of the NVIDIA CUDA libraries / binaries injected into the container currently match those of the host system. This means that on an Ubuntu-based host system these would be at `/usr/lib/x86_64-linux-gnu`
  even if the container distribution would normally expect them at another location (e.g. `/usr/lib64`).
* Tools such as `nvidia-smi` may create additional device nodes in the container when run. This is
  prevented in the "classic" runtime (and the NVIDIA Container Library) by modifying
  the `/proc/driver/nvidia/params` file in the container.
* Other `NVIDIA_*` environment variables (e.g. `NVIDIA_DRIVER_CAPABILITIES`) are
  not considered when filtering the mounted libraries or binaries. This is equivalent to
  always using `NVIDIA_DRIVER_CAPABILITIES=all`.

## Building / Installing

The experimental NVIDIA Container Runtime is a self-contained golang binary and
can thus be built from source, or installed directly using `go install`.

### From source

After cloning the `nvidia-container-toolkit` repository from [GitLab](https://gitlab.com/nvidia/container-toolkit/container-toolkit)
or from the read-only mirror on [GitHub](https://github.com/NVIDIA/nvidia-container-toolkit),
running the following make command in the repository root:
```bash
make cmd-nvidia-container-runtime.experimental
```
will create an executable file `nvidia-container-runtime.experimental` in the
repository root.

A dockerized target:
```bash
make docker-cmd-nvidia-container-runtime.experimental
```
will also create the executable file `nvidia-container-runtime.experimental`
without requiring the setup of a development environment, with the exception
of having `make` and `docker` installed.

### Go install

The experimental NVIDIA Container Runtime can also be installed using `go install` by running
the following command:

```bash
go install github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-container-runtime.experimental@experimental
```
which will build and install the `nvidia-container-runtime.experimental`
executable in the `${GOPATH}/bin` folder.

## Using the Runtime

The experimental NVIDIA Container Runtime is intended as a drop-in replacement
for the "classic" NVIDIA Container Runtime. As such it is used in the same
way (with the exception of the known limitations noted above).

In general terms, to use the experimental NVIDIA Container Runtime to launch a
container with GPU support, it should be inserted as a shim for the desired
low-level OCI-compliant runtime (e.g. `runc` or `crun`). How this is achieved
depends on how containers are being launched.

### Docker

In the case of `docker`, for example, the runtime must be registered with the
Docker daemon. This can be done by modifying the `/etc/docker/daemon.json` file
to contain the following:
```json
{
    "runtimes": {
        "nvidia-experimental": {
            "path": "nvidia-container-runtime.experimental",
            "runtimeArgs": []
        }
    }
}
```
The runtime can then be invoked from docker by including the `--runtime=nvidia-experimental`
option when executing a `docker run` command.

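For example, assuming the runtime has been registered as `nvidia-experimental` as shown above, the following sketch launches a container with all GPUs visible (the CUDA image tag is illustrative):

```bash
# Launch a container via the experimental runtime with all GPUs visible.
# The image tag is only an example; any CUDA-enabled image can be used.
docker run --rm --runtime=nvidia-experimental \
    -e NVIDIA_VISIBLE_DEVICES=all \
    nvidia/cuda:11.0-base nvidia-smi
```
The `NVIDIA_VISIBLE_DEVICES` variable used here is described in the Device Selection section below.
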
### Runc

If `runc` is being used to run a container directly, substituting the `runc`
command for `nvidia-container-runtime.experimental` should be sufficient, as
the latter will `exec` to `runc` once the required (in-place) modifications have
been made to the container's OCI spec (`config.json` file).

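As a minimal sketch, assuming an OCI bundle has already been prepared at `/path/to/bundle` (a placeholder path) with the relevant `NVIDIA_*` environment variables set in its `config.json`, the experimental runtime is invoked with the same arguments that would otherwise be passed to `runc`:

```bash
# Run a container from an existing OCI bundle, exactly as one would with runc.
# Both the bundle path and the container ID below are placeholders.
nvidia-container-runtime.experimental run --bundle /path/to/bundle gpu-test
```
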
## Configuration

### Runtime Path

The experimental NVIDIA Container Runtime allows for the path to the low-level
runtime to be specified. This is done by setting the following option in the
`/etc/nvidia-container-runtime/config.toml` file or setting the
`NVIDIA_CONTAINER_RUNTIME_PATH` environment variable.

```toml
[nvidia-container-runtime.experimental]

runtime-path = "/path/to/low-level-runtime"
```
This path can be set to the path for `runc` or `crun` on a system, and if it is
a relative path, the `PATH` is searched for a matching executable.

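For a one-off invocation, the environment variable can be used instead of editing the config file. The sketch below assumes `crun` is installed at `/usr/bin/crun` and reuses the placeholder bundle from the `runc` example above:

```bash
# Select crun as the low-level runtime for this invocation only.
# /usr/bin/crun is an assumed location; adjust it for your system.
NVIDIA_CONTAINER_RUNTIME_PATH=/usr/bin/crun \
    nvidia-container-runtime.experimental run --bundle /path/to/bundle gpu-test
```
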
### Device Selection

In order to select a specific device, the experimental NVIDIA Container Runtime
mimics the behaviour of the "classic" runtime. That is to say that the values of
certain [environment variables](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#environment-variables-oci-spec)
**in the container's OCI specification** control the behaviour of the runtime.

#### `NVIDIA_VISIBLE_DEVICES`

This variable controls which GPUs will be made accessible inside the container.

##### Possible values

* `0,1,2`, `GPU-fef8089b` …: a comma-separated list of GPU UUID(s) or index(es).
* `all`: all GPUs will be accessible; this is the default value in our container images.
* `none`: no GPU will be accessible, but driver capabilities will be enabled.
* `void` or *empty* or *unset*: `nvidia-container-runtime` will have the same behavior as `runc`.

**Note**: When running on a MIG-capable device, the following values will also be available:
* `0:0,0:1,1:0`, `MIG-GPU-fef8089b/0/1` …: a comma-separated list of MIG Device UUID(s) or index(es).

Where the MIG device indices have the form `<GPU Device Index>:<MIG Device Index>` as seen in the example output:
```
$ nvidia-smi -L
GPU 0: Graphics Device (UUID: GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5)
  MIG Device 0: (UUID: MIG-GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5/1/0)
  MIG Device 1: (UUID: MIG-GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5/1/1)
  MIG Device 2: (UUID: MIG-GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5/11/0)
```

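To illustrate the values above, the following sketches use Docker with the runtime registered as `nvidia-experimental`; the image name is illustrative:

```bash
# Expose only the GPU with index 0 to the container.
docker run --rm --runtime=nvidia-experimental \
    -e NVIDIA_VISIBLE_DEVICES=0 \
    nvidia/cuda:11.0-base nvidia-smi -L

# Expose a single MIG device using the <GPU Device Index>:<MIG Device Index> form.
docker run --rm --runtime=nvidia-experimental \
    -e NVIDIA_VISIBLE_DEVICES=0:0 \
    nvidia/cuda:11.0-base nvidia-smi -L
```
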
#### `NVIDIA_MIG_CONFIG_DEVICES`

This variable controls which of the visible GPUs can have their MIG
configuration managed from within the container. This includes enabling and
disabling MIG mode, creating and destroying GPU Instances and Compute
Instances, etc.

##### Possible values

* `all`: Allow all MIG-capable GPUs in the visible device list to have their
  MIG configurations managed.

**Note**:
* This feature is only available on MIG-capable devices (e.g. the A100).
* To use this feature, the container must be started with `CAP_SYS_ADMIN` privileges.
* When not running as `root`, the container user must have read access to the
  `/proc/driver/nvidia/capabilities/mig/config` file on the host.

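A sketch of launching such a container with Docker, granting the `CAP_SYS_ADMIN` capability required above (the image name is illustrative):

```bash
# Allow the MIG configuration of all visible MIG-capable GPUs to be managed
# from inside the container. CAP_SYS_ADMIN is required, hence --cap-add.
docker run --rm --runtime=nvidia-experimental \
    --cap-add SYS_ADMIN \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_MIG_CONFIG_DEVICES=all \
    nvidia/cuda:11.0-base nvidia-smi
```
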
#### `NVIDIA_MIG_MONITOR_DEVICES`

This variable controls which of the visible GPUs can have aggregate information
about all of their MIG devices monitored from within the container. This
includes inspecting the aggregate memory usage, listing the aggregate running
processes, etc.

##### Possible values

* `all`: Allow all MIG-capable GPUs in the visible device list to have their
  MIG devices monitored.

**Note**:
* This feature is only available on MIG-capable devices (e.g. the A100).
* To use this feature, the container must be started with `CAP_SYS_ADMIN` privileges.
* When not running as `root`, the container user must have read access to the
  `/proc/driver/nvidia/capabilities/mig/monitor` file on the host.

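Analogously to the configuration case, a sketch of launching a monitoring container with Docker (the image name is illustrative):

```bash
# Allow aggregate monitoring of all visible MIG-capable GPUs from inside the
# container. CAP_SYS_ADMIN is required, hence --cap-add.
docker run --rm --runtime=nvidia-experimental \
    --cap-add SYS_ADMIN \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_MIG_MONITOR_DEVICES=all \
    nvidia/cuda:11.0-base nvidia-smi
```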