From cf192169a8982fccf168ada86acfce54dc0984eb Mon Sep 17 00:00:00 2001
From: Evan Lezar
Date: Wed, 8 Dec 2021 21:43:06 +0100
Subject: [PATCH] Add basic README for experimental runtime

Signed-off-by: Evan Lezar
---
 .../README.md | 170 ++++++++++++++++++
 1 file changed, 170 insertions(+)
 create mode 100644 cmd/nvidia-container-runtime.experimental/README.md

diff --git a/cmd/nvidia-container-runtime.experimental/README.md b/cmd/nvidia-container-runtime.experimental/README.md
new file mode 100644
index 00000000..810dada7
--- /dev/null
+++ b/cmd/nvidia-container-runtime.experimental/README.md
@@ -0,0 +1,170 @@

# The Experimental NVIDIA Container Runtime

## Introduction

The experimental NVIDIA Container Runtime is a proof-of-concept runtime that
approaches the problem of making GPUs (or other NVIDIA devices) available in
containerized environments in a different manner from the existing
[NVIDIA Container Runtime](../nvidia-container-runtime). Whereas the current
runtime relies on the [NVIDIA Container Library](https://github.com/NVIDIA/libnvidia-container)
to perform the modifications to a container, the experimental runtime aims to
express the required modifications in terms of changes to a container's [OCI
runtime specification](https://github.com/opencontainers/runtime-spec). This
also aligns with open initiatives such as the [Container Device Interface (CDI)](https://github.com/container-orchestrated-devices/container-device-interface).

## Known Limitations

* The paths of the NVIDIA CUDA libraries / binaries injected into the container currently match those of the host system. This means that on an Ubuntu-based host system these would be at `/usr/lib/x86_64-linux-gnu`,
even if the container distribution would normally expect them at another location (e.g. `/usr/lib64`).
* Tools such as `nvidia-smi` may create additional device nodes in the container when run. This is
prevented in the "classic" runtime (and the NVIDIA Container Library) by modifying
the `/proc/driver/nvidia/params` file in the container.
* Other `NVIDIA_*` environment variables (e.g. `NVIDIA_DRIVER_CAPABILITIES`) are
not considered when filtering the mounted libraries or binaries. This is equivalent to
always using `NVIDIA_DRIVER_CAPABILITIES=all`.

## Building / Installing

The experimental NVIDIA Container Runtime is a self-contained golang binary and
can thus be built from source or installed directly using `go install`.

### From source

After cloning the `nvidia-container-toolkit` repository from [GitLab](https://gitlab.com/nvidia/container-toolkit/container-toolkit)
or from the read-only mirror on [GitHub](https://github.com/NVIDIA/nvidia-container-toolkit),
running the following make command in the repository root:
```bash
make cmd-nvidia-container-runtime.experimental
```
will create an executable file `nvidia-container-runtime.experimental` in the
repository root.

A dockerized target:
```bash
make docker-cmd-nvidia-container-runtime.experimental
```
will also create the executable file `nvidia-container-runtime.experimental`,
without requiring the setup of a development environment beyond having `make`
and `docker` installed.
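
In either case, the resulting binary can then be copied to a directory on the
`PATH` so that container engines can locate it by name. A minimal sketch (the
destination directory is only an example):
```bash
# Install the locally built binary onto the PATH (destination is illustrative)
sudo install -m 755 nvidia-container-runtime.experimental /usr/local/bin/
```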

### Go install

The experimental NVIDIA Container Runtime can also be installed using
`go install` by running the following command:

```bash
go install github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-container-runtime.experimental@experimental
```
which will build and install the `nvidia-container-runtime.experimental`
executable in the `${GOPATH}/bin` folder.

## Using the Runtime

The experimental NVIDIA Container Runtime is intended as a drop-in replacement
for the "classic" NVIDIA Container Runtime. As such, it is used in the same
way (with the exception of the known limitations noted above).

In general terms, to use the experimental NVIDIA Container Runtime to launch a
container with GPU support, it should be inserted as a shim for the desired
low-level OCI-compliant runtime (e.g. `runc` or `crun`). How this is achieved
depends on how containers are being launched.

### Docker
In the case of `docker`, for example, the runtime must be registered with the
Docker daemon. This can be done by modifying the `/etc/docker/daemon.json` file
to contain the following:
```json
{
    "runtimes": {
        "nvidia-experimental": {
            "path": "nvidia-container-runtime.experimental",
            "runtimeArgs": []
        }
    }
}
```
The runtime can then be selected by including the `--runtime=nvidia-experimental`
option when executing a `docker run` command.

### Runc

If `runc` is being used to run a container directly, replacing the `runc`
command with `nvidia-container-runtime.experimental` should be sufficient, as
the latter will `exec` to `runc` once the required (in-place) modifications have
been made to the container's OCI spec (`config.json` file).

## Configuration

### Runtime Path
The experimental NVIDIA Container Runtime allows the path to the low-level
runtime to be specified. This is done by setting the following option in the
`/etc/nvidia-container-runtime/config.toml` file or by setting the
`NVIDIA_CONTAINER_RUNTIME_PATH` environment variable.

```toml
[nvidia-container-runtime.experimental]

runtime-path = "/path/to/low-level-runtime"
```
This can be set to the path of `runc` or `crun` on the system; if it is
a relative path, the `PATH` is searched for a matching executable.

### Device Selection
In order to select a specific device, the experimental NVIDIA Container Runtime
mimics the behaviour of the "classic" runtime. That is to say, the values of
certain [environment variables](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#environment-variables-oci-spec)
**in the container's OCI specification** control the behaviour of the runtime.

#### `NVIDIA_VISIBLE_DEVICES`
This variable controls which GPUs will be made accessible inside the container.

##### Possible values
* `0,1,2`, `GPU-fef8089b` …: a comma-separated list of GPU UUID(s) or index(es).
* `all`: all GPUs will be accessible; this is the default value in our container images.
* `none`: no GPU will be accessible, but driver capabilities will be enabled.
* `void` or *empty* or *unset*: `nvidia-container-runtime` will have the same behaviour as `runc`.
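
For example, assuming the `nvidia-experimental` runtime has been registered
with Docker as shown above, a single GPU could be selected as follows (the
container image shown is only an example):
```bash
# Request GPU 0 only; docker places the variable in the container's OCI spec,
# where the experimental runtime inspects it.
docker run --rm --runtime=nvidia-experimental \
    -e NVIDIA_VISIBLE_DEVICES=0 \
    nvidia/cuda:11.0-base nvidia-smi -L
```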

**Note**: When running on a MIG capable device, the following values will also be available:
* `0:0,0:1,1:0`, `MIG-GPU-fef8089b/0/1` …: a comma-separated list of MIG Device UUID(s) or index(es).

Where the MIG device indices have the form `<GPU Device Index>:<MIG Device Index>` as seen in the example output:
```
$ nvidia-smi -L
GPU 0: Graphics Device (UUID: GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5)
  MIG Device 0: (UUID: MIG-GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5/1/0)
  MIG Device 1: (UUID: MIG-GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5/1/1)
  MIG Device 2: (UUID: MIG-GPU-b8ea3855-276c-c9cb-b366-c6fa655957c5/11/0)
```

#### `NVIDIA_MIG_CONFIG_DEVICES`
This variable controls which of the visible GPUs can have their MIG
configuration managed from within the container. This includes enabling and
disabling MIG mode, creating and destroying GPU Instances and Compute
Instances, etc.

##### Possible values
* `all`: Allow all MIG-capable GPUs in the visible device list to have their
  MIG configurations managed.

**Note**:
* This feature is only available on MIG-capable devices (e.g. the A100).
* To use this feature, the container must be started with `CAP_SYS_ADMIN` privileges.
* When not running as `root`, the container user must have read access to the
  `/proc/driver/nvidia/capabilities/mig/config` file on the host.

#### `NVIDIA_MIG_MONITOR_DEVICES`
This variable controls which of the visible GPUs can have aggregate information
about all of their MIG devices monitored from within the container. This
includes inspecting the aggregate memory usage, listing the aggregate running
processes, etc.

##### Possible values
* `all`: Allow all MIG-capable GPUs in the visible device list to have their
  MIG devices monitored.

**Note**:
* This feature is only available on MIG-capable devices (e.g. the A100).
* To use this feature, the container must be started with `CAP_SYS_ADMIN` privileges.
* When not running as `root`, the container user must have read access to the
  `/proc/driver/nvidia/capabilities/mig/monitor` file on the host.
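
As an illustrative sketch combining the notes above (the container image is
again only an example), MIG monitoring might be enabled as follows:
```bash
# CAP_SYS_ADMIN is required for the MIG monitor capability
docker run --rm --runtime=nvidia-experimental \
    --cap-add=SYS_ADMIN \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_MIG_MONITOR_DEVICES=all \
    nvidia/cuda:11.0-base nvidia-smi
```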