nvidia-container-toolkit

mirror of https://github.com/NVIDIA/nvidia-container-toolkit synced 2025-05-08 14:05:28 +00:00

Author	SHA1	Message	Date
Evan Lezar	181128fe73	Only include by-path-symlinks for injected device nodes Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-02-22 16:53:04 +02:00
Evan Lezar	2680c45811	Add mode constants to nvcdi Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-02-20 16:33:51 +02:00
Evan Lezar	b76808dbd5	Add tests for CDI mode resolution Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-02-20 16:33:33 +02:00
Evan Lezar	4ccb0b9a53	Add and resolve auto discovery mode for cdi generation Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-02-20 14:49:58 +02:00
Evan Lezar	b21dc929ef	Add WSL2 discovery and spec generation These changes add a wsl discovery mode to the nvidia-ctk cdi generate command. If wsl mode is enabled, the driver store for the available devices is used as the source for discovered entities. Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-02-20 10:30:13 +02:00
Evan Lezar	d226925fe7	Construct nvml-based CDI lib based on mode Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-02-20 10:30:13 +02:00
Evan Lezar	5103adab89	Add mode option to nvcdi API Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-02-20 10:30:13 +02:00
Kevin Klues	5710b9e7e8	Add globbing for mounting multiple GSP firmware files Newer drivers have split the GSP firmware into multiple files so a simple match against gsp.bin in the firmware directory is no longer possible. This patch adds globbing capabilities to match any GSP firmware files of the form gsp*.bin and mount them all into the container. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2023-02-16 11:53:36 +00:00
Christopher Desiniotis	a52c9f0ac6	fix: apply options when constructing an instance of the nvcdi library Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>	2023-02-14 16:32:40 -08:00
Evan Lezar	5b110fba2d	Add nvcdi package with basic CDI generation API This change adds an nvcdi package that exposes a basic API for CDI spec generation. This is used from the nvidia-ctk cdi generate command and can be consumed by DRA implementations and the device plugin. Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-02-14 19:52:31 +01:00
Evan Lezar	f72b79cc2a	Move pkg to cmd/nvidia-container-toolkit This change moves the pkg folder to `cmd/nvidia-container-toolkit` to better match go best practices. This allows, for example, for the `cmd/nvidia-container-toolkit` to be go installed. The only package included in `pkg` was `main`. Signed-off-by: Evan Lezar <elezar@nvidia.com>	2021-06-08 15:20:59 +02:00
Evan Lezar	2a92d6acb7	Fix bug where docker swarm device selection is overridden by NVIDIA_VISIBLE_DEVICES This change fixes a bug where the value of NVIDIA_VISIBLE_DEVICES would be used to select devices even if the `swarm-resource` config option is specified. Note that this does not change the value of NVIDIA_VISIBLE_DEVICES in the container. Signed-off-by: Evan Lezar <elezar@nvidia.com>	2021-06-07 14:10:08 +02:00
Evan Lezar	602eaf0e60	Use require package for tests Signed-off-by: Evan Lezar <elezar@nvidia.com>	2021-06-07 13:31:41 +02:00
Evan Lezar	fc408a32c7	Add utility function to get config name from struct Signed-off-by: Evan Lezar <elezar@nvidia.com>	2021-01-22 16:08:45 +01:00
Evan Lezar	f6b1b1afad	Ignore NVIDIA_VISIBLE_DEVICES for containers with insufficient privileges This change ignores the value of NVIDIA_VISIBLE_DEVICES instead of raising an error when launching a container with insufficient permissions. This changes the behaviour under the following conditions: NVIDIA_VISIBLE_DEVICES is set and accept-nvidia-visible-devices-envvar-when-unprivileged = false (default: true) or privileged = false (default: false) This means that a user need not explicitly clear the NVIDIA_VISIBLE_DEVICES environment variable if no GPUs are to be used in unprivileged containers. Note that this envvar is set to 'all' by default in many CUDA images that are used as base images. Signed-off-by: Evan Lezar <elezar@nvidia.com>	2021-01-22 15:34:52 +01:00
Kevin Klues	20604621e4	Add 'compute' capability to list of defaults. For most practical purposes, it should be fine to set NVIDIA_DRIVER_CAPABILITIES=all nowadays. Historically, these different capabilities exist because they were added incrementally, with varying degrees of stability. It's fairly common to run with GPUs in containers today, but a few years ago the driver didn't support them very well, and it was important to make sure the libraries being injected into the container actually worked in a containerized environment. When they didn't, it was common to get information leaks, crashes, or even silent failures. In the past, whenever a new set of libraries was being vetted for injection, a new capability was added to make sure that users had control to explicitly include only those libraries they were comfortable having injected into their containers. The idea being that whoever puts together a container image for use with GPUs should have the knowledge of what capabilities the software in that container image requires, and can set the NVIDIA_DRIVER_CAPABILITIES envvar in that image appropriately. After some back and forth, we've decided it doesn't quite make sense to set it to "all" just yet, but we should set it to "utility, compute" instead of just "utility", so that at least the core CUDA libraries work by default (once installed in the container). Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-12-07 12:10:23 +00:00
Kevin Klues	2c1809475c	Add more tests for new semantics with device list from volume mounts Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-08-07 16:30:31 +00:00
Kevin Klues	7c00385797	Refactor accepting device lists from volume mounts as a boolean Also hard code the "root" path where these volume mounts will be looked for rather than making it configurable. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-08-07 16:30:19 +00:00
Kevin Klues	32b4b09bc9	Add tests to verify priority of device list from mounts vs. envvar Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	e48d23d107	Add test for getDevicesFromMounts() Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	8bcd02ee5d	Add logic implementing getDevicesFromMounts() Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	7313069d4c	Update getDevices() to account for getting the devices list from mounts Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	f46d1861d3	Add stub implementation for getDevicesFromMounts() Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	889ebae1fe	Pull logic to get the device list from ENVVARs out to its own function Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	aec9a28bc3	Push HookConfig and privileged flags down to getDevices() call Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	2ae7cb07cf	Add ability to consider container mounts to generate nvidiaConfig Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	da36874e91	Add new config options to pull device list from mounted files not ENVVAR Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	b9ef2db205	Remove unnecessary files from version control Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	da6fbb343a	Revert "Add ability to merge envvars of the form NVIDIA_VISIBLE_DEVICES_*" This reverts commit `01b4381282`.	2020-07-24 12:50:05 +00:00
Kevin Klues	cc0a22a6d9	Consolidate logic for building nvidiaConfig into a single function Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	430dda41e9	Remove getNvidiaConfigLegacy() function A subsequent commit will add equivalent functionality back in Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	4791fab747	Simplify getMigConfigDevices() and getMigMonitorDevices() Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	a24b0c8b4e	Split isLegacyCUDAImage() into its own helper function Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	0a9dc3c653	Add test to make sure that getNvidiaConfig() operates as expected Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:50:05 +00:00
Kevin Klues	fe65573bdf	Add common CI tests for things like golint, gofmt, unit tests, etc This commit also fixes the minor issues uncovered while running these tests locally. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-24 12:14:26 +00:00
Kevin Klues	4e6e0ed4f1	Add 'ngx' to list of all driver capabilities Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-07-22 13:29:39 +00:00
Kevin Klues	d3aee3e092	Add the 'ngx' driver capability Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-06-24 17:53:42 +00:00
Kevin Klues	c32237f39c	Add support for parsing Linux Capabilities for older OCI specs This was added to fix a regression with support for the default runc shipped with CentOS 7. The version of runc that is installed by default on CentOS 7 is 1.0.0-rc2 which uses OCI spec 1.0.0-rc2-dev. This is a prerelease of the OCI spec, which defines the capabilities section of a process configuration to be a flat list of capabilities (e.g. SYS_ADMIN, SYS_PTRACE, SYS_RAWIO, etc.) https://github.com/opencontainers/runtime-spec/blob/v1.0.0-rc2/config.md#process-configuration By the time the official 1.0.0 version of the OCI spec came out, the capabilities section of a process configuration was expanded to include embedded fields for effective, bounding, inheritable, permitted and ambient (each of which can contain a flat list of capabilities of the form SYS_ADMIN, SYS_PTRACE, SYS_RAWIO, etc.) https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#linux-process Previously, we only inspected the capabilities section of a process configuration assuming it was in the format of OCI spec 1.0.0. This patch makes sure we can parse the capabilities in either format. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-06-03 21:25:13 +00:00
Kevin Klues	8f387816bc	Add support for mig-config and mig-monitor as privileged flags These flags can only be injected into privileged containers. If the container is unprivileged, and one of these flags is specified, then we exit with an error. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-05-15 19:04:10 +00:00
Kevin Klues	05012e7b7f	Extend fields we inspect in the runc spec to include linux capabilities This also includes a helper to look through the capabilities contained in the spec to determine if the container is privileged or not. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-05-15 19:04:10 +00:00
Kevin Klues	01b4381282	Add ability to merge envvars of the form NVIDIA_VISIBLE_DEVICES_* This allows someone to (for example) pass the following environment variables: NVIDIA_VISIBLE_DEVICES_0="0,1" NVIDIA_VISIBLE_DEVICES_1="2,3" NVIDIA_VISIBLE_DEVICES_WHATEVER="4,5" and have the nvidia-container-toolkit automatically merge these into: NVIDIA_VISIBLE_DEVICES="0,1,2,3,4,5" This is useful (for example) if the full list of devices comes from multiple, disparate sources. Note: This will override whatever the original value of NVIDIA_VISIBLE_DEVICES was (excluding its original value) if it also exists as an environment variable already. We exclude the original value to ensure that we have a way to override the default value of NVIDIA_VISIBLE_DEVICES set to "all" inside a container image. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2020-05-15 19:04:05 +00:00
Renaud Gaubert	87c8a868f9	Add binary target and use go mod Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>	2020-04-11 17:18:14 -07:00
Renaud Gaubert	6f4a5a34cf	Init Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>	2019-10-22 14:36:22 -07:00

143 Commits