This change adds a --relative-to option to the nvidia-ctk transform root
command. This defaults to "host", maintaining the existing behaviour.
If --relative-to=container is specified, the root transform is applied to
container paths in the CDI specification instead of host paths.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change renames the root transformer to indicate that it
operates on host paths and adds a container root transformer for
explicitly transforming container roots.
The transform.NewRootTransformer constructor still exists, but has
been marked as deprecated.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change switches to using the reflect package to determine
the type of config options instead of inferring the type from the
Toml data structure.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
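As a rough illustration (not the toolkit's actual code), the following sketch shows how reflection over a config struct can yield each option's type, so that raw values from a TOML file can be coerced correctly. The struct and field names here are assumptions for the example only.

```go
package main

import (
	"fmt"
	"reflect"
)

// runtimeConfig is a hypothetical config struct used only for illustration.
type runtimeConfig struct {
	DebugFilePath string   `toml:"debug"`
	Runtimes      []string `toml:"runtimes"`
	SetAsDefault  bool     `toml:"set-as-default"`
}

func main() {
	t := reflect.TypeOf(runtimeConfig{})
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		// The field's Kind (string, slice, bool, ...) determines how a raw
		// TOML value should be converted before assignment.
		fmt.Printf("%s -> %s\n", f.Tag.Get("toml"), f.Type.Kind())
	}
}
```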
This change adds support for an NVIDIA_NVSWITCH environment variable.
When set to `enabled`, this triggers the injection of all available
/dev/nvidia-nvswitch* device nodes.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
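A minimal sketch of the behaviour described above, assuming a simple glob over /dev (the real discovery code may differ):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// nvswitchDevices returns the NVSwitch device nodes to inject when the
// NVIDIA_NVSWITCH environment variable is set to "enabled".
func nvswitchDevices() ([]string, error) {
	if os.Getenv("NVIDIA_NVSWITCH") != "enabled" {
		return nil, nil
	}
	// Matches /dev/nvidia-nvswitch0, /dev/nvidia-nvswitchctl, etc.
	return filepath.Glob("/dev/nvidia-nvswitch*")
}

func main() {
	devices, err := nvswitchDevices()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(devices)
}
```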
This change adds a driver root abstraction that defines how
libraries are located relative to the root. This allows the
driver root to be constructed once and passed to the discovery code.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Instead of relying solely on a static config, we resolve the path
to ldconfig. The path is checked for existence and a .real suffix is preferred.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
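A minimal sketch of the resolution order described above, assuming the configured value is a path such as /sbin/ldconfig:

```go
package main

import (
	"fmt"
	"os"
)

// resolveLDConfig prefers an existing ldconfig.real (as found on some
// Debian-based systems) over the plain ldconfig path.
func resolveLDConfig(candidate string) string {
	if _, err := os.Stat(candidate + ".real"); err == nil {
		return candidate + ".real"
	}
	if _, err := os.Stat(candidate); err == nil {
		return candidate
	}
	// Fall back to the configured value if neither path exists.
	return candidate
}

func main() {
	fmt.Println(resolveLDConfig("/sbin/ldconfig"))
}
```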
This change renames NewGraphicsDiscoverer to NewDRMNodesDiscoverer and
instead calls NewGraphicsMountsDiscoverer explicitly when constructing
a graphics modifier.
This avoids importing config.Config into the discover package,
which leads to a transitive dependency on TOML specifics and
requires that the vendor/github.com/pelletier/ package
be vendored into consumers.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
A driverRoot defines both the driver library root and the
root for device nodes. In the case of preinstalled drivers or
the driver container, these are equal, but in cases such as GKE
they do not match. In this case, drivers are extracted to a folder
and devices exist at the root /.
The changes here add a devRoot option to the nvcdi API that allows the
parent of /dev to be specified explicitly.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
In some cases we might get a permission error when trying to chmod -
most likely this is due to something beyond our control,
such as the whole of `/dev` being mounted.
Do not fail container creation in this case.
Since we lose control of the program after `exec()`-ing the `chmod(1)` program
and therefore cannot handle errors,
refactor to use the `chmod(2)` syscall instead of `exec()`-ing `chmod(1)`.
Fixes: #143
Signed-off-by: Ievgen Popovych <jmennius@gmail.com>
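A hedged sketch of this approach: os.Chmod wraps the chmod(2) syscall, and a permission error is tolerated rather than failing container creation. The path and mode below are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// chmodBestEffort applies the mode via the chmod(2) syscall and ignores
// EPERM, which can occur when /dev is mounted from the host and is outside
// our control.
func chmodBestEffort(path string, mode os.FileMode) error {
	err := os.Chmod(path, mode)
	if errors.Is(err, syscall.EPERM) {
		// Not fatal: do not fail container creation on permission errors.
		return nil
	}
	return err
}

func main() {
	if err := chmodBestEffort("/dev/nvidia-caps", 0o755); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```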
This change skips the update of ld.cache in the container if it
doesn't exist. Instead, the -N flag is used to only create the
relevant symlinks.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since we allow pattern inputs for locating symlinks, we could have
multiple matches. The error being checked is resolved by deduplicating them.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes make library lookups more robust. The core change is that
library lookups now first check a set of predefined locations before checking
the ldcache. This also handles cases where an ldcache is not available more
gracefully.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows CDI devices to be requested as mounts in the
container. This enables their use in environments such as kind
where environment variables or annotations cannot be used.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change refactors the use of the symlink filter to make it extensible.
A blocked filter can be set on the Tegra CSV discoverer to ensure that the correct
symlink libraries are filtered out. Here, globs can be used to select multiple libraries,
and a **/ prefix on the globs indicates that the pattern that follows is only applied to
the filename of the symlink entry in the CSV file.
A --csv.ignore-pattern command line argument is added to the nvidia-ctk cdi generate
command that allows this to be set.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
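A rough sketch of the pattern semantics described above (not the toolkit's implementation): a `**/` prefix means the remainder of the pattern is matched against the entry's filename only; otherwise the full path is matched.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// ignored reports whether a CSV symlink entry matches an ignore pattern.
func ignored(pattern, symlinkPath string) bool {
	target := symlinkPath
	if strings.HasPrefix(pattern, "**/") {
		// Apply the remaining pattern to the filename only.
		pattern = strings.TrimPrefix(pattern, "**/")
		target = filepath.Base(symlinkPath)
	}
	match, _ := filepath.Match(pattern, target)
	return match
}

func main() {
	fmt.Println(ignored("**/libnvidia-*.so", "/usr/lib/aarch64-linux-gnu/libnvidia-encode.so")) // true
	fmt.Println(ignored("**/libcuda.so", "/usr/lib/aarch64-linux-gnu/libnvdc.so"))              // false
}
```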
This change improves the testability of the CSV discoverer.
This is done by adding injection points for mocks for library discovery and
symlink resolution.
Note that this highlights a bug in the current implementation where the
library filter causes valid symlinks to be skipped.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a "required" option to the new toml config
that controls whether a default config is returned or not.
This is useful from the NVIDIA Container Runtime Hook, where
/run/driver/nvidia/etc/nvidia-container-runtime/config.toml
is checked before the standard path.
This fixes a bug where the default config was always applied
when this config was not used.
See https://github.com/NVIDIA/nvidia-container-toolkit/issues/106
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a UsesNVGPUModule function that checks whether the nvgpu
kernel module is used by NVML. This allows for more robust detection of
Tegra-based platforms, where libnvidia-ml.so is also supported and can be used to enumerate the
iGPU.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates go-nvlib to include logic to skip NVIDIA PCI-E
devices where the name or class id is not known.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change renames the csv.library-search-path option to
library-search-path so as to be more generally applicable in
future. Note that the option is still only applied in csv mode.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change simplifies the nvidia-ctk config default command.
By default it now outputs the default config to STDOUT, and can
optionally output this to a file.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change introduces a config.Toml type that is used as the base for
config file processing and manipulation. This ensures that configs --
including commented values -- can be handled consistently.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Migrate to internal/config.Config structs for the NVIDIA Container Runtime Hook config.
See merge request nvidia/container-toolkit/container-toolkit!463
This change ensures that the Config structs from internal.Config
are used for the NVIDIA Container Runtime Hook config too.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change removes installation of the oci-nvidia-hook files.
These files conflict with CDI use in runtimes that support it.
The use of the hook should be considered deprecated on these platforms.
If a hook is required, the
nvidia-ctk runtime configure --config-mode=oci-hook
command should be used to create the hook file(s).
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change extends the nvidia-ctk runtime configure command
with a --config-mode=oci-hook that creates an OCI hook json file.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the centos7 and ubuntu18.04 packages are
published to the generic rpm and deb repos, respectively.
All other packages except the centos8-ppc64le packages are skipped
as these use cases are covered by the generic packages.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
In order to properly handle systems with both iGPU and dGPU
drivers included, we skip "sym" mount specifications which
refer to .so or .so.[1-9] files.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change explicitly generates a CDI specification from
the supplied CSV files when cdi mode is detected. This
ensures consistent behaviour on Tegra-based systems.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
If the config.toml has an empty root specified, this could be
passed to the NVIDIA Container CLI through the --root flag
which causes argument parsing to fail. This change only
adds the --root flag if the config option is specified
and is non-empty.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change generates CDI specifications for Tegra devices
with the nvidia.com/gpu=0 name by default. The type-index
naming strategy is also supported and will generate a device
with the name nvidia.com/gpu=gpu0.
The uuid naming strategy will raise an error if selected.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
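A hypothetical sketch of the naming strategies mentioned above for a single Tegra iGPU at index 0 (the helper name and error texts are assumptions):

```go
package main

import "fmt"

// deviceName sketches the supported naming strategies for a Tegra device.
func deviceName(strategy string, index int) (string, error) {
	switch strategy {
	case "index":
		return fmt.Sprintf("%d", index), nil
	case "type-index":
		return fmt.Sprintf("gpu%d", index), nil
	case "uuid":
		return "", fmt.Errorf("the uuid strategy is not supported for Tegra devices")
	default:
		return "", fmt.Errorf("unknown strategy %q", strategy)
	}
}

func main() {
	for _, s := range []string{"index", "type-index"} {
		name, _ := deviceName(s, 0)
		fmt.Println("nvidia.com/gpu=" + name) // nvidia.com/gpu=0, nvidia.com/gpu=gpu0
	}
}
```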
Since the incoming OCI spec has already been parsed and used to
construct a CUDA image representation, pass this to the CSV
modifier constructor instead of re-creating an image representation.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change sets the default CDI spec dirs at a config level instead
of when a CDI runtime modifier is constructed. This makes this setting
consistent with other options such as the nvidia-ctk path.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Include Shared Compiler Library (libnvidia-gpucomp.so) in the list of compute libraries.
See merge request nvidia/container-toolkit/container-toolkit!442
Path to locate the GSP firmware is explicitly set to /lib/firmware/nvidia.
Users may choose to install the GSP firmware in alternate locations where
the kernel would look for firmware on the root filesystem.
Add locate functionality which looks for the GSP firmware files in the
same location as the kernel would
(https://docs.kernel.org/driver-api/firmware/fw_search_path.html).
The paths searched in order are:
- path described in /sys/module/firmware_class/parameters/path
- /lib/firmware/updates/UTS_RELEASE/
- /lib/firmware/updates/
- /lib/firmware/UTS_RELEASE/
- /lib/firmware/
Signed-off-by: Evan Lezar <elezar@nvidia.com>
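A sketch of the lookup order listed above, assuming a hypothetical helper; the glob pattern used to locate the firmware files is illustrative only:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// gspFirmwareSearchDirs returns the firmware directories in the same order
// the kernel would search them.
func gspFirmwareSearchDirs(utsRelease string) []string {
	dirs := []string{}
	// The customisable path from the kernel's firmware_class module comes first.
	if data, err := os.ReadFile("/sys/module/firmware_class/parameters/path"); err == nil {
		if p := strings.TrimSpace(string(data)); p != "" {
			dirs = append(dirs, p)
		}
	}
	return append(dirs,
		filepath.Join("/lib/firmware/updates", utsRelease),
		"/lib/firmware/updates",
		filepath.Join("/lib/firmware", utsRelease),
		"/lib/firmware",
	)
}

func main() {
	for _, d := range gspFirmwareSearchDirs("6.5.0-generic") {
		matches, _ := filepath.Glob(filepath.Join(d, "nvidia/*/gsp*.bin"))
		fmt.Println(d, matches)
	}
}
```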
The debian and rpm packages are updated to trigger the generation
of a default config if no config exists at the expected location.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the nvidia-ctk config default command
generates a config file that is compatible with the official documentation
to, for example, disable cgroups in the NVIDIA Container CLI.
This requires that whitespace around comments is stripped before outputting the
contents.
This also adds an option to load a config and modify it in-place instead. This can
be triggered as a post-install step, for example.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The stable-dind image is out of date and has not been updated for 3 years.
This change updates to the latest dind image.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change uses the actual discovered path of nvidia-smi when
creating a symlink to the binary on WSL2 platforms.
This ensures that cases where multiple driver store paths are
detected are supported.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change splits the functionality in the internal system package
into two packages: one for dealing with devices and one for dealing
with kernel modules. This removes ambiguity around the meaning of
driver / device roots in each case.
In each case, a root can be specified where device nodes are created
or kernel modules loaded.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds the --nvidia-runtime-dir command line
argument when configuring container runtimes in the toolkit container.
This removes the need to set it via the command line.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --create-device-nodes option to the
nvidia-ctk system create-dev-char-symlinks command to create
device nodes. This currently only creates control device nodes.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes add a --load-kernel-modules option to the
nvidia-ctk system commands. If specified, the NVIDIA kernel modules
(nvidia, nvidia-uvm, and nvidia-modeset) are loaded before any
operations on device nodes are performed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change uses the installed NVIDIA Container Runtime Hook wrapper
as the path in the applied config. This prevents conflicts with other
installations of the NVIDIA Container Toolkit.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-container-runtime-hook.path config option
to allow the path used for the prestart hook to be overridden. This
is useful in cases where multiple NVIDIA Container Toolkit installations
are present.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change switches to generating an OCI runtime hook to create
individual symlinks instead of processing a CSV file in the hook.
This allows for better reuse of the logic generating CDI
specifications, for example.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a symlinks.Resolve function for resolving symlinks and
updates usages across the code to make use of it.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
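A purely illustrative sketch of resolving a chain of symlinks one hop at a time; the real symlinks.Resolve may differ in behaviour and signature:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolve follows a symlink chain until a non-symlink target is reached.
func resolve(path string) (string, error) {
	for i := 0; i < 255; i++ { // guard against symlink loops
		info, err := os.Lstat(path)
		if err != nil {
			return "", err
		}
		if info.Mode()&os.ModeSymlink == 0 {
			return path, nil
		}
		target, err := os.Readlink(path)
		if err != nil {
			return "", err
		}
		if !filepath.IsAbs(target) {
			// Relative link targets are resolved against the link's directory.
			target = filepath.Join(filepath.Dir(path), target)
		}
		path = target
	}
	return "", fmt.Errorf("too many levels of symbolic links")
}

func main() {
	resolved, err := resolve("/usr/lib/x86_64-linux-gnu/libcuda.so")
	fmt.Println(resolved, err)
}
```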
This change updates go-nvlib to ensure that non-MIG-capable GPUs
are skipped when generating CDI specifications for MIG devices.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the csv mode option to be specified in the
nvidia-ctk cdi generate command and adds a --csv.file option
that can be repeated to specify the CSV files to be processed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that libcuda.so can be located on systems
where no patch version is specified in the driver version.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The nvcdi API is extended to allow for merged device options to
be specified. If any options are specified, then a merged device
is generated.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes remove the use of discover.Config which was used
to pass the driver root and the nvidiaCTK path in some cases.
Instead, the nvidiaCTKPath is resolved at the beginning of runtime
invocation to ensure that this is valid at all points where it is
used.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since the default configuration is now platform specific,
there is no need to install specific versions as part of
the package installation.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Also include manifest.txt with, for example, version
info when extracting packages from the packaging image.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a GetDefaultConfigToml function to the config package.
This function returns the default config in the form of raw TOML
including comments. This is useful for generating a default config at
installation time, with platform-specific differences codified.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This ensures that the artifacts associated with a particular
release version are preserved along with the container
images that are used as operands for this version.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a CLI command to generate a default config.
This command checks the host operating system to apply specific
modifications that were previously captured in static config
files.
These include:
* select /sbin/ldconfig or /sbin/ldconfig.real depending on which exists on the host
* set the user to allow device access on SUSE-based systems
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change renames the struct for storing CLI flag values from config
to options to avoid a conflict with the config package.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Generate CDI specifications with 644 permissions to allow non-root clients to consume them
See merge request nvidia/container-toolkit/container-toolkit!381
By default, temporary files are created with permissions 600 and
this means that the files created when updating the ldcache are
not readable in non-root containers.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Support building nvidia-docker and nvidia-container-runtime as dist-independent packages
See merge request nvidia/container-toolkit/container-toolkit!379
In order to add the libnvidia-container0 packages to our ubuntu18.04
kitmaker archive, a workaround was added that downloaded these packages
before constructing the archive. Since the packages have now been
published -- and will not change -- this workaround is no longer needed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the nvidia-docker2 and nvidia-container-runtime
components are not built and distributed for patch releases.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-container-runtime.runtimes config option.
If this is unset no changes are made to the config and the default values are used. This
allows this setting to be overridden in cases where this is required. One such example is
crio where crun is set as the default runtime.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-ctk system create-device-nodes command for
creating NVIDIA device nodes. Currently this is limited to control devices
(nvidia-uvm, nvidia-uvm-tools, nvidia-modeset, nvidiactl).
A --dry-run mode is included for outputting commands that would be executed and
the driver root can be specified.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-container-runtime.modes.cdi.annotation-prefixes config
option that defaults to cdi.k8s.io/. This allows the annotation prefixes parsed
for CDI devices to be overridden in cases where CDI support in container engines such
as containerd or crio needs to be overridden.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows nvcdi.New to return an error in addition to the
constructed library instead of panicking.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
A simplified CDI spec has no duplicate entities in any single set of container edits.
Furthermore, container edits defined at a spec-level are not included in the container
edits for a device.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since we relied on finding libcuda.so in the LDCache to determine both the CUDA
version and the expected directory for the driver libraries, the generation of the
management CDI specifications fails in containers where the LDCache has not been updated.
This change falls back to searching a set of predefined paths instead when the lookup of
libcuda.so in the cache fails.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
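A hedged sketch of the fallback described above: when libcuda.so.* cannot be found via the LDCache, a fixed list of well-known library directories is probed instead. The directory list below is an assumption for illustration only.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// findLibcuda searches a set of predefined locations under driverRoot for
// versioned libcuda libraries.
func findLibcuda(driverRoot string) []string {
	searchDirs := []string{
		"/usr/lib64",
		"/usr/lib/x86_64-linux-gnu",
		"/usr/lib/aarch64-linux-gnu",
		"/lib64",
	}
	for _, dir := range searchDirs {
		matches, _ := filepath.Glob(filepath.Join(driverRoot, dir, "libcuda.so.*.*"))
		if len(matches) > 0 {
			return matches
		}
	}
	return nil
}

func main() {
	fmt.Println(findLibcuda("/"))
}
```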
CDI generation modes such as management and wsl don't require
NVML. This change removes the top-level instantiation of nvmllib
and replaces it with an instantiation in the nvml CDI spec generation
code.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change prefers (non-symlink) sockets at /run over /var/run for
nvidia-persistenced and nvidia-fabricmanager sockets.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds .cdi and .legacy mode-specific runtimes to the list of
runtimes supported by the operator. These are also installed as
part of the NVIDIA Container Toolkit.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change switches to using nvidia-container-runtime.experimental as the
wrapper name over nvidia-container-runtime-experimental. This is consistent
with upcoming mode-specific binaries.
The wrapper is created at nvidia-container-runtime.experimental.real.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the nvidia-container-runtime.modes.cdi.default-kind
to be set in the toolkit-container.
The NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND envvar is used.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The following changes are made:
* The default-cdi-kind config option is used to convert an envvar entry to a fully-qualified device name
* If annotation devices exist, these are used instead of the envvar devices.
* The `all` device is no longer treated as a special case and MUST exist in the CDI spec.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes add support for generating a management spec to the nvcdi API.
A management spec consists of a single CDI device (`all`) which includes all expected
NVIDIA device nodes, driver libraries, binaries, and IPC sockets.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Fix handling of envvars in toolkit container which modify the NVIDIA Container Runtime config
See merge request nvidia/container-toolkit/container-toolkit!317
This change generates device folder permission hooks per device instead of
at a spec level. This ensures that the hook is not injected for a device that
does not have any nested device nodes.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes add a wsl discovery mode to the nvidia-ctk cdi generate command.
If wsl mode is enabled, the driver store for the available devices is used as
the source for discovered entities.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --discovery-mode flag to the nvidia-ctk cdi generate
command and plumbs this through to the CDI API.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change copies dxcore.h and dxcore.c from libnvidia-container to
allow for the driver store path to be queried. Modifications are made
to dxcore to remove the code associated with checking the components
in the driver store path.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the nvidia-container-runtime.mode option to be set
by the toolkit container.
This is controlled by the --nvidia-container-runtime-mode command line
argument and the NVIDIA_CONTAINER_RUNTIME_MODE envvar.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Newer drivers have split the GSP firmware into multiple files so a simple match
against gsp.bin in the firmware directory is no longer possible. This patch
adds globbing capabilities to match any GSP firmware files of the form gsp*.bin
and mount them all into the container.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This change adds an nvcdi package that exposes a basic API for
CDI spec generation. This is used from the nvidia-ctk cdi generate
command and can be consumed by DRA implementations and the device plugin.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-container-runtime.cdi executable that
overrides the runtime mode from the config to "cdi".
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The version for RPM release candidates has the form `1.13.0-0.1.rc.1-1` whereas debian packages have the form `1.13.0~rc.1-1`.
Note that since the `~` is handled in [the same way](https://docs.fedoraproject.org/en-US/packaging-guidelines/Versioning/#_handling_non_sorting_versions_with_tilde_dot_and_caret) as for Debian packages, there does not seem to be a specific reason for this and dealing with multiple version strings in our entire pipeline adds complexity.
This change aligns the package versioning for rpm packages with Debian packages.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This makes the intent of the command line argument clearer since this
relates specifically to the root where the NVIDIA driver is installed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the update-ldcache hook is created in a manner
consistent with other nvidia-ctk hooks ensuring that a full path is
used.
Without this change the update-ldcache hook on Tegra-based systems had an
invalid path.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
If this is not done, the default config which sets the nvidia-ctk.path
option as "nvidia-ctk" will result in an invalid OCI spec if a hook is
injected. This change ensures that the path used is always an absolute
path as required by the hook spec.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change simplifies how Kitmaker archives are constructed.
Currently only centos8 and ubuntu18.04 packages are included.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change uses the `index` mode for the --device-name-strategy when
generating CDI specifications by default. This generates device names
such as nvidia.com/gpu=0 or nvidia.com/gpu=1:0 by default.
Note that this requires a CDI spec version of 0.5.0. For consumers
(e.g. podman) that are only compatible with older versions, one of the
other strategies (`type-index` or `uuid`) should be used instead to
generate a v0.3.0 or v0.4.0 specification.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --create-all mode to the create-dev-char-symlinks hook.
This mode creates all possible symlinks to device nodes for regular and cap
devices, with the number of GPUs inferred from the PCI device information.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --watch option to the create-dev-char-symlinks hook. This
installs an fsnotify watcher that creates symlinks for ADDED device nodes under
/dev/char.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-ctk hook create-dev-char-symlinks
subcommand that creates symlinks to device nodes (as required by
systemd) under /dev/char.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --device-name-strategy flag for generating a CDI
specification. This allows a CDI spec to be generated with the following
names used for devices:
* type-index: gpu0 and mig0:1
* index: 0 and 0:1
* uuid: GPU and MIG UUIDs
Note that the use of 'index' generates a v0.5.0 CDI specification since
this relaxes the restriction on the device names.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the first match of an executable in the path
is returned instead of a list of candidates. This prevents a CDI spec,
for example, from containing multiple entries for a single executable
(e.g. nvidia-smi).
Signed-off-by: Evan Lezar <elezar@nvidia.com>
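A minimal sketch of this behaviour, assuming a PATH-style search (the helper name and executability check are illustrative, not the toolkit's locator API):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// locateFirst returns only the first executable named `name` found on PATH,
// rather than every candidate.
func locateFirst(name string) (string, bool) {
	for _, dir := range filepath.SplitList(os.Getenv("PATH")) {
		candidate := filepath.Join(dir, name)
		if info, err := os.Stat(candidate); err == nil && !info.IsDir() && info.Mode()&0o111 != 0 {
			return candidate, true // stop at the first match
		}
	}
	return "", false
}

func main() {
	fmt.Println(locateFirst("nvidia-smi"))
}
```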
This change uses functionality from the CDI package to determine
the minimum required CDI spec version. This allows for a spec with
the widest compatibility to be specified.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change implements the discovery of versioned driver libraries
by reusing the mounts and update-ldcache discoverers used for, for example,
CSV file discovery. This allows the container paths to be correctly generated
without requiring specific manipulation.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an --nvidia-ctk-path option to the nvidia-ctk cdi generate
command. This ensures that the executable path for the generated
hooks can be specified consistently.
Since the NVIDIA Container Runtime already allows for the executable
path to be specified in the config, the utility code to update the
LDCache and create other nvidia-ctk hooks is also updated.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change refactors the generation of CDI specifications
to use discoverers and generate the CDI specifications from these
discoverers. This allows for better reuse.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The HostPath field was added in the v0.5.0 CDI specification.
The cdi package uses strict unmarshalling when loading specs
from file causing failures for unexpected fields.
Since the behaviour for HostPath == "" and HostPath == Path are
equivalent, we clear HostPath if it is equal to Path to ensure
compatibility with the widest range of specs.
This allows, for example, a v0.4.0 spec to be generated as required.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
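A minimal sketch of the compatibility tweak described above: drop HostPath when it merely repeats Path so the spec also validates against pre-v0.5.0 CDI schemas. The struct here is a simplified stand-in for the CDI device-node type.

```go
package main

import "fmt"

// deviceNode is a simplified stand-in for a CDI device node entry.
type deviceNode struct {
	Path     string
	HostPath string // added in the v0.5.0 CDI specification
}

// normalize clears HostPath when it is equal to Path, since the two forms
// are semantically equivalent and the empty form is accepted by older specs.
func normalize(d *deviceNode) {
	if d.HostPath == d.Path {
		d.HostPath = ""
	}
}

func main() {
	d := deviceNode{Path: "/dev/nvidia0", HostPath: "/dev/nvidia0"}
	normalize(&d)
	fmt.Printf("%+v\n", d)
}
```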
This change replaces the `--json` flag of the nvidia-ctk cdi generate
command with a --format flag that accepts a string format of either
json or yaml.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change extends the support for multiple envvars when
specifying swarm resources to consider ALL of the specified
environment variables instead of the first match.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Inject DRM device nodes into containers when Graphics or Display capabilities are requested
See merge request nvidia/container-toolkit/container-toolkit!235
This change adds the discovery of DRM devices associated with requested
devices. This means that the /dev/dri/card* and /dev/dri/renderD*
devices associated with each requested NVIDIA GPU are injected into
the container and that the /dev/dri/by-path symlinks associated with
these devices are created in the container.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds support for filtering entities by specifying a filter.
This can be used, for example, to check whether a mount or device
has a particular property, removing it from the set of discovered
entities if it does not.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a Devices abstraction to the CUDA image utilities. This
allows for checking whether a device is selected, for example.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Add nvidia-ctk hook chmod command to set permissions and ensure permissions of `/dev/nvidia-caps` is set
See merge request nvidia/container-toolkit/container-toolkit!232
This change generates one or more createContainer hooks for ensuring
that subfolders in /dev have the required permissions in the container.
As an example, a user requires read permissions on the /dev/nvidia-caps
folder in addition to the specific caps devices included under this folder.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-ctk hook chmod command that can be used
to update the permissions for paths in the container.
This prepends the container root to the paths to allow these to be
updated by runtime executables.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the CDI spec mounts the ipc sockets with the
noexec flag to allow these to function in rootless mode with podman.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the docker config update logic for simplicity.
This also allows for the API to match the crio update code.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This adds support for updating crio configs (instead of installing hooks)
and adds crio support to the nvidia-ctk runtime configure command.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the ordering of internal pipeline dependencies to
ensure that the correct rules are applied.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change includes meta devices (e.g. /dev/nvidiactl) in the
generated CDI spec. Missing device nodes are ignored.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change generates a v0.4.0 CDI spec instead of a v0.5.0 spec.
This allows older versions of podman, for example, to be used.
This requires that the device names do not start with a numeric character
and that the HostPath for a device is unspecified.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the swarm-resource config option to specify a
comma-separated list of environment variables instead of a single
environment variable.
The first environment variable matched is considered and other
environment variables are ignored.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
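An illustrative sketch of this lookup, assuming a comma-separated swarm-resource value; the environment variable names and the device value below are hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// swarmResourceDevices returns the value of the first environment variable
// named in the comma-separated swarm-resource option that is set.
func swarmResourceDevices(swarmResource string) (string, bool) {
	for _, envvar := range strings.Split(swarmResource, ",") {
		if value, ok := os.LookupEnv(strings.TrimSpace(envvar)); ok {
			return value, true
		}
	}
	return "", false
}

func main() {
	os.Setenv("DOCKER_RESOURCE_GPU", "GPU-example-uuid")
	devices, ok := swarmResourceDevices("DOCKER_RESOURCE_NVIDIA_GPU,DOCKER_RESOURCE_GPU")
	fmt.Println(devices, ok)
}
```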
This change adds functionality to generate CDI specifications
for all devices detected on the system. A specification containing
all GPUs and MIG devices is generated. All libraries on the host
ldcache that have an NVIDIA Driver Version suffix are included as
are the required binaries and IPC sockets.
A hook (based on the nvidia-ctk hook subcommand) to update the ldcache
in the container for the libraries being injected is also added to the
CDI specification.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the NVIDIA Container Runtime to inject vulkan
loaders and libraries by modifying the OCI runtime specification.
This allows vulkan applications to run in containers without
additional modifications.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a Locator that can be used to locate libraries.
If library names are specified, the ldcache is searched; otherwise
symlinks are resolved.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that a more concrete error is provided by the NVIDIA
Container Runtime if the NVIDIA Container Runtime hook cannot be
located.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the destination / root to be set as the
first positional argument OR as a command line flag. This
allows for the GPU Operator to transition to a case where
only the flag / envvar is used.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This reuses the docker file for yum-based rpm distros (centos, amazonlinux)
instead of maintaining two files with the same contents.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change splits the nvidia-container-toolkit package into the top-level package and
an nvidia-container-toolkit-base package.
The nvidia-container-toolkit-base package allows the NVIDIA Container Runtime and
NVIDIA Container Toolkit CLI to be installed on systems without requiring that the
NVIDIA Container Runtime Hook and the transitive dependencies included in the NVIDIA
Container Library and NVIDIA Container CLI also be installed.
This allows the runtime to be used on systems where the CSV or CDI mode of the runtime
is used exclusively.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a modifier that injects the Tegra platform files
* /etc/nv_tegra_release
* /sys/devices/soc0/family
allowing these files to be used for platform detection in a containerized
context such as the GPU device plugin.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the
* accept-nvidia-visible-devices-envvar-when-unprivileged
* accept-nvidia-visible-devices-as-volume-mounts
options to be set in the toolkit-container. These are controlled
by command line flags or the following environment variables:
* ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
* ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a root member to the mounts type that is used to
perform most of the lookups for files and devices. This allows
for consistent handling of relative paths.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change improves the error message when invoking the NVIDIA
Runtime Hook in non-legacy mode. This should guide users to specifying
the --runtime=nvidia flag when using docker.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The Relative method added to the Locator interface was
not correctly implemented in the file type. The root was
never set when instantiating the object.
This change removes this method from the interface and the file
type, switching to a local implementation in the mounts type
instead.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since the creation of symlinks may include other libraries / folders
the ldcache should be updated AFTER the symlinks are created.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a `runtime configure` command to the nvidia-ctk CLI. This
command is currently limited to configuring the docker config on the
system by modifying the daemon.json config file associated with docker.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
In preparation for adding a command to the nvidia-ctk CLI to modify
the docker config, this change refactors load, update, and flush logic
from the toolkit container docker CLI to an internal package.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-container-runtime.modes.cdi.spec-dirs
config option that allows the default spec dirs to be overridden.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change reuses the code that checks for the existing NVIDIA
Container Runtime hook to ensure that both nvidia-container-toolkit
and nvidia-container-runtime-hook are detected.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change renames the nvidia-container-toolkit executable
to nvidia-container-runtime-hook. Here nvidia-container-toolkit
is created as a symlink to nvidia-container-runtime-hook.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
For rc releases we allow nvidia-container-toolkit versions
to not match libnvidia-container versions. This change ensures
that only changed packages are released.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This mirrors what is done in cri-o and allows device nodes
from, for example, the driver container to be injected into a
container at /dev instead of <ROOT>/dev.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a charDevices discoverer and uses this
for CSV, GDS, and MOFED discovery. Internally the discoverer
is a "mounts" discoverer with a charDevice locator.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Instead of creating a set of discoverers per file, this change creates
a discoverer per type by first concatenating the mount specifications
from all files. This will allow all device nodes, for example, to
be treated as a single device.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This adds a Relative function to the Locator interface and uses
this to determine the host and container paths for located files
(and devices). This ensures that the root (e.g. the nvidia driver
root) is stripped from the container path.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change creates GDS and MOFED modifiers and adds them to the
modifier created for the selected runtime mode if the NVIDIA_GDS
and NVIDIA_MOFED envvars are set to "enabled", respectively.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change uses modifier composition and the discoverModifier to
refactor the existing CSV modifier.
This change adds a discoverModifier to the internal/modifier package and
refactors the CSV modifier to use this abstraction.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change prevents errors when downloading ubuntu repos on
amd64 architectures. The `stable` images were last pushed
2 years ago.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This ensures that older versions of containerd that may be expecting
this over options.BinaryName should continue to work.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This fixes a bug where the runtime path for v1 containerd configs
was specified in the options.Runtime setting (which is used
for the default runtime) instead of options.BinaryName.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the variables used to construct the
version strings for CMDs are set in the makefile.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds version output to the nvidia-container-runtime,
nvidia-container-toolkit, and nvidia-ctk CLIs. The same version
is used in all cases and includes a version string and a git
revision if set.
The construction of the version string mirrors what is done in runc.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes replace the nvidia-container-runtime config options
experimental and discover-mode with a single mode config option.
Note that mode is now a string with a default value of "auto"
and a mode value of "legacy" is equivalent to experimental == false.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the create-symlinks hook to also create symlinks for
libcuda.so, libGLX_indirect.so.0, and libnvidia-opticalflow.so
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a GetContainerRoot to the oci.State type to
encapsulate the logic around determining the container root.
This fixes a bug where relative roots (e.g. as generated by containerd)
are not supported.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change processes and supports runc logging command line arguments.
This allows for better integration into container engines such as
docker.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This also removes a test that invokes nvidia-container-runtime run --bundle
expecting an error. This test is no longer valid since this command line
is forwarded to runc where the error should be detected.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a Requirements abstraction that can be used to check
an image's NVIDIA_REQUIRE_* envvars against the host properties such
as CUDA version or architecture.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a CUDA image abstraction that encapsulates
the queries performed on a container image (e.g. envvars) to
check certain CUDA properties.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a DefaultExecutableDir = /usr/bin constant that is used
to construct default paths for executables instead of specifying these
explicitly.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a cache to the mounts type. This means that if called to get
a list of folders, for example, the result is reused instead of recalculated.
This also avoids duplicate logging.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a discovered hook for updating the ldcache as a container-create
hook. The mounts from a discoverer are inspected to determine the folders that must
be added to the cache using the nvidia-ctk hook update-ldcache command.
This is added to the "csv" discovery mode for the experimental runtime.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-ctk CLI that is used as the basis for
utilities related to the NVIDIA Container Toolkit.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an 'auto' discover mode that attempts to select the correct mode
for a given platform. This currently attempts to detect whether the platform is a
Tegra-based system in which case the 'csv' discover mode is used. The 'legacy'
discover mode is used as the fallback.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that by default, the CSV discovery only considers the base CSV
files (l4t.csv, drivers.csv, devices.csv) and skips the rest unless
NVIDIA_REQUIRE_JETPACK is set to "csv-mounts=all", in which case all CSV files in the
specified folder are considered.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds support for a "csv" discovery mode to the experimental runtime.
If this is set with experimental = true, a CSV-based discovery of devices and
mounts is used to define the modifications required to the OCI spec. The edits
are expressed as CDI ContainerEdits.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a symlink locator that follows symlinks and returns all
elements in the chain and a device locator that finds character devices.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change enables the experimental mode of the NVIDIA Container Runtime. If
enabled, the nvidia-container-runtime.discover-mode config option is
queried to determine how required OCI spec modifications should be defined.
If "legacy" is selected, the existing NVIDIA Container Runtime hook is
discovered and injected into the OCI spec.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an experimental option to the NVIDIA Container Runtime config. To
simplify the extension of this experimental mode in future an error is raised if
this is enabled.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Allows CI/CD environment variables to quickly disable any release job derived from the .release:external template
Template usage:
* DRYRUN_RELEASE: set to a value to echo docker and regctl commands in the Makefile without running them (dry run).
* SKIP_RELEASE: set to a value to remove the job from the pipeline.
CI/CD usage:
* NGC_SKIP_RELEASE: set to disable the external release to NGC.
* DOCKERHUB_SKIP_RELEASE: set to disable the external release to DH.
* NGC_DRYRUN_RELEASE: set to dry-run the external release to NGC.
* DOCKERHUB_DRYRUN_RELEASE: set to dry-run the external release to DH.
This change moves the code defining the insertion of the nvidia-container-runtime
hook to a separate package. This allows for better distinction between the existing
and experimental modifications.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change imports the modifying runtime abstraction from the
experimental branch. This encapsulates the checks for whether
modification is required, and forwards the loaded spec to
the specified modifier. This allows for the same code to be
reused when performing more complex modifications.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change removes unneeded logging and renames the return error value to rerr
to avoid it being aliased by local error values.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This is required to ensure that a newer version of
github.com/opencontainers/runtime-tools/generate is imported for use
with CDI.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change makes the following version bumps:
* nvidia-container-toolkit to 1.10.0-rc.1
* nvidia-container-runtime to 3.10.0-rc.1
* nvidia-docker to 2.10.0-rc.1
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --restart-mode option to the docker config CLI.
This mirrors the option added for containerd and allows 'none' to be
specified to disable the restart of docker. This is useful in
cases where the updated docker config should be reloaded out of
band.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds arm64/aarch64 images to supported distributions.
This is triggered if BUILD_MULTI_ARCH_IMAGE=true.
Note that for ubi8 images this means that we switch to using centos8
packages instead of centos7 since we do not build aarch64 packages
for the latter.
This also means that for centos7 we only build x86_64 images.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows for docker buildx to be used to build container
images. This also allows multi-arch images to be built.
In addition to using docker buildx to build images, regctl is used as a
replacement for the docker push command to release images; this
tooling also supports multi-arch images.
The selection of docker buildx (and regctl) is controlled by a
BUILD_MULTI_ARCH_IMAGES make variable. If this is 'true',
the build-% make targets for the toolkit container will be
run through buildx and the equivalent push-% targets will trigger
a regctl command.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change makes the following version bumps:
* nvidia-container-toolkit to 1.8.1
* nvidia-container-runtime to 3.8.1
* nvidia-docker to 2.9.1
Signed-off-by: Evan Lezar <elezar@nvidia.com>
As of the NVIDIA Container Toolkit 1.8.0-rc.1 the libnvidia-container*
packages also provide a libnvidia-container-go library. This must also
be installed in the toolkit container.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the CVE_UPGRADES build arg to be set
to address CVEs in base images instead of requesting waivers.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change enables the release of toolkit-container images from this
repository instead of the container-config repository. This ensures
that these images are released along with the packages for the
NVIDIA Container Toolkit components.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change pulls images from public staging repositories to scan
and release. This ensures that the bits built and tested in public
CI (off the master branch, for example) match those scanned and
released. This also serves to reduce the load on our internal CI
runners as these don't have to store artifacts and build images.
Two CI variables, STAGING_REGISTRY and STAGING_VERSION, are used
to control which image is pulled for release, with the latter
defaulting to the CI_COMMIT_SHORT_SHA.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a jetpack-specific config.toml file which specifies
supported-driver-capabilities to remove the unsupported ngx capability.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds support for a supported-driver-capabilities config
option in the config.toml file that allows the driver capabilities
associated with the NVIDIA_DRIVER_CAPABILITIES=all environment variable to be limited.
This can be used on platforms such as Jetson to remove unsupported
capabilities such as "ngx".
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the imported OCI runtime spec to a3c33d663ebc which includes
the ability to override the return code for syscalls. This is used by docker for
the clone3 syscall, for example.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds CI definitions for building the toolkit-container
images. This modifies the existing CI and replaces the build-one
stage with multiple stages that do the following:
* perform the standard golang checks
* build the packages required by the images
* build the images for supported platforms
* release the images (currently to the CI staging registry)
The build-all stage is included as a final step in the CI. This is
run after the release stage as the target platforms are not required
from an imaging perspective. The build-all stage is only run on
MRs or tagged builds.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds platform-specific Dockerfiles and a Makefile
to build the toolkit-container images.
This image builds the container-config commands from the tools
directory and installs the components of the NVIDIA Container Toolkit
directly from the nvidia-container-toolkit and libnvidia-container*
packages in the dist directory.
This includes make targets for the centos7, centos8, ubuntu18.04,
and ubi8 container-toolkit images as well as the container tests
make targets implemented in the container-config repository.
Files adapted from:
383587f766
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change copies the code from container-config/cmd to
tools/container. This allows the code to be built and
added to the container image without additional refactoring.
As the configuration utilities are incorporated into the cmds
of the nvidia-container-toolkit, the code will be moved from tools.
Files copied from:
383587f766
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows for upgrade workflows to be tested in the
release test containers. To achieve this, a script is added
to configure the test repositories, leaving the defaults installed
initially.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change uses the build image directly in CI instead of
using dind and invoking the docker-* make targets.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the git submodules for nvidia-docker and
nvidia-container-runtime to contain the package fixes and
code cleanup.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that at least the same libnvidia-container-tools
version is required when installing nvidia-container-toolkit.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The relationship between packages also considers the package revision
when determining validity. This means that 3.5.0-1 is considered
greater than 3.5.0. This change adds the package revision to the
nvidia-container-runtime breaks / replaces relationship.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change copies the cmd/nvidia-container-runtime, internal, and test
folders from github.com/NVIDIA/nvidia-container-runtime@8a63b4b34f3ce3b4167f0516aa3f7207ca280dfb
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds support for the NVIDIA_FABRIC_DEVICES envvar. The (non-empty)
value of this envvar is passed to the NVIDIA Container CLI using the --fabric-device
command line flag and allows for nvswitch and nvlink devices to be mounted
into the container.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change improves the CI for the container toolkit. The go targets are
executed in a docker container which allows for reproducible behaviour on
local systems as well as CI. The Makefile is updated to facilitate this.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change moves the pkg folder to `cmd/nvidia-container-toolkit` to
better match go best practices. This allows, for example, for the
`cmd/nvidia-container-toolkit` to be go installed.
The only package included in `pkg` was `main`.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change fixes a bug where the value of NVIDIA_VISIBLE_DEVICES would be used to
select devices even if the `swarm-resource` config option is specified.
Note that this does not change the value of NVIDIA_VISIBLE_DEVICES in the container.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change simplifies the build process by only targeting ubuntu20.04-amd64
and adds logic to push tagged builds to artifactory.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds check targets for Golang to the make file. These are also
added as stages to the Jenkinsfile definition and the GitLab CI
is modified to use them too.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ignores the value of NVIDIA_VISIBLE_DEVICES instead of
raising an error when launching a container with insufficient permissions.
This changes the behaviour under the following conditions:
NVIDIA_VISIBLE_DEVICES is set
and
accept-nvidia-visible-devices-envvar-when-unprivileged = false (default: true)
or
privileged = false (default: false)
This means that a user need not explicitly clear the NVIDIA_VISIBLE_DEVICES
environment variable if no GPUs are to be used in unprivileged containers.
Note that this envvar is set to 'all' by default in many CUDA images that
are used as base images.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
For most practical purposes, it should be fine to set
NVIDIA_DRIVER_CAPABILITIES=all nowadays.
Historically, these different capabilities exist because they were added
incrementally, with varying degrees of stability. It's fairly common to
run with GPUs in containers today, but a few years ago the driver didn't
support them very well, and it was important to make sure the libraries
being injected into the container actually worked in a containerized
environment. When they didn't, it was common to get information leaks,
crashes, or even silent failures.
In the past, whenever a new set of libraries was being vetted for
injection, a new capability was added to make sure that users had control
to explicitly include only those libraries they were comfortable having
injected into their containers.
The idea being that whoever puts together a container image for use with
GPUs should have the knowledge of what capabilities the software in that
container image requires, and can set the NVIDIA_DRIVER_CAPABILITIES
envvar in that image appropriately.
After some back and forth, we've decided it doesn't quite make sense to
set it to "all" just yet, but we should set it to "utility, compute"
instead of just "utility", so that at least the core CUDA libraries work
by default (once installed in the container).
Signed-off-by: Kevin Klues <kklues@nvidia.com>
* Skip update of ldcache in containers without ldconfig. The .so.SONAME symlinks are still created.
* Normalize ldconfig path on use. This automatically adjusts the ldconfig setting applied to ldconfig.real on systems where this exists.
* Include `nvidia/nvoptix.bin` in list of graphics mounts.
* Include `vulkan/icd.d/nvidia_layers.json` in list of graphics mounts.
* Add support for `--library-search-paths` to `nvidia-ctk cdi generate` command.
* Add support for injecting /dev/nvidia-nvswitch* devices if the NVIDIA_NVSWITCH=enabled envvar is specified.
* Added support for `nvidia-ctk runtime configure --enable-cdi` for the `docker` runtime. Note that this requires Docker >= 25.
* Fixed bug in `nvidia-ctk config` command when using `--set`. The types of applied config options are now applied correctly.
* Add `--relative-to` option to `nvidia-ctk transform root` command. This controls whether the root transformation is applied to host or container paths.
* [libnvidia-container] Fix device permission check when using cgroupv2 (fixes #227)
## v1.14.3
* [toolkit-container] Bump CUDA base image version to 12.2.2.
## v1.14.2
* Fix bug on Tegra-based systems where symlinks were not created in containers.
* Add --csv.ignore-pattern command line option to nvidia-ctk cdi generate command.
## v1.14.1
* Fixed bug where contents of `/etc/nvidia-container-runtime/config.toml` are ignored by the NVIDIA Container Runtime Hook.
* [libnvidia-container] Use libelf.so on RPM-based systems due to removed mageia repositories hosting pmake and bmake.
## v1.14.0
* Promote v1.14.0-rc.3 to v1.14.0
## v1.14.0-rc.3
* Added support for generating OCI hook JSON file to `nvidia-ctk runtime configure` command.
* Remove installation of OCI hook JSON from RPM package.
* Refactored config for `nvidia-container-runtime-hook`.
* Added a `nvidia-ctk config` command which supports setting config options using a `--set` flag.
* Added `--library-search-path` option to `nvidia-ctk cdi generate` command in `csv` mode. This allows folders where
libraries are located to be specified explicitly.
* Updated go-nvlib to support devices which are not present in the PCI device database. This allows the creation of /dev/char symlinks on systems with such devices installed.
* Added `UsesNVGPUModule` info function for more robust platform detection. This is required on Tegra-based systems where libnvidia-ml.so is also supported.
* [toolkit-container] Set `NVIDIA_VISIBLE_DEVICES=void` to prevent injection of NVIDIA devices and drivers into the NVIDIA Container Toolkit container.
## v1.14.0-rc.2
* Fix bug causing incorrect nvidia-smi symlink to be created on WSL2 systems with multiple driver roots.
* Remove dependency on coreutils when installing package on RPM-based systems.
* Create output folders if required when running `nvidia-ctk runtime configure`
* Generate default config as post-install step.
* Added support for detecting GSP firmware at custom paths when generating CDI specifications.
* Added logic to skip the extraction of image requirements if `NVIDIA_DISABLE_REQUIRES` is set to `true`.
* [libnvidia-container] Include Shared Compiler Library (libnvidia-gpucomp.so) in the list of compute libraries.
* [toolkit-container] Ensure that common envvars have higher priority when configuring the container engines.
* [toolkit-container] Bump CUDA base image version to 12.2.0.
* [toolkit-container] Remove installation of nvidia-experimental runtime. This is superseded by the NVIDIA Container Runtime in CDI mode.
## v1.14.0-rc.1
* Add support for updating containerd configs to the `nvidia-ctk runtime configure` command.
* Create file in `etc/ld.so.conf.d` with permissions `644` to support non-root containers.
* Generate CDI specification files with `644` permissions to allow rootless applications (e.g. podman)
* Add `nvidia-ctk cdi list` command to show the known CDI devices.
* Add support for generating merged devices (e.g. `all` device) to the nvcdi API.
* Use the `*.*` pattern to locate libcuda.so when generating a CDI specification to support platforms where a patch version is not specified.
* Update go-nvlib to skip devices that are not MIG capable when generating CDI specifications.
* Fix bug in creation of `/dev/char` symlinks by failing operation if kernel modules are not loaded.
* Add option to load kernel modules when creating device nodes
* Add option to create device nodes when creating `/dev/char` symlinks
* [libnvidia-container] Support OpenSSL 3 with the Encrypt/Decrypt library
* [toolkit-container] Allow same envars for all runtime configs
## v1.13.1
* Update `update-ldcache` hook to only update ldcache if it exists.
* Update `update-ldcache` hook to create `/etc/ld.so.conf.d` folder if it doesn't exist.
* Fix failure when libcuda cannot be located during XOrg library discovery.
* Fix CDI spec generation on systems that use `/etc/alternatives` (e.g. Debian)
## v1.13.0
* Promote 1.13.0-rc.3 to 1.13.0
## v1.13.0-rc.3
* Only initialize NVML for modes that require it when running `nvidia-ctk cdi generate`.
* Prefer /run over /var/run when locating nvidia-persistenced and nvidia-fabricmanager sockets.
* Fix the generation of CDI specifications for management containers when the driver libraries are not in the LDCache.
* Add transformers to deduplicate and simplify CDI specifications.
* Generate a simplified CDI specification by default. This means that entities in the common edits in a spec are not included in device definitions.
* Also return an error from the nvcdi.New constructor instead of panicking.
* Detect XOrg libraries for injection and CDI spec generation.
* Add `nvidia-ctk system create-device-nodes` command to create control devices.
* Add `nvidia-ctk cdi transform` command to apply transforms to CDI specifications.
* Add `--vendor` and `--class` options to `nvidia-ctk cdi generate`
* [libnvidia-container] Fix segmentation fault when RPC initialization fails.
* [libnvidia-container] Build centos variants of the NVIDIA Container Library with static libtirpc v1.3.2.
* [libnvidia-container] Remove make targets for fedora35 as the centos8 packages are compatible.
* [toolkit-container] Add `nvidia-container-runtime.modes.cdi.annotation-prefixes` config option that allows the CDI annotation prefixes that are read to be overridden.
* [toolkit-container] Create device nodes when generating CDI specification for management containers.
* [toolkit-container] Add `nvidia-container-runtime.runtimes` config option to set the low-level runtime for the NVIDIA Container Runtime
## v1.13.0-rc.2
* Don't fail chmod hook if paths are not injected
* Only create `by-path` symlinks if CDI devices are actually requested.
* Fix possible blank `nvidia-ctk` path in generated CDI specifications
* Fix error in postun scriptlet on RPM-based systems
* Only check `NVIDIA_VISIBLE_DEVICES` for environment variables if no annotations are specified.
* Add `cdi.default-kind` config option for constructing fully-qualified CDI device names in CDI mode
* Add support for `accept-nvidia-visible-devices-envvar-unprivileged` config setting in CDI mode
* Add `nvidia-container-runtime-hook.skip-mode-detection` config option to bypass mode detection. This allows `legacy` and `cdi` mode, for example, to be used at the same time.
* Add support for generating CDI specifications for GDS and MOFED devices
* Ensure CDI specification is validated on save when generating a spec
* Rename `--discovery-mode` argument to `--mode` for `nvidia-ctk cdi generate`
* [libnvidia-container] Fix segfault on WSL2 systems
* [toolkit-container] Add `--cdi-enabled` flag to toolkit config
* [toolkit-container] Install `nvidia-ctk` from toolkit container
* [toolkit-container] Use installed `nvidia-ctk` path in NVIDIA Container Toolkit config
* [toolkit-container] Bump CUDA base images to 12.1.0
* [toolkit-container] Set `nvidia-ctk` path in the
* [toolkit-container] Add `cdi.k8s.io/*` to set of allowed annotations in containerd config
* [toolkit-container] Generate CDI specification for use in management containers
* [toolkit-container] Install experimental runtime as `nvidia-container-runtime.experimental` instead of `nvidia-container-runtime-experimental`
* [toolkit-container] Install and configure mode-specific runtimes for `cdi` and `legacy` modes
## v1.13.0-rc.1
* Include MIG-enabled devices as GPUs when generating CDI specification
* Fix missing NVML symbols when running `nvidia-ctk` on some platforms [#49]
* Add CDI spec generation for WSL2-based systems to `nvidia-ctk cdi generate` command
* Add `auto` mode to `nvidia-ctk cdi generate` command to automatically detect a WSL2-based system over a standard NVML-based system.
* Add mode-specific (`.cdi` and `.legacy`) NVIDIA Container Runtime binaries for use in the GPU Operator
* Discover all `gsp*.bin` GSP firmware files when generating CDI specification.
* Align `.deb` and `.rpm` release candidate package versions
* Remove `fedora35` packaging targets
* [libnvidia-container] Include all `gsp*.bin` firmware files if present
* [libnvidia-container] Align `.deb` and `.rpm` release candidate package versions
* [toolkit-container] Install `nvidia-container-toolkit-operator-extensions` package for mode-specific executables.
* [toolkit-container] Allow `nvidia-container-runtime.mode` to be set when configuring the NVIDIA Container Toolkit
## v1.12.0
* Promote `v1.12.0-rc.5` to `v1.12.0`
* Rename the `nvidia-ctk cdi generate` `--root` flag to `--driver-root` to better indicate its intent
* [libnvidia-container] Add nvcubins.bin to DriverStore components under WSL2
* [toolkit-container] Bump CUDA base images to 12.0.1
## v1.12.0-rc.5
* Fix bug where the `nvidia-ctk` path was not properly resolved. This caused failures to run containers when the runtime is configured in `csv` mode or if `NVIDIA_DRIVER_CAPABILITIES` includes `graphics` or `display` (e.g. `all`).
## v1.12.0-rc.4
* Generate a minimum CDI spec version for improved compatibility.
* Add `--device-name-strategy` options to the `nvidia-ctk cdi generate` command that can be used to control how device names are constructed.
* Set default for CDI device name generation to `index` to generate device names such as `nvidia.com/gpu=0` or `nvidia.com/gpu=1:0` by default.
## v1.12.0-rc.3
* Don't fail if by-path symlinks for DRM devices do not exist
* Replace the --json flag with a --format [json|yaml] flag for the nvidia-ctk cdi generate command
* Ensure that the CDI output folder is created if required
* When generating a CDI specification use a blank host path for devices to ensure compatibility with the v0.4.0 CDI specification
* Add injection of Wayland JSON files
* Add GSP firmware paths to generated CDI specification
* Add --root flag to nvidia-ctk cdi generate command
## v1.12.0-rc.2
* Inject Direct Rendering Manager (DRM) devices into a container using the NVIDIA Container Runtime
* Improve logging of errors from the NVIDIA Container Runtime
* Improve CDI specification generation to support rootless podman
* Use `nvidia-ctk cdi generate` to generate CDI specifications instead of `nvidia-ctk info generate-cdi`
* [libnvidia-container] Skip creation of existing files when these are already mounted
## v1.12.0-rc.1
* Add support for multiple Docker Swarm resources
* Improve injection of Vulkan configurations and libraries
* Add `nvidia-ctk info generate-cdi` command to generate CDI specifications for available devices
* [libnvidia-container] Include NVVM compiler library in compute libs
## v1.11.0
* Promote v1.11.0-rc.3 to v1.11.0
## v1.11.0-rc.3
* Build fedora35 packages
* Introduce an `nvidia-container-toolkit-base` package for better dependency management
* Fix removal of `nvidia-container-runtime-hook` on RPM-based systems
* Inject platform files into container on Tegra-based systems
* [toolkit-container] Update CUDA base images to 11.7.1
* [libnvidia-container] Preload libgcc_s.so.1 on arm64 systems
## v1.11.0-rc.2
* Allow `accept-nvidia-visible-devices-*` config options to be set by toolkit container
* [libnvidia-container] Fix bug where LDCache was not updated when the `--no-pivot-root` option was specified
## v1.11.0-rc.1
* Add discovery of GPUDirect Storage (`nvidia-fs*`) devices if the `NVIDIA_GDS` environment variable of the container is set to `enabled`
* Add discovery of MOFED Infiniband devices if the `NVIDIA_MOFED` environment variable of the container is set to `enabled`
* Fix bug in CSV mode where libraries listed as `sym` entries in mount specification are not added to the LDCache.
* Rename `nvidia-container-toolkit` executable to `nvidia-container-runtime-hook` and create `nvidia-container-toolkit` as a symlink to `nvidia-container-runtime-hook` instead.
* Add `nvidia-ctk runtime configure` command to configure the Docker config file (e.g. `/etc/docker/daemon.json`) for use with the NVIDIA Container Runtime.
## v1.10.0
* Promote v1.10.0-rc.3 to v1.10.0
## v1.10.0-rc.3
* Use default config instead of raising an error if config file cannot be found
* Ignore NVIDIA_REQUIRE_JETPACK* environment variables for requirement checks
* Fix bug in detection of Tegra systems where `/sys/devices/soc0/family` is ignored
* Fix bug where links to devices were detected as devices
* [libnvidia-container] Fix bug introduced when adding libcudadebugger.so to list of libraries
## v1.10.0-rc.2
* Add support for NVIDIA_REQUIRE_* checks for cuda version and arch to csv mode
* Switch to debug logging to reduce log verbosity
* Support logging to a log file requested on the command line
* Fix bug when launching containers with relative root path (e.g. using containerd)
* Allow low-level runtime path to be set explicitly as nvidia-container-runtime.runtimes option
* Fix failure to locate low-level runtime if PATH envvar is unset
* Replace experimental option for NVIDIA Container Runtime with nvidia-container-runtime.mode = csv option
* Use csv as default mode on Tegra systems without NVML
* Add --version flag to all CLIs
* [libnvidia-container] Bump libtirpc to 1.3.2
* [libnvidia-container] Fix bug when running host ldconfig using glibc compiled with a non-standard prefix
* [libnvidia-container] Add libcudadebugger.so to list of compute libraries
## v1.10.0-rc.1
* Include nvidia-ctk CLI in installed binaries
* Add experimental option to NVIDIA Container Runtime
## v1.9.0
* [libnvidia-container] Add additional check for Tegra in /sys/.../family file in CLI
* [libnvidia-container] Update jetpack-specific CLI option to only load Base CSV files by default
* [libnvidia-container] Fix bug (from 1.8.0) when mounting GSP firmware into containers without /lib to /usr/lib symlinks
* [libnvidia-container] Update nvml.h to CUDA 11.6.1 nvML_DEV 11.6.55
* [libnvidia-container] Update switch statement to include new brands from latest nvml.h
* [libnvidia-container] Process all --require flags on Jetson platforms
* [libnvidia-container] Fix long-standing issue with running ldconfig on Debian systems
## v1.8.1
* [libnvidia-container] Fix bug in determining cgroup root when running in nested containers
* [libnvidia-container] Fix permission issue when determining cgroup version
## v1.8.0
* Promote 1.8.0-rc.2-1 to 1.8.0
## v1.8.0-rc.2
* Remove support for building amazonlinux1 packages
## v1.8.0-rc.1
* [libnvidia-container] Add support for cgroupv2
* Release toolkit-container images from nvidia-container-toolkit repository
## v1.7.0
* Promote 1.7.0-rc.1-1 to 1.7.0
* Bump Golang version to 1.16.4
## v1.7.0-rc.1
* Specify containerd runtime type as string in config tools to remove dependency on containerd package
* Add supported-driver-capabilities config option to allow for a subset of all driver capabilities to be specified
## v1.6.0
* Promote 1.6.0-rc.3-1 to 1.6.0
* Fix unnecessary logging to stderr instead of configured nvidia-container-runtime log file
## v1.6.0-rc.3
* Add supported-driver-capabilities config option to the nvidia-container-toolkit
* Move OCI and command line checks for runtime to internal oci package
## v1.6.0-rc.2
* Use relative path to OCI specification file (config.json) if bundle path is not specified as an argument to the nvidia-container-runtime
## v1.6.0-rc.1
* Add AARCH64 package for Amazon Linux 2
* Include nvidia-container-runtime into nvidia-container-toolkit package
## v1.5.1
* Fix bug where Docker Swarm device selection is ignored if NVIDIA_VISIBLE_DEVICES is also set
* Improve unit testing by using require package and adding coverage reports
* Remove unneeded go dependencies by running go mod tidy
* Move contents of pkg directory to cmd for CLI tools
* Ensure make binary target explicitly sets GOOS
## v1.5.0
* Add dependence on libnvidia-container-tools >= 1.4.0
* Add golang check targets to Makefile
* Add Jenkinsfile definition for build targets
* Move docker.mk to docker folder
## v1.4.2
* Add dependence on libnvidia-container-tools >= 1.3.3
## v1.4.1
* Ignore NVIDIA_VISIBLE_DEVICES for containers with insufficient privileges
* Add dependence on libnvidia-container-tools >= 1.3.2
## v1.4.0
* Add 'compute' capability to list of defaults
* Add dependence on libnvidia-container-tools >= 1.3.1
## v1.3.0
* Promote 1.3.0-rc.2-1 to 1.3.0
* Add dependence on libnvidia-container-tools >= 1.3.0
## v1.3.0-rc.2
* 2c180947 Add more tests for new semantics with device list from volume mounts
* 7c003857 Refactor accepting device lists from volume mounts as a boolean
## v1.3.0-rc.1
* b50d86c1 Update build system to accept a TAG variable for things like rc.x
* fe65573b Add common CI tests for things like golint, gofmt, unit tests, etc.
* da6fbb34 Revert "Add ability to merge envars of the form NVIDIA_VISIBLE_DEVICES_*"
* a7fb3330 Flip build-all targets to run automatically on merge requests
* 8b248b66 Rename github.com/NVIDIA/container-toolkit to nvidia-container-toolkit
* da36874e Add new config options to pull device list from mounted files instead of ENVVAR
## v1.2.1
* 4e6e0ed4 Add 'ngx' to list of *all* driver capabilities
* 2f4af743 List config.toml as a config file in the RPM SPEC
## v1.2.0
* 8e0aab46 Fix repo listed in changelog for debian distributions
* 320bb6e4 Update dependence on libnvidia-container to 1.2.0
* 6cfc8097 Update package license to match source license
* e7dc3cbb Fix debian copyright file
* d3aee3e0 Add the 'ngx' driver capability
## v1.1.2
* c32237f3 Add support for parsing Linux Capabilities for older OCI specs
## v1.1.1
* d202aded Update dependence to libnvidia-container 1.1.1
## v1.1.0
* 4e4de762 Update build system to support multi-arch builds
* fcc1d116 Add support for MIG (Multi-Instance GPUs)
* d4ff0416 Add ability to merge envars of the form NVIDIA_VISIBLE_DEVICES_*
Where `REPO` is one of `stable` or `experimental`, `PACKAGE_REPO_ROOT` is the local path to the `libnvidia-container` repository checked out to the `gh-pages` branch, and `REFERENCE` is the git SHA that is to be released. If `REFERENCE` is not specified, `HEAD` is assumed.
This script performs the following basic functions:
* Pulls the package image defined by the `REFERENCE` git SHA from the staging registry,
* Copies the required packages to the package repository at `PACKAGE_REPO_ROOT/REPO`,
* Signs the packages using the specified GPG keys
While the last two steps are performed, commits are added to the package repository. These can then be pushed to the relevant repository.
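As a sketch, assuming the arguments are passed positionally in the order described above (the script name `release-packages.sh` is hypothetical; use the actual release script in this repository):

```bash
# Hypothetical invocation: publish the packages built from commit 0123abc to the stable repository.
./release-packages.sh stable ../libnvidia-container 0123abc
```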
The NVIDIA Container Toolkit allows users to build and run GPU accelerated containers. The toolkit includes a container runtime [library](https://github.com/NVIDIA/libnvidia-container) and utilities to automatically configure containers to leverage NVIDIA GPUs.
Product documentation including an architecture overview, platform support, and installation and usage guides can be found in the [documentation repository](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html).
## Getting Started
**Make sure you have installed the [NVIDIA driver](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#nvidia-drivers) for your Linux Distribution**
**Note that you do not need to install the CUDA Toolkit on the host system, but the NVIDIA driver needs to be installed**
For instructions on getting started with the NVIDIA Container Toolkit, refer to the [installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installation-guide).
## Usage
The [user guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html) provides information on the configuration and command line options available when running GPU containers with Docker.
## Issues and Contributing
[Checkout the Contributing document!](CONTRIBUTING.md)
* Please let us know by [filing a new issue](https://github.com/NVIDIA/nvidia-container-toolkit/issues/new)
* You can contribute by creating a [merge request](https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/new) to our public GitLab repository
log.Panicf("Invalid value for config option '%v'; %v (supported: %v)\n", configName, config.SupportedDriverCapabilities, allSupportedDriverCapabilities.String())
}
return config, nil
}
// getConfigOption returns the toml config option associated with the
log.Panicln("invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead.")
The NVIDIA Container Runtime is a shim for OCI-compliant low-level runtimes such as [runc](https://github.com/opencontainers/runc). When a `create` command is detected, the incoming [OCI runtime specification](https://github.com/opencontainers/runtime-spec) is modified in place and the command is forwarded to the low-level runtime.
## Configuration
The NVIDIA Container Runtime uses file-based configuration, with the config stored in `/etc/nvidia-container-runtime/config.toml`. The `/etc` path can be overridden using the `XDG_CONFIG_HOME` environment variable with the `${XDG_CONFIG_HOME}/nvidia-container-runtime/config.toml` file used instead if this environment variable is set.
This config file may contain options for other components of the NVIDIA container stack; for the NVIDIA Container Runtime, the relevant config section is `nvidia-container-runtime`.
### Logging
The `log-level` config option (default: `"info"`) specifies the log level to use, and the `debug` option, if set, specifies a log file to which logs for the NVIDIA Container Runtime are written.
In addition to this, the NVIDIA Container Runtime considers the values of the `--log` and `--log-format` flags that may be passed to it by a container runtime such as docker or containerd. If the `--debug` flag is present, the log level specified in the config file is overridden to `"debug"`.
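For reference, a logging configuration along these lines could be used; the log file path is illustrative, and the `debug` and `log-level` options live in the `nvidia-container-runtime` section of the config:

```toml
[nvidia-container-runtime]
# Log level for the NVIDIA Container Runtime.
log-level = "info"
# Optional: also write runtime logs to this file.
debug = "/var/log/nvidia-container-runtime.log"
```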
### Low-level Runtime Path
The `runtimes` config option allows the low-level runtime to be specified. The first entry in this list that is an existing executable file is used as the low-level runtime. If an entry is not a path, the `PATH` is searched for a matching executable; if it is a path, that path is checked directly.
The default value for this setting is:
```toml
runtimes = [
    "docker-runc",
    "runc",
]
```
and if, for example, `crun` is to be used instead this can be changed to:
```toml
runtimes = [
    "crun",
]
```
### Runtime Mode
The `mode` config option (default `"auto"`) controls the high-level behaviour of the runtime.
#### Auto Mode
When `mode` is set to `"auto"`, the runtime employs heuristics to determine which mode to use based on, for example, the platform where the runtime is being run.
#### Legacy Mode
When `mode` is set to `"legacy"`, the NVIDIA Container Runtime adds a [`prestart` hook](https://github.com/opencontainers/runtime-spec/blob/master/config.md#prestart) to the incoming OCI specification that invokes the NVIDIA Container Runtime Hook for all containers created. This hook checks whether NVIDIA devices are requested and ensures GPU access is configured using the `nvidia-container-cli` from the [libnvidia-container](https://github.com/NVIDIA/libnvidia-container) project.
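For illustration, the hook entry injected into the OCI runtime specification has roughly the following shape; the executable path shown is an assumption and depends on where the toolkit is installed:

```json
{
    "hooks": {
        "prestart": [
            {
                "path": "/usr/bin/nvidia-container-runtime-hook",
                "args": ["nvidia-container-runtime-hook", "prestart"]
            }
        ]
    }
}
```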
#### CSV Mode
When `mode` is set to `"csv"`, CSV files at `/etc/nvidia-container-runtime/host-files-for-container.d` define the devices and mounts that are to be injected into a container when it is created. The search path for the files can be overridden by modifying the `nvidia-container-runtime.modes.csv.mount-spec-path` in the config as below:
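A sketch of such an override, using the default mount-spec path purely for illustration:

```toml
[nvidia-container-runtime]
  [nvidia-container-runtime.modes.csv]
    mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
```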
This mode is primarily targeted at Tegra-based systems without NVML available.
### Notes on using the docker CLI
Note that only the `"legacy"` NVIDIA Container Runtime mode is directly compatible with the `--gpus` flag implemented by the `docker` CLI (assuming the NVIDIA Container Runtime is not used). The reason for this is that `docker` inserts the same NVIDIA Container Runtime Hook into the OCI runtime specification.
If a different mode is explicitly set or detected, the NVIDIA Container Runtime Hook will raise the following error when `--gpus` is set:
```
$ docker run --rm --gpus all ubuntu:18.04
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'csv'
invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime instead.: unknown.
```
Here, the NVIDIA Container Runtime must be used explicitly. The recommended way to do this is to specify the `--runtime=nvidia` command line argument as part of the `docker run` command as follows:
```
$ docker run --rm --gpus all --runtime=nvidia ubuntu:18.04
```
Alternatively the NVIDIA Container Runtime can be set as the default runtime for docker. This can be done by modifying the `/etc/docker/daemon.json` file as follows:
```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```
## Environment variables (OCI spec)
Each environment variable maps to a command-line argument for `nvidia-container-cli` from [libnvidia-container](https://github.com/NVIDIA/libnvidia-container).
These variables are already set in our [official CUDA images](https://hub.docker.com/r/nvidia/cuda/).
### `NVIDIA_VISIBLE_DEVICES`
This variable controls which GPUs will be made accessible inside the container.
#### Possible values
* `0,1,2`, `GPU-fef8089b` …: a comma-separated list of GPU UUID(s) or index(es).
* `all`: all GPUs will be accessible; this is the default value in our container images.
* `none`: no GPU will be accessible, but driver capabilities will be enabled.
* `void` or *empty* or *unset*: `nvidia-container-runtime` will have the same behavior as `runc`.
**Note**: When running on a MIG capable device, the following values will also be available:
* `0:0,0:1,1:0`, `MIG-GPU-fef8089b/0/1` …: a comma-separated list of MIG Device UUID(s) or index(es).
Where the MIG device indices have the form `<GPU Device Index>:<MIG Device Index>` as seen in the example output:
By default, all commands output to `STDOUT`, but specifying the `--output` flag writes the config to the specified file.
### Generate CDI specifications
The [Container Device Interface (CDI)](https://tags.cncf.io/container-device-interface) provides
a vendor-agnostic mechanism to make arbitrary devices accessible in containerized environments. To allow NVIDIA devices to be
used in these environments, the NVIDIA Container Toolkit CLI includes functionality to generate a CDI specification for the
available NVIDIA GPUs in a system.
In order to generate the CDI specification for the available devices, run the following command:
```bash
nvidia-ctk cdi generate
```
By default the specification is printed to STDOUT; a filename can be specified using the `--output` flag.
The specification will contain device entries as follows (where applicable):
* An `nvidia.com/gpu=gpu{INDEX}` device for each non-MIG-enabled full GPU in the system
* An `nvidia.com/gpu=mig{GPU_INDEX}:{MIG_INDEX}` device for each MIG-device in the system
* A special device called `nvidia.com/gpu=all` which represents all available devices.
For example, to generate the CDI specification in the default location where CDI-enabled tools such as `podman`, `containerd`, `cri-o`, or the NVIDIA Container Runtime can be configured to load it, the following command can be run:
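For example, assuming `/etc/cdi` is one of the locations read by the CDI-enabled tools on the system:

```bash
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```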
Usage:"Generate CDI specifications for use with CDI-enabled runtimes",
Before:func(c*cli.Context)error{
returnm.validateFlags(c,&opts)
},
Action:func(c*cli.Context)error{
returnm.run(c,&opts)
},
}
c.Flags=[]cli.Flag{
&cli.StringFlag{
Name:"output",
Usage:"Specify the file to output the generated CDI specification to. If this is '' the specification is output to STDOUT",
Destination:&opts.output,
},
&cli.StringFlag{
Name:"format",
Usage:"The output format for the generated spec [json | yaml]. This overrides the format defined by the output file extension (if specified).",
Value:spec.FormatYAML,
Destination:&opts.format,
},
&cli.StringFlag{
Name:"mode",
Aliases:[]string{"discovery-mode"},
Usage:"The mode to use when discovering the available entities. One of [auto | nvml | wsl]. If mode is set to 'auto' the mode will be determined based on the system configuration.",
Value:nvcdi.ModeAuto,
Destination:&opts.mode,
},
&cli.StringFlag{
Name:"dev-root",
Usage:"Specify the root where `/dev` is located. If this is not specified, the driver-root is assumed.",
Destination:&opts.devRoot,
},
&cli.StringFlag{
Name:"device-name-strategy",
Usage:"Specify the strategy for generating device names. One of [index | uuid | type-index]",
Value:nvcdi.DeviceNameStrategyIndex,
Destination:&opts.deviceNameStrategy,
},
&cli.StringFlag{
Name:"driver-root",
Usage:"Specify the NVIDIA GPU driver root to use when discovering the entities that should be included in the CDI specification.",
Destination:&opts.driverRoot,
},
&cli.StringSliceFlag{
Name:"library-search-path",
Usage:"Specify the path to search for libraries when discovering the entities that should be included in the CDI specification.\n\tNote: This option only applies to CSV mode.",
Destination:&opts.librarySearchPaths,
},
&cli.StringFlag{
Name:"nvidia-ctk-path",
Usage:"Specify the path to use for the nvidia-ctk in the generated CDI specification. If this is left empty, the path will be searched.",
Destination:&opts.nvidiaCTKPath,
},
&cli.StringFlag{
Name:"vendor",
Aliases:[]string{"cdi-vendor"},
Usage:"the vendor string to use for the generated CDI specification.",
Value:"nvidia.com",
Destination:&opts.vendor,
},
&cli.StringFlag{
Name:"class",
Aliases:[]string{"cdi-class"},
Usage:"the class string to use for the generated CDI specification.",
Value:"gpu",
Destination:&opts.class,
},
&cli.StringSliceFlag{
Name:"csv.file",
Usage:"The path to the list of CSV files to use when generating the CDI specification in CSV mode.",
Usage:"Interact with the NVIDIA Container Toolkit configuration",
Action:func(ctx*cli.Context)error{
returnrun(ctx,&opts)
},
}
c.Flags=[]cli.Flag{
&cli.StringFlag{
Name:"config-file",
Aliases:[]string{"config","c"},
Usage:"Specify the config file to modify.",
Value:config.GetConfigFilePath(),
Destination:&opts.Config,
},
&cli.StringSliceFlag{
Name:"set",
Usage:"Set a config value using the pattern key=value. If value is empty, this is equivalent to specifying the same key in unset. This flag can be specified multiple times",
Destination:&opts.sets,
},
&cli.BoolFlag{
Name:"in-place",
Aliases:[]string{"i"},
Usage:"Modify the config file in-place",
Destination:&opts.InPlace,
},
&cli.StringFlag{
Name:"output",
Aliases:[]string{"o"},
Usage:"Specify the output file to write to; If not specified, the output is written to stdout",
Destination:&opts.Output,
},
}
c.Subcommands=[]*cli.Command{
createdefault.NewCommand(m.logger),
}
return&c
}
funcrun(c*cli.Context,opts*options)error{
cfgToml,err:=config.New(
config.WithConfigFile(opts.Config),
)
iferr!=nil{
returnfmt.Errorf("unable to create config: %v",err)
// config defines the options that can be set for the CLI through config files,
// environment variables, or command line config
type config struct {
    dryRun bool
    runtime string
    configFilePath string
    mode string
    hookFilePath string
    nvidiaRuntime struct {
        name string
        path string
        hookPath string
        setAsDefault bool
    }
    // cdi-specific options
    cdi struct {
        enabled bool
    }
}
func (m command) build() *cli.Command {
    // Create a config struct to hold the parsed environment variables or command line flags
    config := config{}
    // Create the 'configure' command
    configure := cli.Command{
        Name: "configure",
        Usage: "Add a runtime to the specified container engine",
        Before: func(c *cli.Context) error {
            return m.validateFlags(c, &config)
        },
        Action: func(c *cli.Context) error {
            return m.configureWrapper(c, &config)
        },
    }
    configure.Flags = []cli.Flag{
        &cli.BoolFlag{
            Name: "dry-run",
            Usage: "update the runtime configuration as required but don't write changes to disk",
            Destination: &config.dryRun,
        },
        &cli.StringFlag{
            Name: "runtime",
            Usage: "the target runtime engine; one of [containerd, crio, docker]",
            Value: defaultRuntime,
            Destination: &config.runtime,
        },
        &cli.StringFlag{
            Name: "config",
            Usage: "path to the config file for the target runtime",
            Destination: &config.configFilePath,
        },
        &cli.StringFlag{
            Name: "config-mode",
            Usage: "the config mode for runtimes that support multiple configuration mechanisms",
            Destination: &config.mode,
        },
        &cli.StringFlag{
            Name: "oci-hook-path",
            Usage: "the path to the OCI runtime hook to create if --config-mode=oci-hook is specified. If no path is specified, the generated hook is output to STDOUT.\n\tNote: The use of OCI hooks is deprecated.",
            Destination: &config.hookFilePath,
        },
        &cli.StringFlag{
            Name: "nvidia-runtime-name",
            Usage: "specify the name of the NVIDIA runtime that will be added",
            Value: defaultNVIDIARuntimeName,
            Destination: &config.nvidiaRuntime.name,
        },
        &cli.StringFlag{
            Name: "nvidia-runtime-path",
            Aliases: []string{"runtime-path"},
            Usage: "specify the path to the NVIDIA runtime executable",
            Value: defaultNVIDIARuntimeExecutable,
            Destination: &config.nvidiaRuntime.path,
        },
        &cli.StringFlag{
            Name: "nvidia-runtime-hook-path",
            Usage: "specify the path to the NVIDIA Container Runtime hook executable",
            Value: defaultNVIDIARuntimeHookExpecutablePath,
            Destination: &config.nvidiaRuntime.hookPath,
        },
        &cli.BoolFlag{
            Name: "nvidia-set-as-default",
            Aliases: []string{"set-as-default"},
            Usage: "set the NVIDIA runtime as the default runtime",
Usage:"A utility to create symlinks to possible /dev/nv* devices in /dev/char",
Before:func(c*cli.Context)error{
returnm.validateFlags(c,&cfg)
},
Action:func(c*cli.Context)error{
returnm.run(c,&cfg)
},
}
c.Flags=[]cli.Flag{
&cli.StringFlag{
Name:"dev-char-path",
Usage:"The path at which the symlinks will be created. Symlinks will be created as `DEV_CHAR`/MAJOR:MINOR where MAJOR and MINOR are the major and minor numbers of a corresponding device node.",
Value:defaultDevCharPath,
Destination:&cfg.devCharPath,
EnvVars:[]string{"DEV_CHAR_PATH"},
},
&cli.StringFlag{
Name:"driver-root",
Usage:"The path to the driver root. `DRIVER_ROOT`/dev is searched for NVIDIA device nodes.",
Value:"/",
Destination:&cfg.driverRoot,
EnvVars:[]string{"DRIVER_ROOT"},
},
&cli.BoolFlag{
Name:"watch",
Usage:"If set, the command will watch for changes to the driver root and recreate the symlinks when changes are detected.",
Value:false,
Destination:&cfg.watch,
EnvVars:[]string{"WATCH"},
},
&cli.BoolFlag{
Name:"create-all",
Usage:"Create all possible /dev/char symlinks instead of limiting these to existing device nodes.",
Destination:&cfg.createAll,
EnvVars:[]string{"CREATE_ALL"},
},
&cli.BoolFlag{
Name:"load-kernel-modules",
Usage:"Load the NVIDIA kernel modules before creating symlinks. This is only applicable when --create-all is set.",
Destination:&cfg.loadKernelModules,
EnvVars:[]string{"LOAD_KERNEL_MODULES"},
},
&cli.BoolFlag{
Name:"create-device-nodes",
Usage:"Create the NVIDIA control device nodes in the driver root if they do not exist. This is only applicable when --create-all is set",
Destination:&cfg.createDeviceNodes,
EnvVars:[]string{"CREATE_DEVICE_NODES"},
},
&cli.BoolFlag{
Name:"dry-run",
Usage:"If set, the command will not create any symlinks.",
# Hook for Project Atomic's fork of Docker: https://github.com/projectatomic/docker/tree/docker-1.13.1-rhel#add-dockerhooks-exec-custom-hooks-for-prestartpoststop-containerspatch
# This might not be useful on Amazon Linux, but it's simpler to keep the RHEL
# and Amazon Linux packages identical.
COPY oci-nvidia-hook $DIST_DIR/oci-nvidia-hook
# Hook for libpod/CRI-O: https://github.com/containers/libpod/blob/v0.8.5/pkg/hooks/docs/oci-hooks.5.md
// newCreateDRMByPathSymlinks creates a discoverer for a hook to create the by-path symlinks for DRM devices discovered by the specified devices discoverer