This change adds a create-soname-symlinks hook that can be used to ensure
that the soname symlinks for injected libraries exist in a container.
This is done by calling ldconfig -n -N for the directories containing the injected
libraries.
This also ensures that libcuda.so is present in the ldcache when the update-ldcache
hook is run.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since the nvidia-container-tools and libnvidia-container* packages
are now released at the same time with the same version, a more
restrictive version constraint makes sense. Here we specifically require an
equal version for the nvidia-container-toolkit* packages and the
libnvidia-container* packages.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an NVIDIA_CTK_LIBCUDA_DIR envvar to
a generated CDI specification. This reports where the `libcuda.so.*`
libraries will be injected into the container.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that a symlink from /work/nvidia-toolkit to
/work/nvidia-ctk-installer exists. This allows GPU Operator versions
that override the entrypoint and assume nvidia-toolkit as the
original entrypoint to continue working.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change removes the NGC-DL-CONTAINER-LICENSE (since this
is not available in the distroless images) and includes the
repo's Apache LICENSE file in the image.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds support for explicitly specifying the path
to the config file through an environment variable.
The NVIDIA_CTK_CONFIG_FILE_PATH variable is used for both the nvidia-container-runtime
and nvidia-container-runtime-hook.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
We now release all images with vX.Y.Z and vX.Y.Z-packaging tags.
This change updates the gitlab CI to allow for this.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The gated modifiers used to add support for GDS, MOFED, and CUDA Forward Compatibility
only check the NVIDIA_VISIBLE_DEVICES envvar to determine whether GPUs are requested
and modifications should be made. This means that use cases where volume mounts are
used to request devices are not supported.
This change ensures that device extraction is consistent for all use cases.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change switches to using a single image for the NVIDIA Container Toolkit container.
Here the contents of the architecture-specific deb and rpm packages are extracted
to a known root. These contents can then be installed using the installation
mechanism, which has been updated to detect the source root based on the packaging type.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change removes docker-runc as the highest priority
default candidate for the low-level runtimes supported by
the nvidia-container-runtime.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change includes annotation devices in CUDA.VisibleDevices
with the highest priority. This allows for the CDI device
request extraction to be consistent across all request mechanisms.
Note that this does change behaviour in the following ways:
1. Annotations are considered when resolving the runtime mode.
2. Incorrectly formed device names in annotations are no longer treated as an error.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Following the refactoring of device request extraction, we can
now make CDI device requests consistent with other methods.
This change moves to using image.VisibleDevices instead of
separate calls to CDIDevicesFromMounts and VisibleDevicesFromEnvVar.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the image.CUDA type to also extract CDI
device requests. These are only relevant IF CDI prefixes are
specifically set.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows device IDs to be specified in the GetSpec API.
This simplifies cases where CDI specs are being generated for specific
devices by ID.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change consolidates the logic for determining requested devices
from the container image. The logic for this has been integrated into
the image.CUDA type so that multiple implementations are not required.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
Automatic regeneration of /var/run/cdi/nvidia.yaml
New units:
• nvidia-cdi-refresh.service – one-shot wrapper for
nvidia-ctk cdi generate (adds sleep + required caps).
• nvidia-cdi-refresh.path – fires on driver install/upgrade via
modules.dep.bin changes.
Packaging
• RPM %post reloads systemd and enables the path unit on fresh
installs.
• DEB postinst does the same (configure, skip on upgrade).
Result: CDI spec is always up to date
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
On some RPM-based platforms, the path of the Vulkan ICD file
includes an architecture-specific infix to distinguish it from
other architectures (most notably x86_64 vs i686). This change
attempts to discover the arch-specific ICD file in addition to
the standard nvidia_icd.json.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
If required, this hook creates a modified params file (with ModifyDeviceFiles: 0) in a tmpfs
and mounts this over /proc/driver/nvidia/params.
This prevents device node creation when running tools such as nvidia-smi. In general the
creation of these devices is cosmetic as a container does not have the required cgroup
access for the devices.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The GetDriverStorePaths function is implemented so as to
remove duplicate driver store paths which means that the
additional deduplication (which had a bug) can be removed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds the ability to disable specific (or all) CDI hooks to
both the nvidia-ctk cdi generate command and the nvcdi API.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since we now use pr-copy-bot, we no longer run image builds
from forks, and as such the required secrets for pushing to
GHCR are present.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows hooks to be configured with debug logging. This
is currently only enabled for the hooks generated from the runtime.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds support for feature flags to the nvcdi API.
A feature flag to disable nvsandboxutils is also added to allow
more flexibility in cases where this library causes issues.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change reenables nvsandboxutils for driver discovery. This
was disabled due to an error in a specific driver version (v565)
so as to not block the release of the DRA driver for ComputeDomains.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
To allow for CDI hooks to be added gradually we provide a generic no-op hook
for unrecognised subcommands. This will log a warning instead of erroring out.
An unsupported hook could be the result of a CDI specification referring to a
new hook that is not yet supported by an older NVIDIA Container Toolkit
version, or a hook that has been removed in a newer version.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change uses the reexec package to run the update of the
ldcache in a container in a process with isolated namespaces.
Since the hook is invoked as a createContainer hook, these
namespaces are cloned from the container's namespaces.
In the reexec handler, we further isolate the proc filesystem,
mount the host ldconfig to a tmpfs, and pivot into the container's
root.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-container-runtime.modes.legacy.cuda-compat-mode
config option. This can be set to one of four values:
* ldconfig (default): the --cuda-compat-mode=ldconfig flag is passed to the nvidia-container-cli
* mount: the --cuda-compat-mode=mount flag is passed to the nvidia-container-cli
* disabled: the --cuda-compat-mode=disabled flag is passed to the nvidia-container-cli
* hook: the --cuda-compat-mode=disabled flag is passed to the nvidia-container-cli AND the
enable-cuda-compat hook is used to provide forward compatibility.
Note that the disable-cuda-compat-lib-hook feature flag will prevent the enable-cuda-compat
hook from being used. This change also means that the allow-cuda-compat-libs-from-container
feature flag no longer has any effect.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
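As an illustrative sketch only, assuming the dotted option name maps directly
onto the config.toml table layout, the default could be expressed as:
    [nvidia-container-runtime.modes.legacy]
    # one of "ldconfig" (default), "mount", "disabled", or "hook", as described above
    cuda-compat-mode = "ldconfig"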
This change updates github.com/NVIDIA/go-nvlib from v0.7.1 to v0.7.2
to allow Thor systems to be detected as Tegra-based. This fixes
automatic mode detection on these systems.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This ensures that mount propagation is set to rprivate for
mounts from the host into the container. This aligns with the
default in docker.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
When constructing a list of discoverers using discover.Merge we
explicitly skip `nil` discoverers to simplify usage as we don't
have to explicitly check validity when processing the discoverers
in the list.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since we explicitly check for the architecture of the
libraries in the ldcache, we need to also check the architecture
flag against the ARM constants.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since this is running in a container, the contents of the
/etc/nvidia-container-runtime/config.toml file are equivalent to the
default config. This change makes it explicit.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds support for specifying the container runtime
executable path. This can be used if, for example, there are
two containerd or crio executables and a specific one must be used.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change switches to using the WithCache decorator for
mounts instead of keeping track of a cache locally.
This addresses a race condition when using the mounts structure.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates how the ldconfig arguments are constructed. This
makes the update-ldcache hook more robust and ensures that folders are
specified last, if at all.
Checks are also included for empty container roots at the start of the
hook to simplify later checks.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Parsing positional arguments requires additional processing
instead of relying on named flags. This change switches to
using a named flag for specifying the toolkit installation directory.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The GPU Operator no longer sets the RUNTIME_ARGS environment variable
to control the runtime and instead sets general environment variables.
This change removes support for setting RUNTIME_ARGS in the CLI as these
settings have no effect.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The urfave update to v2.27.6 fixes the behaviour when disabling a separator
for repeated StringSliceFlags. This change updates the nvidia-ctk config
command to allow list options to be specified as comma-separated values.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The signing container need not be based on a legacy centos version.
This change updates the signing container to be centos:stream9 based.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Management containers don't generally need forward compatibility.
We disable the enable-cuda-compat hook so that it is not included in the
generated CDI specifications.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds support to the nvcdi package to opt out of specific hooks.
Currently only the `enable-cuda-compat` hook is supported. This allows clients to
generate a CDI spec that is compatible with older nvidia-cdi-hook CLIs.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds the enable-cuda-compat hook to the incoming OCI runtime spec
if the allow-cuda-compat-libs-from-container feature flag is not enabled.
An update-ldcache hook is also injected to ensure that the required folders
are processed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-cdi-hook enable-cuda-compat hook that checks the
container for cuda compat libs and updates /etc/ld.so.conf.d to include their
parent folder if their driver major version is sufficient.
This allows CUDA Forward Compatibility to be used when this is not available
through libnvidia-container.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This allows the NVIDIA Container Toolkit to ignore IMEX channel requests
through the NVIDIA_IMEX_CHANNELS envvar or volume mounts and ensures that
the NVIDIA Container Toolkit cannot be used to provide out-of-band access
to an IMEX channel by simply specifying an environment variable, possibly
bypassing other checks by an orchestration system such as kubernetes.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Repeated calls to nvsandboxutils.Init and Shutdown are causing
segmentation violations. Here we disable nvsandboxutils unless it is
explicitly enabled.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the cdi modifier also removes the NVIDIA
Container Runtime Hook from the incoming spec. This aligns with what is
done for CSV modifications and prevents an error when starting the
container.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an EnableCDI method to the container engine config files and
updates the 'nvidia-ctk runtime configure' command to use this new method.
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
This change adds the nvidia-imex and nvidia-imex-ctl binaries to
the list of driver binaries that are searched when using CDI.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an allow-cuda-compat-libs-from-container feature flag
to the NVIDIA Container Toolkit config. This allows a user to opt-in
to the previous default behaviour of overriding certain driver
libraries with CUDA compat libraries from the container.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
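A minimal sketch of opting in, assuming the flag is exposed under the [features]
table of the NVIDIA Container Toolkit config.toml:
    [features]
    # opt back in to overriding driver libraries with CUDA compat libraries
    # from the container
    allow-cuda-compat-libs-from-container = true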
This change passes the --no-cntlibs argument to the nvidia-container-cli
from the nvidia-container-runtime-hook to disable overwriting host
drivers with the compat libs from a container being started.
Note that this may be a breaking change for some applications.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
In CSV mode the CSV files at /etc/nvidia-container-runtime/host-files-for-container.d/
should be the source of truth for container modifications. This change skips graphics
modifications to a container. This prevents conflicts when handling files such as
vulkan icd files which are already defined in the CSV file.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change moves the containerized installer from nvidia-toolkit to
cmd/nvidia-ctk-installer to allow for its use in CI.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds support for containerd configs with version=3.
From the perspective of the runtime configuration the contents of the
config are the same. This means that we just have to load the new
version and ensure that this is propagated to the generated config.
Note that v3 config also requires a switch to the 'io.containerd.cri.v1.runtime'
CRI runtime plugin. See:
https://github.com/containerd/containerd/blob/v2.0.0/docs/PLUGINS.md
https://github.com/containerd/containerd/issues/10132
Note that we still use a default config of version=2 since we need to
ensure compatibility with older containerd versions (1.6.x and 1.7.x).
Signed-off-by: Sam Lockart <sam.lockart@zendesk.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
This removes the untested watch option from the
nvidia-ctk system create-dev-char-symlinks command.
This also removes the direct dependency on fsnotify.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an imex mode to CDI spec generation. This mode
generates CDI specifications for existing IMEX channels. By default these
devices have the fully qualified CDI device names:
nvidia.com/imex-channel=<ID>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change only allows host-relative LDConfig paths.
An allow-ldconfig-from-container feature flag is added to allow this
default behaviour to be changed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
For legacy images (images with a CUDA_VERSION set but no CUDA_REQUIRES set), the
default behaviour for device envvars is to treat non-existence as all.
This change ensures that the NVIDIA_IMEX_CHANNELS envvar is not treated in the same
way, instead returning no devices if the envvar is not set.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This fix ensures that the default config file path for the nvidia-ctk runtime configure
command is set consistently.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that we fall back to the previous behaviour of
reading the existing config from the specified config file if extracting
the current config from the command line fails. This fixes use cases where
the containerd / crio executables are not available.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds basic toolkit installation unit tests. This required
that the source for files be specified when installing to allow for
a testdata folder to be used.
This replaces the currently unused shell-based tests in /test/container.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the create-symlink hook to be equivalent to
ln -f -s target link
This ensures that links are updated even if they exist in the container
being run.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change fixes a bug when using just-in-time CDI spec generation for the
NVIDIA Container Runtime for specific devices (i.e. not 'all').
Instead of unconditionally using the default nvsandboxutils library -- leading
to errors due to undefined symbols -- we check whether the library can be
properly initialised before continuing.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the create-symlinks hook to always evaluate
link paths in the container's root filesystem. In addition the
executable is updated to return an error if a link could not
be created.
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
This change ensures that we always treat the link path as a path
relative to the container root. Without this change, relative paths
in link paths would result in links being created relative to the
current working directory where the hook is executed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The hostRoot argument is always empty and not applicable to
how links are specified.
Links are specified by the paths in the container filesystem and as such
the only transform required to change the root is a join of the filepath.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since hostRoot is always the empty string and we are changing the root in the
target path to /, the call to changeRoot is redundant.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the ldcache locator to read the ldcache at construction
and use these contents to perform future lookups against. Each of the cache
entries are resolved and lookups return the resolved target.
Assuming a symlink chain: libcuda.so -> libcuda.so.1 -> libcuda.so.VERSION, this
means that libcuda.so.VERSION will be returned for any of the following inputs:
libcuda.so, libcuda.so.1, libcuda.so.*.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a test for locating libcuda as a driver library.
This includes a failing test on a system where libcuda.so.1 is in
the ldcache, but not at one of the predefined library search paths.
A testdata folder with sample root filesystems is included to test
various combinations.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since we use a map to keep track of the elements of a symlink chain
the construction of the final list of located elements is not stable.
This change constructs the output as it is being discovered and as such
such maintains the original ordering.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change removes support for specifying csv-filenames when
calling the create-symlinks hook. This is no longer required
as tegra-based systems generate hooks with `--link` arguments.
This also allows the hook to better serve as a reference implementation
for upstream projects wanting to implement a set of standard CDI hooks.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the toolkit works with older
versions of the GPU Operator where runtime-specific envvars are
used to set options such as the config file location.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows IMEX channels to be requested using the
volume mount mechanism.
A mount from /dev/null to /var/run/nvidia-container-devices/imex/{{ .ChannelID }}
is equivalent to including {{ .ChannelID }} in the NVIDIA_IMEX_CHANNELS
environment variable.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change enables opt-in (off-by-default) features to be opted into.
These features can be toggled by name by specifying the (repeated)
--opt-in-features command line argument or as a comma-separated list
in the NVIDIA_CONTAINER_TOOLKIT_OPT_IN_FEATURES environment variable.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change REMOVES the ability to set opt-in features
(e.g. GDS, MOFED, GDRCOPY) in the config file. The existing
per-container envvars are unaffected.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
add default runtime binary path to runtimes field of toolkit config toml
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
[no-relnote] Get low-level runtimes consistently
We ensure that we use the same low-level runtimes regardless
of the runtime engine being configured. This ensures consistent
behaviour.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
address review comment
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
This change aligns the creation of symlinks under CDI with
the implementation in libnvidia-container. If the driver libraries
are present, the following symlinks are created:
* {{ .LibRoot }}/libcuda.so -> libcuda.so.1
* {{ .LibRoot }}/libnvidia-opticalflow.so -> libnvidia-opticalflow.so.1
* {{ .LibRoot }}/libGLX_indirect.so.0 -> libGLX_nvidia.so.{{ .Version }}
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change converts the toolkit installation logic to a go package
and invokes this installation over the go API instead of starting
this executable.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change refactors the toml config file handling for runtimes
such as containerd or crio. A toml.Loader is introduced that
encapsulates loading the required file.
This can be extended to allow other mechanisms for loading
the current config.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since we expect .so.1 symlinks to be created by ldconfig, we don't
explicitly request these. This change removes the creation of
a libnvidia-allocator.so.1 -> libnvidia-allocator.so.RM_VERSION symlink
through the create-symlinks hook.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds the vdpau subfolder to the paths searched
for driver libraries. This allows the libvdpau_nvidia.so.RM_VERSION
library to also be discovered.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a safe pointer conversion to fix an
incompatible C pointer conversion, which caused build failures on some
architectures.
Signed-off-by: Sananya Majumder <sananyam@nvidia.com>
This change adds a new discoverer for Sandboxutils to report the file
system paths and associated symbolic links using GetGpuResource and
GetFileContent APIs. Both GPU and MIG devices are supported. If the
Sandboxutils discoverer fails, the NVML discoverer is used to report
the file system information.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Sananya Majumder <sananyam@nvidia.com>
This change includes the usage of Sandboxutils GetDriverVersion API to
retrieve the CUDA driver version. If the library is not available on the
system or the API call fails for some other reason, it will fall back to
the NVML API to return the driver version.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Sananya Majumder <sananyam@nvidia.com>
This change adds manual wrappers around the generated bindings to make
them into more user-friendly APIs for the caller. Some helper functions
are also added.
The APIs that are currently present in the library and implemented here
are:
nvSandboxUtilsInit
nvSandboxUtilsShutdown
nvSandboxUtilsGetDriverVersion
nvSandboxUtilsGetGpuResource
nvSandboxUtilsGetFileContent
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Sananya Majumder <sananyam@nvidia.com>
This change adds the internal bindings for Sandboxutils, some of which
have been automatically generated with the help of c-for-go. The format
followed is similar to what is used in go-nvml. These would need to be
regenerated when the header file is modified and new APIs are added.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Sananya Majumder <sananyam@nvidia.com>
This change adds a script and related files to generate the internal
bindings for Sandboxutils library with the help of c-for-go.
This can be used to update the bindings when the header file is modified
with reference to how they are generated with the Makefile in go-nvml.
Run: ./update-bindings.sh
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Sananya Majumder <sananyam@nvidia.com>
With the move to dependabot to update the libnvidia-container
submodule, it is no longer required to have checks in place that
ensure that this is up to date when building.
In addition, when re-building old commits in CI, for example, we explicitly
do NOT want to have this check.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change enables opt-in (off-by-default) features to be opted into.
These features can be toggled by name by specifying the (repeated)
--opt-in-feature command line argument or as a comma-separated list
in the NVIDIA_CONTAINER_TOOLKIT_OPT_IN_FEATURES environment variable.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an include-persistenced-socket flag to the
nvidia-ctk cdi generate command that ensures that a generated
specification includes the nvidia-persistenced socket if present on
the host.
Note that for management mode, these sockets are always included
if detected.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the internal CDI representation includes
the persistenced socket if the include-persistenced-socket feature
flag is enabled.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change skips the injection of the nvidia-persistenced socket by
default.
An include-persistenced-socket feature flag is added to allow the
injection of this socket to be explicitly requested.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
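A sketch of explicitly requesting the socket injection, assuming the feature flag
sits under the [features] table of config.toml:
    [features]
    # re-enable injection of the nvidia-persistenced socket into containers
    include-persistenced-socket = true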
This change updates the create-symlinks hook to check whether
link paths resolve in the container's filesystem. In addition
the executable is updated to return an error if a link could
not be created.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that we do not unnecessarily print warnings for
runtimes where these configs are not applicable.
This removes the following warnings:
WARN[0000] Ignoring runtime-config-override flag for docker
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds logic to extract the golang version from a Dockerfile.
This allows us to trigger golang updates using dependabot.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the logic to populate the options for the
nvidia runtime configs added to containerd or crio from a default runtime
if this is specified and a runc entry is not found.
This allows the default runtime values for settings such as SystemdCgroup
to be applied correctly.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change moves most of the logging for the NVIDIA Container runtime
from Info to Debug or Trace -- especially in the case where no
modifications are required.
This removes excessive logging.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change makes the following changes:
* Allows the toolkit.pid path to be specified
* Creates the toolkit.pid file at /run/nvidia/toolkit/toolkit.pid by default
* Handles failures to remove the /run/nvidia/toolkit folder
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the created /etc/ld.so.conf.d file
has a higher priority so that the injected libraries
take precedence over non-compat libraries.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Tools such as oc mirror do not support the provenance metadata
added to the image manifests with newer docker buildx versions.
This change disables the addition of provenance information.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Although the nvidia-ctk cdi generate command generates
specs with 644 permissions, the nvidia-ctk cdi transform
commands do not. This change sets the default permissions
to 644 instead of 600.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change renames the nvidia-ctk system create-device-nodes
flag driver-root to root. This makes it clearer that this is
used to load the kernel modules and is not specific to the
user-mode driver installation.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an option to the toolkit container to allow
the dev root to be specified. This adds support for driver installations
where the driver files are at one root and the dev nodes are created
elsewhere -- most typically at /. This is the case, for example, for
GKE driver installations.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that nvmllib and devicelib are constructed
before these are used to construct infolib.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change creates an nvidia-cdi-hook binary for implementing
CDI hooks. This allows for these hooks to be separated from the
nvidia-ctk command which may, for example, require libnvidia-ml
to support other functionality.
The nvidia-ctk hook subcommand is maintained as an alias for the
time being to allow for existing CDI specifications referring to
this path to work as expected.
Signed-off-by: Avi Deitcher <avi@deitcher.net>
This allows settings such as:
nvidia-ctk config --set nvidia-container-runtime.runtimes=crun:runc
to be applied correctly.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change moves from using strings to using device.Identifiers
as input for requesting CDI specifications for specific
devices.
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
When running nvidia-ctk on a system that uses a custom XDG_DATA_DIRS
environment variable value, the configuration files for `glvnd`,
`vulkan`, and `egl` fail to get passed through from the host to the
container. Reading from XDG_DATA_DIRS instead of hardcoding the default
value allows for finding said files so they can be mounted in the
container.
Signed-off-by: Jared Baur <jaredbaur@fastmail.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a features config that allows
individual features to be toggled at a global level. Each feature can (by default)
be controlled by an environment variable.
The GDS, MOFED, NVSWITCH, and GDRCOPY features are examples of such features.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
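As a sketch only -- the key names below are illustrative assumptions rather than
confirmed option names -- such a features section could look like:
    [features]
    # each feature can (by default) also be toggled via its corresponding
    # environment variable
    gds = true
    mofed = false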
This change includes an arm64 arch check when installing golang.
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that NVIDIA_VISIBLE_DEVICES=void is included in
generated CDI specs. This prevents the NVIDIA Container Runtime Hook
from injecting devices if NVIDIA_VISIBLE_DEVICES=all is set.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change removes the additional libnvidia-container0=0.10.0+jetpack dependency
that was introduced for Tegra-based systems. These have since been migrated to
CDI-based direct injection using the NVIDIA Container Runtime.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --create-device-nodes option to the toolkit config CLI.
Most notably, this allows the creation of control devices to be skipped
when CDI spec generation is enabled.
Currently values of "", "node", and "control" are supported and can be set
via the command line flag or the CREATE_DEVICE_NODES environment variable.
The default value of CREATE_DEVICE_NODES=control will trigger the creation
of control device nodes. Setting this envvar to include the (comma-separated)
strings of "" or "none" will disable device node creation regardless of
whether other supported strings are included.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that CLI tools that require the path to the
driver root accept both the NVIDIA_DRIVER_ROOT and DRIVER_ROOT
environment variables in addition to the --driver-root command
line argument.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change removes the centos8-x86_64 and centos8-aarch64 pipeline jobs.
These packages are no longer used since centos7 packages are used instead.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Instead of relying only on Experimental mode, the docker daemon
config requires that CDI be enabled as an opt-in feature.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Allow for customizing the path to ldconfig and call ldconfig with explicit cache/config paths
See merge request nvidia/container-toolkit/container-toolkit!525
Since the `update-ldcache` hook uses the host's `ldconfig`, the default
cache and config files configured on the host will be used. If those
defaults differ from what nvidia-ctk expects it to be (/etc/ld.so.cache
and /etc/ld.so.conf, respectively), then the hook will fail. This change
makes the call to ldconfig explicit in which cache and config files are
being used.
Signed-off-by: Jared Baur <jaredbaur@fastmail.com>
Since the `createContainer` `runc` hook runs with the environment that
the container's config.json specifies, the path to `ldconfig` may not be
easily resolvable if the host environment differs enough from the
container (e.g. on a NixOS host where all binaries are under hashed
paths in /nix/store with an Ubuntu container whose PATH contains
FHS-style paths such as /bin and /usr/bin). This change allows for
specifying exactly where ldconfig comes from.
Signed-off-by: Jared Baur <jaredbaur@fastmail.com>
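A sketch of pinning the ldconfig path in config.toml, assuming the ldconfig option
under the [nvidia-container-cli] table is the knob involved (a leading @
conventionally marks a host path; the path shown is a hypothetical example):
    [nvidia-container-cli]
    ldconfig = "@/run/current-system/sw/bin/ldconfig"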
This change adds crun as a configured low-level runtime.
Note that runc is still preferred and will be used if present on the
system.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
For users running the nvidia-container-runtime it would be useful
to determine the runtime mode used from the logs directly instead
of relying on other log messages as signals. This change ensures
that an explicitly selected mode is also logged instead of only
when mode=auto.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Using `WithDevRoot` on Tegra platforms was incorrectly setting
`driverRoot`, fix it so that it correctly sets `devRoot`.
Signed-off-by: Jared Baur <jaredbaur@fastmail.com>
Refactor the engine.Interface such that the Set() API does not return an extraneous error
See merge request nvidia/container-toolkit/container-toolkit!515
This change adds a --relative-to option to the nvidia-ctk transform root
command. This defaults to "host", maintaining the existing behaviour.
If --relative-to=container is specified, the root transform is applied to
container paths in the CDI specification instead of host paths.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change renames the root transformer to indicate that it
operates on host paths and adds a container root transformer for
explicitly transforming container roots.
The transform.NewRootTransformer constructor still exists, but has
been marked as deprecated.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change switches to using the reflect package to determine
the type of config options instead of inferring the type from the
Toml data structure.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds support for an NVIDIA_NVSWITCH environment variable.
When set to `enabled` this triggers the injection of all available
/dev/nvidia-nvswitch* device nodes.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a driver root abstraction that defines how
libraries are located relative to the root. This allows for
this driver root to be constructed once and passed to discovery
code.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Instead of relying solely on a static config, we resolve the path
to ldconfig. The path is checked for existence and a .real suffix is preferred.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change renames NewGraphicsDiscoverer to NewDRMNodesDiscoverer and
instead calls NewGraphicsMountsDiscoverer explicitly when constructing
a graphics modifier.
This avoids the import of config.Config into the discover package
which leads to a transitive dependency on toml-specifics and
requires that the vendor/github.com/pelletier/ package
be vendored into consumers.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
A driverRoot defines both the driver library root and the
root for device nodes. In the case of preinstalled drivers or
the driver container, these are equal, but in cases such as GKE
they do not match. In this case, drivers are extracted to a folder
and devices exist at the root /.
The changes here add a devRoot option to the nvcdi API that allows the
parent of /dev to be specified explicitly.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
In some cases we might get a permission error when trying to chmod -
most likely this is due to something beyond our control,
like the whole `/dev` tree being mounted.
Do not fail container creation in this case.
Because we lose control of the program after `exec()`-ing the `chmod(1)` program
and therefore cannot handle errors,
refactor to use the `chmod(2)` syscall instead of `exec()`-ing `chmod(1)`.
Fixes: #143
Signed-off-by: Ievgen Popovych <jmennius@gmail.com>
This change skips the update of the ldcache in the container if it
doesn't exist. Instead, the -N flag is used to only create the
relevant symlinks.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since we allow pattern inputs for locating symlinks, we could have
duplicates. The error being checked is resolved by the deduplication.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes make library lookups more robust. The core change is that
library lookups now first check a set of predefined locations before checking
the ldcache. This also handles cases where an ldcache is not available more
gracefully.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows CDI devices to be requested as mounts in the
container. This enables their use in environments such as kind
where environment variables or annotations cannot be used.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change refactors the use of the symlink filter to make it extensible.
A blocked filter can be set on the Tegra CSV discoverer to ensure that the correct
symlink libraries are filtered out. Here, globs can be used to select multiple libraries,
and a **/ prefix on the globs indicates that the pattern that follows is only applied to
the filename of the symlink entry in the CSV file.
A --csv.ignore-pattern command line argument is added to the nvidia-ctk cdi generate
command that allows this to be set.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change improves the testability of the CSV discoverer.
This is done by adding injection points for mocks for library discovery and
symlink resolution.
Note that this highlights a bug in the current implementation where the
library filter causes valid symlinks to be skipped.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a "required" option to the new toml config
that controls whether a default config is returned or not.
This is useful from the NVIDIA Container Runtime Hook, where
/run/driver/nvidia/etc/nvidia-container-runtime/config.toml
is checked before the standard path.
This fixes a bug where the default config was always applied
when this config was not used.
See https://github.com/NVIDIA/nvidia-container-toolkit/issues/106
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a UsesNVGPUModule function that checks whether the nvgpu
kernel module is used by NVML. This allows for more robust detection of
Tegra-based platforms where libnvidia-ml.so can be used to enumerate the
iGPU.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates go-nvlib to include logic to skip NVIDIA PCI-E
devices where the name or class id is not known.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change renames the csv.library-search-path option to
library-search-path so as to be more generally applicable in
future. Note that the option is still only applied in csv mode.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change simplifies the nvidia-ctk config default command.
By default it now outputs the default config to STDOUT, and can
optionally output this to a file.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change introduces a config.Toml type that is used as the base for
config file processing and manipulation. This ensures that configs --
including commented values -- can be handled consistently.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Migrate to internal/config.Config structs for the NVIDIA Container Runtime Hook config.
See merge request nvidia/container-toolkit/container-toolkit!463
This change ensures that the Config structs from internal.Config
are used for the NVIDIA Container Runtime Hook config too.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change removes installation of the oci-nvidia-hook files.
These files conflict with CDI use in runtimes that support it.
The use of the hook should be considered deprecated on these platforms.
If a hook is required, the
nvidia-ctk runtime configure --config-mode=oci-hook
command should be used to create the hook file(s).
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change extends the nvidia-ctk runtime configure command
with a --config-mode=oci-hook option that creates an OCI hook JSON file.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the centos7 and ubuntu18.04 packages are
published to the generic rpm and deb repos, respectively.
All other packages except the centos8-ppc64le packages are skipped
as these use cases are covered by the generic packages.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
In order to properly handle systems with both iGPU and dGPU
drivers included, we skip "sym" mount specifications which
refer to .so or .so.[1-9] files.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change explicitly generates a CDI specification from
the supplied CSV files when cdi mode is detected. This
ensures consistency between the behaviour on Tegra-based
systems.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
If the config.toml has an empty root specified, this could be
passed to the NVIDIA Container CLI through the --root flag
which causes argument parsing to fail. This change only
adds the --root flag if the config option is specified
and is non-empty.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change generates CDI specifications for Tegra devices
with the nvidia.com/gpu=0 name by default. The type-index
naming strategy is also supported and will generate a device
with the name nvidia.com/gpu=gpu0.
The uuid naming strategy will raise an error if selected.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since the incoming OCI spec has already been parsed and used to
construct a CUDA image representation, pass this to the CSV
modifier constructor instead of re-creating an image representation.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change sets the default CDI spec dirs at a config level instead
of when a CDI runtime modifier is constructed. This makes this setting
consistent with other options such as the nvidia-ctk path.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Include Shared Compiler Library (libnvidia-gpucomp.so) in the list of compute libraries.
See merge request nvidia/container-toolkit/container-toolkit!442
The path used to locate the GSP firmware is explicitly set to /lib/firmware/nvidia.
Users may choose to install the GSP firmware in alternate locations where
the kernel would look for firmware on the root filesystem.
Add locate functionality which looks for the GSP firmware files in the
same location as the kernel would
(https://docs.kernel.org/driver-api/firmware/fw_search_path.html).
The paths searched in order are:
- path described in /sys/module/firmware_class/parameters/path
- /lib/firmware/updates/UTS_RELEASE/
- /lib/firmware/updates/
- /lib/firmware/UTS_RELEASE/
- /lib/firmware/
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The debian and rpm packages are updated to trigger the generation
of a default config if no config exists at the expected location.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the nvidia-ctk config default command
generates a config file that is compatible with the official documentation
to, for example, disable cgroups in the NVIDIA Container CLI.
This requires that whitespace around comments is stripped before outputting the
contents.
This also adds an option to load a config and modify it in-place instead. This can
be triggered as a post-install step, for example.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The stable-dind image is out of date and has not been updated for 3 years.
This change updates to the latest dind image.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change uses the actual discovered path of nvidia-smi when
creating a symlink to the binary on WSL2 platforms.
This ensures that cases where multiple driver store paths are
detected are supported.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change splits the functionality in the internal system package
into two packages: one for dealing with devices and one for dealing
with kernel modules. This removes ambiguity around the meaning of
driver / device roots in each case.
In each case, a root can be specified where device nodes are created
or kernel modules loaded.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds the --nvidia-runtime-dir as a command line
argument when configuring container runtimes in the toolkit container.
This removes the need to set it via the command line.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --create-device-nodes option to the
nvidia-ctk system create-dev-char-symlinks command to create
device nodes. This currently only creates control device nodes.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes add a --load-kernel-modules option to the
nvidia-ctk system commands. If specified the NVIDIA kernel modules
(nvidia, nvidia-uvm, and nvidia-modeset) are loaded before any
operations on device nodes are performed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change uses the installed NVIDIA Container Runtime Hook wrapper
as the path in the applied config. This prevents conflicts with other
installations of the NVIDIA Container Toolkit.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-container-runtime-hook.path config option
to allow the path used for the prestart hook to be overridden. This
is useful in cases where multiple NVIDIA Container Toolkit installations
are present.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
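A sketch of overriding the prestart hook path in config.toml (the path shown is
hypothetical):
    [nvidia-container-runtime-hook]
    path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook"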
This change switches to generating an OCI runtime hook to create
individual symlinks instead of processing a CSV file in the hook.
This allows for better reuse of the logic generating CDI
specifications, for example.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a symlinks.Resolve function for resolving symlinks and
updates usages across the code to make use of it.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates go-nvlib to ensure that non-migcapable GPUs
are skipped when generating CDI specifications for MIG devices.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the csv mode option to be specified in the
nvidia-ctk cdi generate command and adds a --csv.file option
that can be repeated to specify the CSV files to be processed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that libcuda.so can be located on systems
where no patch version is specified in the driver version.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The nvcdi API is extended to allow for merged device options to
be specified. If any options are specified, then a merged device
is generated.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes remove the use of discover.Config which was used
to pass the driver root and the nvidiaCTK path in some cases.
Instead, the nvidiaCTKPath is resolved at the beginning of runtime
invocation to ensure that this is valid at all points where it is
used.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since the default configuration is now platform specific,
there is no need to install specific versions as part of
the package installation.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Also include manifest.txt with, for example, version
info when extracting packages from the packaging image.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a GetDefaultConfigToml function to the config package.
This function returns the default config in the form of raw TOML
including comments. This is useful for generating a default config at
installation time, with platform-specific differences codified.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This ensures that the artifacts associated with a particular
release version are preserved along with the container
images that are used as operands for this version.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a CLI command to generate a default config.
This config checks the host operating system to apply specific
modifications that were previously captured in static config
files.
These include:
* select /sbin/ldconfig or /sbin/ldconfig.real depending on which exists on the host
* set the user to allow device access on SUSE-based systems
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change renames the struct for storing CLI flag values to options
(instead of config) to avoid a conflict with the config package.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Generate CDI specifications with 644 permissions to allow non-root clients to consume them
See merge request nvidia/container-toolkit/container-toolkit!381
By default, temporary files are created with permissions 600 and
this means that the files created when updating the ldcache are
not readable in non-root containers.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Support building nvidia-docker and nvidia-container-runtime as dist-independent packages
See merge request nvidia/container-toolkit/container-toolkit!379
In order to add the libnvidia-container0 packages to our ubuntu18.04
kitmaker archive, a workaround was added that downloaded these packages
before constructing the archive. Since the packages have now been
published -- and will not change -- this workaround is no longer needed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the nvidia-docker2 and nvidia-container-runtime
components are not built and distributed for patch releases.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-container-runtime.runtimes config option.
If this is unset no changes are made to the config and the default values are used. This
allows this setting to be overridden in cases where this is required. One such example is
crio where crun is set as the default runtime.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
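For example, on a crio host where crun should be preferred, the override might
look like the following sketch:
    [nvidia-container-runtime]
    # candidate low-level runtimes, in order of preference
    runtimes = ["crun", "runc"]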
This change adds an nvidia-ctk system create-device-nodes command for
creating NVIDIA device nodes. Currently this is limited to control devices
(nvidia-uvm, nvidia-uvm-tools, nvidia-modeset, nvidiactl).
A --dry-run mode is included for outputting commands that would be executed and
the driver root can be specified.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-container-runtime.modes.cdi.annotation-prefixes config
option that defaults to cdi.k8s.io/. This allows the annotation prefixes parsed
for CDI devices to be overridden in cases where CDI support in container engines such
as containerd or crio needs to be overridden.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
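A sketch of this option in config.toml; the second prefix is a hypothetical
example of an additional prefix to parse:
    [nvidia-container-runtime.modes.cdi]
    annotation-prefixes = ["cdi.k8s.io/", "nvidia.cdi.k8s.io/"]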
This change allows nvcdi.New to return an error in addition to the
constructed library instead of panicking.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
A simplified CDI spec has no duplicate entities in any single set of container edits.
Furthermore, container edits defined at a spec level are not included in the container
edits for a device.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since we relied on finding libcuda.so in the LDCache to determine both the CUDA
version and the expected directory for the driver libraries, the generation of the
management CDI specifications fails in containers where the LDCache has not been updated.
This change falls back to searching a set of predefined paths instead when the lookup of
libcuda.so in the cache fails.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
CDI generation modes such as management and wsl don't require
NVML. This change removes the top-level instantiation of nvmllib
and replaces it with an instantiation in the nvml CDI spec generation
code.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change prefers (non-symlink) sockets at /run over /var/run for
nvidia-persistenced and nvidia-fabricmanager sockets.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds the .cdi and .legacy mode-specific runtimes to the list of
runtimes supported by the operator. These are also installed as
part of the NVIDIA Container Toolkit.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change switches to using nvidia-container-runtime.experimental as the
wrapper name over nvidia-container-runtime-experimental. This is consistent
with upcoming mode-specific binaries.
The wrapper is created at nvidia-container-runtime.experimental.real.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the nvidia-container-runtime.modes.cdi.default-kind
to be set in the toolkit-container.
The NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND envvar is used.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
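The resulting config.toml setting is of the form sketched below (the kind value
shown is illustrative):
    [nvidia-container-runtime.modes.cdi]
    default-kind = "nvidia.com/gpu"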
The following changes are made:
* The default-cdi-kind config option is used to convert an envvar entry to a fully-qualified device name
* If annotation devices exist, these are used instead of the envvar devices.
* The `all` device is no longer treated as a special case and MUST exist in the CDI spec.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes add support for generating a management spec to the nvcdi API.
A management spec consists of a single CDI device (`all`) which includes all expected
NVIDIA device nodes, driver libraries, binaries, and IPC sockets.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Fix handling of envvars in toolkit container which modify the NVIDIA Container Runtime config
See merge request nvidia/container-toolkit/container-toolkit!317
This change generates device folder permission hooks per device instead of
at a spec level. This ensures that the hook is not injected for a device that
does not have any nested device nodes.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes add a wsl discovery mode to the nvidia-ctk cdi generate command.
If wsl mode is enabled, the driver store for the available devices is used as
the source for discovered entities.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --discovery-mode flag to the nvidia-ctk cdi generate
command and plumbs this through to the CDI API.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change copies dxcore.h and dxcore.c from libnvidia-container to
allow for the driver store path to be queried. Modifications are made
to dxcore to remove the code associated with checking the components
in the driver store path.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the nvidia-container-runtime.mode option to be set
by the toolkit container.
This is controlled by the --nvidia-container-runtime-mode command line
argument and the NVIDIA_CONTAINER_RUNTIME_MODE envvar.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Newer drivers have split the GSP firmware into multiple files so a simple match
against gsp.bin in the firmware directory is no longer possible. This patch
adds globbing capabilities to match any GSP firmware files of the form gsp*.bin
and mount them all into the container.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This change adds an nvcdi package that exposes a basic API for
CDI spec generation. This is used from the nvidia-ctk cdi generate
command and can be consumed by DRA implementations and the device plugin.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-container-runtime.cdi executable that
overrides the runtime mode from the config to "cdi".
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The version for RPM release candidates has the form `1.13.0-0.1.rc.1-1` whereas debian packages have the form `1.13.0~rc.1-1`.
Note that since the `~` is handled in [the same way](https://docs.fedoraproject.org/en-US/packaging-guidelines/Versioning/#_handling_non_sorting_versions_with_tilde_dot_and_caret) as for Debian packages, there does not seem to be a specific reason for this and dealing with multiple version strings in our entire pipeline adds complexity.
This change aligns the package versioning for rpm packages with Debian packages.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This makes the intent of the command line argument clearer since this
relates specifically to the root where the NVIDIA driver is installed.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the update-ldcache hook is created in a manner
consistent with other nvidia-ctk hooks ensuring that a full path is
used.
Without this change the update-ldcache hook on Tegra-based systems had an
invalid path.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
If this is not done, the default config which sets the nvidia-ctk.path
option as "nvidia-ctk" will result in an invalid OCI spec if a hook is
injected. This change ensures that the path used is always an absolute
path as required by the hook spec.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change simplifies how Kitmaker archives are constructed.
Currently only centos8 and ubuntu18.04 packages are included.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change uses the `index` mode for the --device-name-strategy when
generating CDI specifications by default. This generates device names
such as nvidia.com/gpu=0 or nvidia.com/gpu=1:0 by default.
Note that this requires a CDI spec version of 0.5.0 and for consumers
(e.g. podman) that are only compatible with older versions, one of the
other strategies (`type-index` or `uuid`) should be used instead to
generate a v0.3.0 or v0.4.0 specification.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
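For consumers that cannot handle a v0.5.0 specification, another strategy can be selected explicitly; a minimal sketch using this flag together with the `--format` flag described elsewhere in these notes:
```sh
# Default: `index` device names such as nvidia.com/gpu=0 (requires CDI spec v0.5.0):
nvidia-ctk cdi generate

# For older CDI consumers, fall back to `type-index` names such as nvidia.com/gpu=gpu0:
nvidia-ctk cdi generate --device-name-strategy=type-index --format=yaml
```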
This change adds a --create-all mode to the create-dev-char-symlinks hook.
This mode creates all POSSIBLE symlinks to device nodes for regular and cap
devices, with the number of GPUs inferred from the PCI device information.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --watch option to the create-dev-char-symlinks hook. This
installs an fsnotify watcher that creates symlinks for ADDED device nodes under
/dev/char.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-ctk hook create-dev-char-symlinks
subcommand that creates symlinks to device nodes (as required by
systemd) under /dev/char.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
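A sketch of how the subcommand and flags described in the three entries above might be invoked; this assumes the command is run as root on the host:
```sh
# Create /dev/char symlinks for existing NVIDIA device nodes (as required by systemd):
sudo nvidia-ctk hook create-dev-char-symlinks

# Create all possible symlinks up front, with the GPU count inferred from PCI information:
sudo nvidia-ctk hook create-dev-char-symlinks --create-all

# Watch for added device nodes and create symlinks as they appear:
sudo nvidia-ctk hook create-dev-char-symlinks --watch
```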
This change adds a --device-name-strategy flag for generating a CDI
specification. This allows a CDI spec to be generated with the following
names used for devices:
* type-index: gpu0 and mig0:1
* index: 0 and 0:1
* uuid: GPU and MIG UUIDs
Note that the use of 'index' generates a v0.5.0 CDI specification since
this relaxes the restriction on the device names.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the first match of an executable in the path
is returned instead of a list of candidates. This prevents a CDI spec,
for example, from containing multiple entries for a single executable
(e.g. nvidia-smi).
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change uses functionality from the CDI package to determine
the minimum required CDI spec version. This allows for a spec with
the widest compatibility to be specified.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change implements the discovery of versioned driver libraries
by reusing the mounts and update-ldcache discoverers used for, for example,
CSV file discovery. This allows the container paths to be correctly generated
without requiring specific manipulation.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an --nvidia-ctk-path flag to the nvidia-ctk cdi generate
command. This ensures that the executable path for the generated
hooks can be specified consistently.
Since the NVIDIA Container Runtime already allows for the executable
path to be specified in the config, the utility code that updates the
LDCache and creates other nvidia-ctk hooks is also updated.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change refactors the generation of CDI specifications
to use discoverers and generate the CDI specifications from these
discoverers. This allows for better reuse.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The HostPath field was added in the v0.5.0 CDI specification.
The cdi package uses strict unmarshalling when loading specs
from file causing failures for unexpected fields.
Since the behaviour for HostPath == "" and HostPath == Path are
equivalent, we clear HostPath if it is equal to Path to ensure
compatibility with the widest range of specs.
This allows, for example, a v0.4.0 spec to be generated as required.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change replaces the `--json` flag of the nvidia-ctk cdi generate
command with a --format flag that accepts a string format of either
json or yaml.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change extends the support for multiple envvars when
specifying swarm resources to consider ALL of the specified
environment variables instead of the first match.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Inject DRM device nodes into containers when Graphics or Display capabilities are requested
See merge request nvidia/container-toolkit/container-toolkit!235
This change adds the discovery of DRM devices associated with requested
devices. This means that the /dev/dri/card* and /dev/dri/renderD*
devices associated with each requested NVIDIA GPU are injected into
the container and that the /dev/dri/by-path symlinks associated with
these devices are created in the container.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
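As an example, requesting the graphics capability should make the DRM nodes for the selected GPU visible in the container; a hedged sketch using docker (the image and command are arbitrary):
```sh
docker run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=0 \
    -e NVIDIA_DRIVER_CAPABILITIES=graphics \
    ubuntu ls -l /dev/dri
# Illustrative output: the card* and renderD* nodes (and /dev/dri/by-path symlinks)
# associated with GPU 0.
```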
This change adds support for filtering entities by specifying a filter.
This can be used, for example, to check whether a mount or device
has a particular property and to remove it from the set of discovered
entities if it does not.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a Devices abstraction to the CUDA image utilities. This
allows for checking whether a device is selected, for example.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Add nvidia-ctk hook chmod command to set permissions and ensure permissions of `/dev/nvidia-caps` is set
See merge request nvidia/container-toolkit/container-toolkit!232
This change generates one or more createContainer hooks for ensuring
that subfolders in /dev have the required permissions in the container.
As an example, a user requires read permissions on the /dev/nvidia-caps
folder in addition to the specific caps devices under this folder being included.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-ctk hook chmod command that can be used
to update the permissions for paths in the container.
This prepends the container root to the paths to allow these to be
updated by runtime executables.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
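A minimal sketch of invoking this command; the `--mode` and `--path` flag names are assumptions, and when run as a createContainer hook the container root is prepended to the path internally:
```sh
# Ensure /dev/nvidia-caps is readable inside the container (flag names assumed):
nvidia-ctk hook chmod --mode 755 --path /dev/nvidia-caps
```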
This change ensures that the CDI spec mounts the ipc sockets with the
noexec flag to allow these to function in rootless mode with podman.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change simplifies the docker config update logic.
This also allows for the API to match the crio update code.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This adds support for updating crio configs (instead of installing hooks)
and adds crio support to the nvidia-ctk runtime configure command.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the ordering of internal pipeline dependencies to
ensure that the correct rules are applied.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change includes meta devices (e.g. /dev/nvidiactl) in the
generated CDI spec. Missing device nodes are ignored.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change generates a v0.4.0 CDI spec instead of a v0.5.0 spec.
This allows older versions of podman, for example, to be used.
This requires that the device names do not start with a numeric character
and that the HostPath for a device is unspecified.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the swarm-resource config option to specify a
comma-separated list of environment variables instead of a single
environment variable.
The first environment variable matched is considered and other
environment variables are ignored.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds functionality to generate CDI specifications
for all devices detected on the system. A specification containing
all GPUs and MIG devices is generated. All libraries on the host
ldcache that have an NVIDIA Driver Version suffix are included as
are the required binaries and IPC sockets.
A hook (based on the nvidia-ctk hook subcommand) to update the ldcache
in the container for the libraries being injected is also added to the
CDI specification.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
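Putting this together, a specification for all detected devices can be generated and then consumed by a CDI-aware engine; a sketch assuming a podman version with CDI support and the conventional /etc/cdi location:
```sh
# Generate a CDI specification for all GPUs and MIG devices on the host:
sudo mkdir -p /etc/cdi
nvidia-ctk cdi generate | sudo tee /etc/cdi/nvidia.yaml > /dev/null

# Request a device from the generated specification by its CDI name:
podman run --rm --device nvidia.com/gpu=0 ubuntu nvidia-smi -L
```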
This change allows the NVIDIA Container Runtime to inject vulkan
loaders and libraries by modifying the OCI runtime specification.
This allows vulkan applications to run in containers without
additional modifications.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a Locator that can be used to locate libraries.
If library names are specified, the ldcache is searched; otherwise
symlinks are resolved.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that a more concrete error is provided by the NVIDIA
Container Runtime if the NVIDIA Container Runtime hook cannot be
located.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the destination / root to be set as the
first positional argument OR as a command line flag. This
allows for the GPU Operator to transition to a case where
only the flag / envvar is used.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This reuses the Dockerfile for yum-based rpm distros (centos, amazonlinux)
instead of maintaining two files with the same contents.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change splits the nvidia-container-toolkit package into the top-level package and
an nvidia-container-toolkit-base package.
The nvidia-container-toolkit-base package allows the NVIDIA Container Runtime and
NVIDIA Container Toolkit CLI to be installed on systems without requiring that the
NVIDIA Container Runtime Hook and the transitive dependencies included in the NVIDIA
Container Library and NVIDIA Container CLI also be installed.
This allows the runtime to be used on systems where the CSV or CDI mode of the runtime
is used exclusively.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
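For example, on a Debian-based system where only the CSV or CDI mode is needed, just the base package might be installed; the package name is taken from the entry above and the package repository is assumed to be configured already:
```sh
# Install the NVIDIA Container Runtime and nvidia-ctk without pulling in
# libnvidia-container and the NVIDIA Container Runtime Hook:
sudo apt-get install -y nvidia-container-toolkit-base
```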
This change adds a modifier that injects the Tegra platform files
* /etc/nv_tegra_release
* /sys/devices/soc0/family
allowing these files to be used for platform detection in a containerized
context such as the GPU device plugin.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the
* accept-nvidia-visible-devices-envvar-when-unprivileged
* accept-nvidia-visible-devices-as-volume-mounts
options to be set in the toolkit-container. These are controlled
by command line flags or the following environment variables:
* ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
* ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a root member to the mounts type that is used to
perform most of the lookups for files and devices. This allows
for consistent handling of relative paths.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change improves the error message when invoking the NVIDIA
Runtime Hook in non-legacy mode. This should guide users to specify
the --runtime=nvidia flag when using docker.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
The Relative method added to the Locator interface was
not correctly implemented in the file type. The root was
never set when instantiating the object.
This change removes this method from the interface and the file
type, switching to a local implementation in the mounts type
instead.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Since the creation of symlinks may include other libraries / folders
the ldcache should be updated AFTER the symlinks are created.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a `runtime configure` command to the nvidia-ctk CLI. This
command is currently limited to configuring the docker config on the
system by modifying the daemon.json config file associated with docker.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
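A typical invocation of the command described above, assuming Docker as the target engine; restarting the daemon afterwards is a usage note rather than part of the command:
```sh
# Register the nvidia runtime in /etc/docker/daemon.json:
sudo nvidia-ctk runtime configure --runtime=docker

# Restart docker for the change to take effect:
sudo systemctl restart docker
```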
In preparation for adding a command to the nvidia-ctk CLI to modify
the docker config, this change refactors load, update, and flush logic
from the toolkit container docker CLI to an internal package.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an nvidia-container-runtime.modes.cdi.spec-dirs
config option that allows the default spec dirs to be overridden.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change reuses the code that checks for the existing NVIDIA
Container Runtime hook to ensure that both nvidia-container-toolkit
and nvidia-container-runtime-hook are detected.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change renames the nvidia-container-toolkit executable
to nvidia-container-runtime-hook. Here nvidia-container-toolkit
is created as a symlink to nvidia-container-runtime-hook.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
For rc releases we allow nvidia-container-toolkit versions
to not match libnvidia-container versions. This change ensures
that only changed packages are released.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This mirrors what is done in cri-o and allows for device nodes
from, for example, the driver container to be injected into a
container at /dev instead of <ROOT>/dev
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a charDevices discoverer and uses this
for CSV, GDS, and MOFED discovery. Internally the discoverer
is a "mounts" discoverer with a charDevice locator.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Instead of creating a set of discoverers per file, this change creates
a discoverer per type by first concatenating the mount specifications
from all files. This will allow all device nodes, for example, to
be treated as a single device.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This adds a Relative function to the Locator interface and uses
this to determine the host and container paths for located files
(and devices). This ensures that the root (e.g. the nvidia driver
root) is stripped from the container path.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change creates GDS and MOFED modifiers and adds them to the
modifier created for the selected runtime mode if the NVIDIA_GDS
and NVIDIA_MOFED envvars are set to "enabled", respectively.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change uses modifier composition and the discoverModifier to
refactor the existing CSV modifier.
This change adds a discoverModifier to the internal/modifier package and
refactors the CSV modifier to use this abstraction.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change prevents errors when downloading ubuntu repos on
amd64 architectures. The `stable` images were last pushed
2 years ago.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This ensures that older versions of containerd that may expect
this over options.BinaryName continue to work.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This fixes a bug where the runtime path for v1 containerd configs
was specified in the options.Runtime setting (which is used
for the default runtime) instead of options.BinaryName.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the variables used to construct the
version strings for CMDs are set in the makefile.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds version output to the nvidia-container-runtime,
nvidia-container-toolkit, and nvidia-ctk CLIs. The same version
is used in all cases and includes a version string and a git
revision if set.
The construction of the version string mirrors what is done in runc.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
These changes replace the nvidia-container-runtime config options
experimental and discover-mode with a single mode config option.
Note that mode is now a string with a default value of "auto"
and a mode value of "legacy" is equivalent to experimental == false.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the create-symlinks hook to also create symlinks for
libcuda.so, libGLX_indirect.so.0, and libnvidia-opticalflow.so
Signed-off-by: Evan Lezar <elezar@nvidia.com>
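A rough sketch of how such a symlink could be requested from the hook; the `--link` flag and its `target::link` argument format are assumptions, and the library path shown is illustrative:
```sh
# Create the libcuda.so symlink pointing at the injected libcuda.so.1 (format assumed):
nvidia-ctk hook create-symlinks \
    --link libcuda.so.1::/usr/lib/x86_64-linux-gnu/libcuda.so
```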
This change adds a GetContainerRoot to the oci.State type to
encapsulate the logic around determining the container root.
This fixes a bug where relative roots (e.g. as generated by containerd)
are not supported.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change processes and supports runc logging command line arguments.
This allows for better integration into container engines such as
docker.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This also removes a test that invokes nvidia-container-runtime run --bundle
expecting an error. This test is no longer valid since this command line
is forwarded to runc where the error should be detected.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a Requirements abstraction that can be used to check
an image's NVIDIA_REQUIRE_* envvars against the host properties such
as CUDA version or architecture.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
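As an illustration, such requirements are typically expressed through image environment variables; a hedged example using the common NVIDIA_REQUIRE_CUDA convention (the constraint value is arbitrary):
```sh
# Start a container that declares a minimum CUDA requirement; if the host driver
# does not satisfy the constraint, container startup fails.
docker run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=0 \
    -e NVIDIA_REQUIRE_CUDA="cuda>=11.6" \
    ubuntu true
```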
This change adds a CUDA image abstraction that encapsulates
the queries performed on a container image (e.g. envvars) to
check certain CUDA properties.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
- Make CDI device requests consistent with other methods
- Construct container info once
- Add logic to extract annotation device requests to image type
- Add IsPrivileged function to CUDA container type
- Add device IDs to nvcdi.GetSpec API
- Refactor extracting requested devices from the container image
- Add EnvVars option for all nvidia-ctk cdi commands
- Add nvidia-cdi-refresh service
- Add discovery of arch-specific vulkan ICD
- Add disabled-device-node-modification hook to CDI spec
- Add a hook to disable device node creation in a container
- Remove redundant deduplication of search paths for WSL
- Added ability to disable specific (or all) CDI hooks
- Consolidate HookName functionality on internal/discover pkg
- Add envvar to control debug logging in CDI hooks
- Add FeatureFlags to the nvcdi API
- Reenable nvsandboxutils for driver discovery
- Edit discover.mounts to have a deterministic output
- Refactor the way we create CDI Hooks
- Issue warning on unsupported CDI hook
- Run update-ldcache in isolated namespaces
- Add cuda-compat-mode config option
- Fix mode detection on Thor-based systems
- Add rprivate to CDI mount options
- Skip nil discoverers in merge
- Bump runc Go dependency to v1.3.0
- Fix resolution of libs in LDCache on ARM
- Updated .release:staging to stage images in nvstaging
- Refactor toolkit installer
- Allow container runtime executable path to be specified
- Add support for building ubuntu22.04 on arm64
- Fix race condition in mounts cache
- Add support for building ubuntu22.04 on amd64
- Fix update-ldcache arguments
- Remove positional arguments from nvidia-ctk-installer
- Remove deprecated --runtime-args from nvidia-ctk-installer
- Add version info to nvidia-ctk-installer
- Update nvidia-ctk-installer app name to match binary name
- Allow nvidia-ctk config --set to accept comma-separated lists
- Disable enable-cuda-compat hook for management containers
- Allow enable-cuda-compat hook to be disabled in CDI spec generation
- Add disable-cuda-compat-lib-hook feature flag
- Add basic integration tests for forward compat
- Ensure that mode hook is executed last
- Add enable-cuda-compat hook to CDI spec generation
- Add ldconfig hook in legacy mode
- Add enable-cuda-compat hook if required
- Add enable-cuda-compat hook to allow compat libs to be discovered
- Use libcontainer execseal to run ldconfig
- Add ignore-imex-channel-requests feature flag
- Disable nvsandboxutils in nvcdi API
- Allow cdi mode to work with --gpus flag
- Add E2E GitHub Action for Container Toolkit
- Add remote-test option for E2E
- Enable CDI in runtime if CDI_ENABLED is set
- Fix overwriting docker feature flags
- Add option in toolkit container to enable CDI in runtime
- Remove Set from engine config API
- Add EnableCDI() method to engine.Interface
- Add IMEX binaries to CDI discovery
- Rename test folder to tests
- Add allow-cuda-compat-libs-from-container feature flag
- Disable mounting of compat libs from container
- Skip graphics modifier in CSV mode
- Move nvidia-toolkit to nvidia-ctk-installer
- Automated regression testing for the NVIDIA Container Toolkit
- Add support for containerd version 3 config
- Remove watch option from create-dev-char-symlinks
- Add string TOML source
- Improve the implementation for UseLegacyConfig
- Properly pass configSearchPaths to a Driver constructor
- Fix create-device-node test when devices exist
- Add imex mode to CDI spec generation
- Only allow host-relative LDConfig paths
- Fix NVIDIA_IMEX_CHANNELS handling on legacy images
- Fix bug in default config file path
- Fix fsnotify.Remove handling logic
- Force symlink creation in create-symlink hook
### Changes in the Toolkit Container
- Create /work/nvidia-toolkit symlink
- Use Apache license for images
- Switch to golang distroless image
- Switch to cuda ubi9 base image
- Use single version tag for image
- Extract deb and rpm packages to single image
- Bump nvidia/cuda in /deployments/container
- Bump nvidia/cuda in /deployments/container
- Add E2E GitHub Action for Container Toolkit
- Bump nvidia/cuda in /deployments/container
- Move nvidia-toolkit to nvidia-ctk-installer
- Add support for containerd version 3 config
- Improve the implementation for UseLegacyConfig
- Bump nvidia/cuda in /deployments/container
- Add imex mode to CDI spec generation
- Only allow host-relative LDConfig paths
- Fallback to file for runtime config
### Changes in libnvidia-container
- Fix pointer accessing local variable out of scope
- Require version match between libnvidia-container-tools and libnvidia-container1
- Add libnvidia-gpucomp.so to the list of compute libs
- Use VERSION_ prefix for version parts in makefiles
- Add additional logging
- Do not discard container flags when --cuda-compat-mode is not specified
- Remove unneeded --no-cntlibs argument from list command
- Add cuda-compat-mode flag to configure command
- Skip files when user has insufficient permissions
- Fix building with Go 1.24
- Add no-cntlibs CLI option to nvidia-container-cli
- Fix always using fallback
- Add fallback for systems without memfd_create()
- Create virtual copy of host ldconfig binary before calling fexecve()
- Fix some typos in text.
## v1.17.0
- Promote v1.17.0-rc.2 to v1.17.0
- Fix bug when using just-in-time CDI spec generation
- Check for valid paths in create-symlinks hook
## v1.17.0-rc.2
- Fix bug in locating libcuda.so from ldcache
- Fix bug in sorting of symlink chain
- Remove unsupported print-ldcache command
- Remove csv-filename support from create-symlinks
### Changes in the Toolkit Container
- Fallback to `crio-status` if `crio status` does not work when configuring the crio runtime
## v1.17.0-rc.1
- Allow IMEX channels to be requested as volume mounts
- Fix typo in error message
- Add disable-imex-channel-creation feature flag
- Add -z,lazy to LDFLAGS
- Add imex channels to management CDI spec
- Add support to fetch current container runtime config from the command line.
- Add creation of select driver symlinks to CDI spec generation.
- Remove support for config overrides when configuring runtimes.
- Skip explicit creation of libnvidia-allocator.so.1 symlink
- Add vdpau as a driver library search path.
- Add support for using libnvsandboxutils to generate CDI specifications.
### Changes in the Toolkit Container
- Allow opt-in features to be selected when deploying the toolkit-container.
- Bump CUDA base image version to 12.6.2
- Remove support for config overrides when configuring runtimes.
### Changes in libnvidia-container
- Add no-create-imex-channels command line option.
## v1.16.2
- Exclude libnvidia-allocator from graphics mounts. This fixes a bug that leaks mounts when a container is started with bi-directional mount propagation.
- Use empty string for default runtime-config-override. This removes a redundant warning for runtimes (e.g. Docker) where this is not applicable.
### Changes in the Toolkit Container
- Bump CUDA base image version to 12.6.0
### Changes in libnvidia-container
- Add no-gsp-firmware command line option
- Add no-fabricmanager command line option
- Add no-persistenced command line option
- Skip directories and symlinks when mounting libraries.
## v1.16.1
- Fix bug with processing errors during CDI spec generation for MIG devices
## v1.16.0
- Promote v1.16.0-rc.2 to v1.16.0
### Changes in the Toolkit Container
- Bump CUDA base image version to 12.5.1
## v1.16.0-rc.2
- Use relative path to locate driver libraries
- Add RelativeToRoot function to Driver
- Inject additional libraries for full X11 functionality
- Extract options from default runtime if runc does not exist
- Avoid using map pointers as maps are always passed by reference
- Reduce logging for the NVIDIA Container runtime
- Fix bug in argument parsing for logger creation
## v1.16.0-rc.1
- Support vulkan ICD files directly in a driver root. This allows for the discovery of vulkan files in GKE driver installations.
- Increase priority of ld.so.conf.d config file injected into container. This ensures that injected libraries are preferred over libraries present in the container.
- Set default CDI spec permissions to 644. This fixes permission issues when using the `nvidia-ctk cdi transform` functions.
- Add `dev-root` option to `nvidia-ctk system create-device-nodes` command.
- Fix location of `libnvidia-ml.so.1` when a non-standard driver root is used. This enables CDI spec generation when using the driver container on a host.
- Recalculate minimum required CDI spec version on save.
- Move `nvidia-ctk hook` commands to a separate `nvidia-cdi-hook` binary. The same subcommands are supported.
- Use `:` as an `nvidia-ctk config --set` list separator. This fixes a bug when trying to set config options that are lists.
- [toolkit-container] Bump CUDA base image version to 12.5.0
- [toolkit-container] Allow the path to `toolkit.pid` to be specified directly.
- [toolkit-container] Remove provenance information from image manifests.
- [toolkit-container] Add `dev-root` option when configuring the toolkit. This adds support for GKE driver installations.
## v1.15.0
* Remove `nvidia-container-runtime` and `nvidia-docker2` packages.
* Use `XDG_DATA_DIRS` environment variable when locating config files such as graphics config files.
* Add support for v0.7.0 Container Device Interface (CDI) specification.
* Add `--config-search-path` option to `nvidia-ctk cdi generate` command. These paths are used when locating driver files such as graphics config files.
* Use D3DKMTEnumAdapters3 to enumerate adapters on WSL2 if available.
* Add support for v1.2.0 OCI Runtime specification.
* Explicitly set `NVIDIA_VISIBLE_DEVICES=void` in generated CDI specifications. This prevents the NVIDIA Container Runtime from making additional modifications.
* [libnvidia-container] Use D3DKMTEnumAdapters3 to enumerate adapters on WSL2 if available.
* [toolkit-container] Bump CUDA base image version to 12.4.1
## v1.15.0-rc.4
* Add a `--spec-dir` option to the `nvidia-ctk cdi generate` command. This allows specs outside of `/etc/cdi` and `/var/run/cdi` to be processed.
* Add support for extracting device major number from `/proc/devices` if `nvidia` is used as a device name over `nvidia-frontend`.
* Allow multiple device naming strategies for `nvidia-ctk cdi generate` command. This allows a single
CDI spec to be generated that includes GPUs by index and UUID.
* Set the default `--device-name-strategy` for the `nvidia-ctk cdi generate` command to `[index, uuid]`.
* Remove `libnvidia-container0` jetpack dependency included for legacy Tegra-based systems.
* Add `NVIDIA_VISIBLE_DEVICES=void` to generated CDI specifications.
* [toolkit-container] Remove centos7 image. The ubi8 image can be used on all RPM-based platforms.
* [toolkit-container] Bump CUDA base image version to 12.3.2
## v1.15.0-rc.3
* Fix bug in `nvidia-ctk hook update-ldcache` where default `--ldconfig-path` value was not applied.
## v1.15.0-rc.2
* Extend the `runtime.nvidia.com/gpu` CDI kind to support full-GPUs and MIG devices specified by index or UUID.
* Fix bug when specifying `--dev-root` for Tegra-based systems.
* Log explicitly requested runtime mode.
* Remove package dependency on libseccomp.
* Added detection of libnvdxgdmal.so.1 on WSL2
* Use devRoot to resolve MIG device nodes.
* Fix bug in determining default nvidia-container-runtime.user config value on SUSE-based systems.
* Add `crun` to the list of configured low-level runtimes.
* Added support for `--ldconfig-path` to `nvidia-ctk cdi generate` command.
* Fix `nvidia-ctk runtime configure --cdi.enabled` for Docker.
* Add discovery of the GDRCopy device (`gdrdrv`) if the `NVIDIA_GDRCOPY` environment variable of the container is set to `enabled`
* [toolkit-container] Bump CUDA base image version to 12.3.1.
## v1.15.0-rc.1
* Skip update of ldcache in containers without ldconfig. The .so.SONAME symlinks are still created.
* Normalize ldconfig path on use. This automatically adjusts the ldconfig setting applied to ldconfig.real on systems where this exists.
* Include `nvidia/nvoptix.bin` in list of graphics mounts.
* Include `vulkan/icd.d/nvidia_layers.json` in list of graphics mounts.
* Add support for `--library-search-paths` to `nvidia-ctk cdi generate` command.
* Add support for injecting /dev/nvidia-nvswitch* devices if the NVIDIA_NVSWITCH=enabled envvar is specified.
* Added support for `nvidia-ctk runtime configure --enable-cdi` for the `docker` runtime. Note that this requires Docker >= 25.
* Fixed bug in `nvidia-ctk config` command when using `--set`. The types of the applied config options are now handled correctly.
* Add `--relative-to` option to `nvidia-ctk transform root` command. This controls whether the root transformation is applied to host or container paths.
* Added automatic CDI spec generation when the `runtime.nvidia.com/gpu=all` device is requested by a container.
* [libnvidia-container] Fix device permission check when using cgroupv2 (fixes #227)
## v1.14.3
* [toolkit-container] Bump CUDA base image version to 12.2.2.
## v1.14.2
* Fix bug on Tegra-based systems where symlinks were not created in containers.
* Add --csv.ignore-pattern command line option to nvidia-ctk cdi generate command.
## v1.14.1
* Fixed bug where contents of `/etc/nvidia-container-runtime/config.toml` is ignored by the NVIDIA Container Runtime Hook.
* [libnvidia-container] Use libelf.so on RPM-based systems due to removed mageia repositories hosting pmake and bmake.
## v1.14.0
* Promote v1.14.0-rc.3 to v1.14.0
## v1.14.0-rc.3
* Added support for generating OCI hook JSON file to `nvidia-ctk runtime configure` command.
* Remove installation of OCI hook JSON from RPM package.
* Refactored config for `nvidia-container-runtime-hook`.
* Added a `nvidia-ctk config` command which supports setting config options using a `--set` flag.
* Added `--library-search-path` option to `nvidia-ctk cdi generate` command in `csv` mode. This allows folders where
libraries are located to be specified explicitly.
* Updated go-nvlib to support devices which are not present in the PCI device database. This allows the creation of /dev/char symlinks on systems with such devices installed.
* Added `UsesNVGPUModule` info function for more robust platform detection. This is required on Tegra-based systems where libnvidia-ml.so is also supported.
* [toolkit-container] Set `NVIDIA_VISIBLE_DEVICES=void` to prevent injection of NVIDIA devices and drivers into the NVIDIA Container Toolkit container.
## v1.14.0-rc.2
* Fix bug causing incorrect nvidia-smi symlink to be created on WSL2 systems with multiple driver roots.
* Remove dependency on coreutils when installing package on RPM-based systems.
* Create output folders if required when running `nvidia-ctk runtime configure`
* Generate default config as post-install step.
* Added support for detecting GSP firmware at custom paths when generating CDI specifications.
* Added logic to skip the extraction of image requirements if `NVIDIA_DISABLE_REQUIRES` is set to `true`.
* [libnvidia-container] Include Shared Compiler Library (libnvidia-gpucomp.so) in the list of compute libraries.
* [toolkit-container] Ensure that common envvars have higher priority when configuring the container engines.
* [toolkit-container] Bump CUDA base image version to 12.2.0.
* [toolkit-container] Remove installation of nvidia-experimental runtime. This is superseded by the NVIDIA Container Runtime in CDI mode.
## v1.14.0-rc.1
* Add support for updating containerd configs to the `nvidia-ctk runtime configure` command.
* Create file in `etc/ld.so.conf.d` with permissions `644` to support non-root containers.
* Generate CDI specification files with `644` permissions to allow rootless applications (e.g. podman)
* Add `nvidia-ctk cdi list` command to show the known CDI devices.
* Add support for generating merged devices (e.g. `all` device) to the nvcdi API.
* Use *.* pattern to locate libcuda.so when generating a CDI specification to support platforms where a patch version is not specified.
* Update go-nvlib to skip devices that are not MIG capable when generating CDI specifications.
* Fix bug in creation of `/dev/char` symlinks by failing operation if kernel modules are not loaded.
* Add option to load kernel modules when creating device nodes
* Add option to create device nodes when creating `/dev/char` symlinks
* [libnvidia-container] Support OpenSSL 3 with the Encrypt/Decrypt library
* [toolkit-container] Allow same envvars for all runtime configs
## v1.13.1
* Update `update-ldcache` hook to only update ldcache if it exists.
* Update `update-ldcache` hook to create `/etc/ld.so.conf.d` folder if it doesn't exist.
* Fix failure when libcuda cannot be located during XOrg library discovery.
* Fix CDI spec generation on systems that use `/etc/alternatives` (e.g. Debian)
## v1.13.0
* Promote 1.13.0-rc.3 to 1.13.0
## v1.13.0-rc.3
* Only initialize NVML for modes that require it when running `nvidia-ctk cdi generate`.
* Prefer /run over /var/run when locating nvidia-persistenced and nvidia-fabricmanager sockets.
* Fix the generation of CDI specifications for management containers when the driver libraries are not in the LDCache.
* Add transformers to deduplicate and simplify CDI specifications.
* Generate a simplified CDI specification by default. This means that entities in the common edits in a spec are not included in device definitions.
* Also return an error from the nvcdi.New constructor instead of panicking.
* Detect XOrg libraries for injection and CDI spec generation.
* Add `nvidia-ctk system create-device-nodes` command to create control devices.
* Add `nvidia-ctk cdi transform` command to apply transforms to CDI specifications.
* Add `--vendor` and `--class` options to `nvidia-ctk cdi generate`
* [libnvidia-container] Fix segmentation fault when RPC initialization fails.
* [libnvidia-container] Build centos variants of the NVIDIA Container Library with static libtirpc v1.3.2.
* [libnvidia-container] Remove make targets for fedora35 as the centos8 packages are compatible.
* [toolkit-container] Add `nvidia-container-runtime.modes.cdi.annotation-prefixes` config option that allows the CDI annotation prefixes that are read to be overridden.
* [toolkit-container] Create device nodes when generating CDI specification for management containers.
* [toolkit-container] Add `nvidia-container-runtime.runtimes` config option to set the low-level runtime for the NVIDIA Container Runtime
## v1.13.0-rc.2
* Don't fail chmod hook if paths are not injected
* Only create `by-path` symlinks if CDI devices are actually requested.
* Fix possible blank `nvidia-ctk` path in generated CDI specifications
* Fix error in postun scriptlet on RPM-based systems
* Only check `NVIDIA_VISIBLE_DEVICES` for environment variables if no annotations are specified.
* Add `cdi.default-kind` config option for constructing fully-qualified CDI device names in CDI mode
* Add support for `accept-nvidia-visible-devices-envvar-unprivileged` config setting in CDI mode
* Add `nvidia-container-runtime-hook.skip-mode-detection` config option to bypass mode detection. This allows `legacy` and `cdi` mode, for example, to be used at the same time.
* Add support for generating CDI specifications for GDS and MOFED devices
* Ensure CDI specification is validated on save when generating a spec
* Rename `--discovery-mode` argument to `--mode` for `nvidia-ctk cdi generate`
* [libnvidia-container] Fix segfault on WSL2 systems
* [toolkit-container] Add `--cdi-enabled` flag to toolkit config
* [toolkit-container] Install `nvidia-ctk` from toolkit container
* [toolkit-container] Use installed `nvidia-ctk` path in NVIDIA Container Toolkit config
* [toolkit-container] Bump CUDA base images to 12.1.0
* [toolkit-container] Set `nvidia-ctk` path in the
* [toolkit-container] Add `cdi.k8s.io/*` to set of allowed annotations in containerd config
* [toolkit-container] Generate CDI specification for use in management containers
* [toolkit-container] Install experimental runtime as `nvidia-container-runtime.experimental` instead of `nvidia-container-runtime-experimental`
* [toolkit-container] Install and configure mode-specific runtimes for `cdi` and `legacy` modes
## v1.13.0-rc.1
* Include MIG-enabled devices as GPUs when generating CDI specification
* Fix missing NVML symbols when running `nvidia-ctk` on some platforms [#49]
* Add CDI spec generation for WSL2-based systems to `nvidia-ctk cdi generate` command
* Add `auto` mode to `nvidia-ctk cdi generate` command to automatically detect a WSL2-based system over a standard NVML-based system.
* Add mode-specific (`.cdi` and `.legacy`) NVIDIA Container Runtime binaries for use in the GPU Operator
* Discover all `gsp*.bin` GSP firmware files when generating CDI specification.
* Align `.deb` and `.rpm` release candidate package versions
* Remove `fedora35` packaging targets
* [libnvidia-container] Include all `gsp*.bin` firmware files if present
* [libnvidia-container] Align `.deb` and `.rpm` release candidate package versions
* [toolkit-container] Install `nvidia-container-toolkit-operator-extensions` package for mode-specific executables.
* [toolkit-container] Allow `nvidia-container-runtime.mode` to be set when configuring the NVIDIA Container Toolkit
## v1.12.0
* Promote `v1.12.0-rc.5` to `v1.12.0`
* Rename the `nvidia-ctk cdi generate` `--root` flag to `--driver-root` to better indicate intent
* [libnvidia-container] Add nvcubins.bin to DriverStore components under WSL2
* [toolkit-container] Bump CUDA base images to 12.0.1
## v1.12.0-rc.5
* Fix bug where the `nvidia-ctk` path was not properly resolved. This causes failures to run containers when the runtime is configured in `csv` mode or if the `NVIDIA_DRIVER_CAPABILITIES` includes `graphics` or `display` (e.g. `all`).
## v1.12.0-rc.4
* Generate a minimum CDI spec version for improved compatibility.
* Add `--device-name-strategy` options to the `nvidia-ctk cdi generate` command that can be used to control how device names are constructed.
* Set default for CDI device name generation to `index` to generate device names such as `nvidia.com/gpu=0` or `nvidia.com/gpu=1:0` by default.
## v1.12.0-rc.3
* Don't fail if by-path symlinks for DRM devices do not exist
* Replace the --json flag with a --format [json|yaml] flag for the nvidia-ctk cdi generate command
* Ensure that the CDI output folder is created if required
* When generating a CDI specification use a blank host path for devices to ensure compatibility with the v0.4.0 CDI specification
* Add injection of Wayland JSON files
* Add GSP firmware paths to generated CDI specification
* Add --root flag to nvidia-ctk cdi generate command
## v1.12.0-rc.2
* Inject Direct Rendering Manager (DRM) devices into a container using the NVIDIA Container Runtime
* Improve logging of errors from the NVIDIA Container Runtime
* Improve CDI specification generation to support rootless podman
* Use `nvidia-ctk cdi generate` to generate CDI specifications instead of `nvidia-ctk info generate-cdi`
* [libnvidia-container] Skip creation of existing files when these are already mounted
## v1.12.0-rc.1
* Add support for multiple Docker Swarm resources
* Improve injection of Vulkan configurations and libraries
* Add `nvidia-ctk info generate-cdi` command to generate a CDI specification for available devices
* [libnvidia-container] Include NVVM compiler library in compute libs
## v1.11.0
* Promote v1.11.0-rc.3 to v1.11.0
## v1.11.0-rc.3
* Build fedora35 packages
* Introduce an `nvidia-container-toolkit-base` package for better dependency management
* Fix removal of `nvidia-container-runtime-hook` on RPM-based systems
* Inject platform files into container on Tegra-based systems
* [toolkit-container] Update CUDA base images to 11.7.1
* [libnvidia-container] Preload libgcc_s.so.1 on arm64 systems
## v1.11.0-rc.2
* Allow `accept-nvidia-visible-devices-*` config options to be set by toolkit container
* [libnvidia-container] Fix bug where LDCache was not updated when the `--no-pivot-root` option was specified
## v1.11.0-rc.1
* Add discovery of GPUDirect Storage (`nvidia-fs*`) devices if the `NVIDIA_GDS` environment variable of the container is set to `enabled`
* Add discovery of MOFED Infiniband devices if the `NVIDIA_MOFED` environment variable of the container is set to `enabled`
* Fix bug in CSV mode where libraries listed as `sym` entries in mount specification are not added to the LDCache.
* Rename `nvidia-container-toolkit` executable to `nvidia-container-runtime-hook` and create `nvidia-container-toolkit` as a symlink to `nvidia-container-runtime-hook` instead.
* Add `nvidia-ctk runtime configure` command to configure the Docker config file (e.g. `/etc/docker/daemon.json`) for use with the NVIDIA Container Runtime.
## v1.10.0
* Promote v1.10.0-rc.3 to v1.10.0
## v1.10.0-rc.3
* Use default config instead of raising an error if config file cannot be found
* Ignore NVIDIA_REQUIRE_JETPACK* environment variables for requirement checks
* Fix bug in detection of Tegra systems where `/sys/devices/soc0/family` is ignored
* Fix bug where links to devices were detected as devices
* [libnvidia-container] Fix bug introduced when adding libcudadebugger.so to list of libraries
## v1.10.0-rc.2
* Add support for NVIDIA_REQUIRE_* checks for cuda version and arch to csv mode
* Switch to debug logging to reduce log verbosity
* Support logging to log files requested on the command line
* Fix bug when launching containers with relative root path (e.g. using containerd)
* Allow low-level runtime path to be set explicitly as nvidia-container-runtime.runtimes option
* Fix failure to locate low-level runtime if PATH envvar is unset
* Replace experimental option for NVIDIA Container Runtime with nvidia-container-runtime.mode = csv option
* Use csv as default mode on Tegra systems without NVML
* Add --version flag to all CLIs
* [libnvidia-container] Bump libtirpc to 1.3.2
* [libnvidia-container] Fix bug when running host ldconfig using glibc compiled with a non-standard prefix
* [libnvidia-container] Add libcudadebugger.so to list of compute libraries
## v1.10.0-rc.1
* Include nvidia-ctk CLI in installed binaries
* Add experimental option to NVIDIA Container Runtime
## v1.9.0
* [libnvidia-container] Add additional check for Tegra in /sys/.../family file in CLI
* [libnvidia-container] Update jetpack-specific CLI option to only load Base CSV files by default
* [libnvidia-container] Fix bug (from 1.8.0) when mounting GSP firmware into containers without /lib to /usr/lib symlinks
* [libnvidia-container] Update nvml.h to CUDA 11.6.1 nvML_DEV 11.6.55
* [libnvidia-container] Update switch statement to include new brands from latest nvml.h
* [libnvidia-container] Process all --require flags on Jetson platforms
* [libnvidia-container] Fix long-standing issue with running ldconfig on Debian systems
## v1.8.1
* [libnvidia-container] Fix bug in determining cgroup root when running in nested containers
* [libnvidia-container] Fix permission issue when determining cgroup version
## v1.8.0
* Promote 1.8.0-rc.2-1 to 1.8.0
## v1.8.0-rc.2
* Remove support for building amazonlinux1 packages
## v1.8.0-rc.1
* [libnvidia-container] Add support for cgroupv2
* Release toolkit-container images from nvidia-container-toolkit repository
## v1.7.0
* Promote 1.7.0-rc.1-1 to 1.7.0
* Bump Golang version to 1.16.4
## v1.7.0-rc.1
* Specify containerd runtime type as string in config tools to remove dependency on containerd package
* Add supported-driver-capabilities config option to allow for a subset of all driver capabilities to be specified
## v1.6.0
* Promote 1.6.0-rc.3-1 to 1.6.0
* Fix unnecessary logging to stderr instead of configured nvidia-container-runtime log file
## v1.6.0-rc.3
* Add supported-driver-capabilities config option to the nvidia-container-toolkit
* Move OCI and command line checks for runtime to internal oci package
## v1.6.0-rc.2
* Use relative path to OCI specification file (config.json) if bundle path is not specified as an argument to the nvidia-container-runtime
## v1.6.0-rc.1
* Add AARCH64 package for Amazon Linux 2
* Include nvidia-container-runtime into nvidia-container-toolkit package
## v1.5.1
* Fix bug where Docker Swarm device selection is ignored if NVIDIA_VISIBLE_DEVICES is also set
* Improve unit testing by using require package and adding coverage reports
* Remove unneeded go dependencies by running go mod tidy
* Move contents of pkg directory to cmd for CLI tools
* Ensure make binary target explicitly sets GOOS
## v1.5.0
* Add dependence on libnvidia-container-tools >= 1.4.0
* Add golang check targets to Makefile
* Add Jenkinsfile definition for build targets
* Move docker.mk to docker folder
## v1.4.2
* Add dependence on libnvidia-container-tools >= 1.3.3
## v1.4.1
* Ignore NVIDIA_VISIBLE_DEVICES for containers with insufficient privileges
* Add dependence on libnvidia-container-tools >= 1.3.2
## v1.4.0
* Add 'compute' capability to list of defaults
* Add dependence on libnvidia-container-tools >= 1.3.1
## v1.3.0
* Promote 1.3.0-rc.2-1 to 1.3.0
* Add dependence on libnvidia-container-tools >= 1.3.0
## v1.3.0-rc.2
* 2c180947 Add more tests for new semantics with device list from volume mounts
* 7c003857 Refactor accepting device lists from volume mounts as a boolean
## v1.3.0-rc.1
* b50d86c1 Update build system to accept a TAG variable for things like rc.x
* fe65573b Add common CI tests for things like golint, gofmt, unit tests, etc.
* da6fbb34 Revert "Add ability to merge envars of the form NVIDIA_VISIBLE_DEVICES_*"
* a7fb3330 Flip build-all targets to run automatically on merge requests
* 8b248b66 Rename github.com/NVIDIA/container-toolkit to nvidia-container-toolkit
* da36874e Add new config options to pull device list from mounted files instead of ENVVAR
## v1.2.1
* 4e6e0ed4 Add 'ngx' to list of *all* driver capabilities
* 2f4af743 List config.toml as a config file in the RPM SPEC
## v1.2.0
* 8e0aab46 Fix repo listed in changelog for debian distributions
* 320bb6e4 Update dependence on libnvidia-container to 1.2.0
* 6cfc8097 Update package license to match source license
* e7dc3cbb Fix debian copyright file
* d3aee3e0 Add the 'ngx' driver capability
## v1.1.2
* c32237f3 Add support for parsing Linux Capabilities for older OCI specs
## v1.1.1
* d202aded Update dependence to libnvidia-container 1.1.1
## v1.1.0
* 4e4de762 Update build system to support multi-arch builds
* fcc1d116 Add support for MIG (Multi-Instance GPUs)
* d4ff0416 Add ability to merge envars of the form NVIDIA_VISIBLE_DEVICES_*
The `nvidia-container-toolkit` resides in this repo directly.
In order to build the packages, the following commands are executed
```sh
./scripts/build-all-components.sh TARGET
./scripts/build-packages.sh TARGET
```
where `TARGET` is a make target that is valid for each of the sub-components.
These include:
* `ubuntu18.04-amd64`
* `centos8-x86_64`
* `centos7-x86_64`
If no `TARGET` is specified, all valid release targets are built.
The packages are generated in the `dist` folder.
## Testing packages locally
The [tests/release](./tests/release/) folder contains documentation on how the installation of local or staged packages can be tested.
## Releasing
A utility script [`scripts/release.sh`](./scripts/release.sh) is provided to build the
packages required for a release. If run without arguments, all supported distribution-architecture combinations are built. A specific distribution-architecture pair can also be provided:
```sh
./scripts/release.sh ubuntu18.04-amd64
```
where the `amd64` builds for `ubuntu18.04` are used as an example.
In order to release packages required for a release, a utility script
[`scripts/release-packages.sh`](./scripts/release-packages.sh) is also provided. The script is parameterized by `REPO`, `PACKAGE_REPO_ROOT`, and an optional `REFERENCE`, where `REPO` is one of `stable` or `experimental`, `PACKAGE_REPO_ROOT` is the local path to the `libnvidia-container` repository checked out to the `gh-pages` branch, and `REFERENCE` is the git SHA that is to be released. If no reference is specified, `HEAD` is assumed.
This script performs the following basic functions:
* Pulls the package image defined by the `REFERENCE` git SHA from the staging registry,
* Copies the required packages to the package repository at `PACKAGE_REPO_ROOT/REPO`,
* Signs the packages using the specified GPG keys
While the last two are performed, commits are added to the package repository. These can be pushed to the relevant repository.
The [user guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolk
[Checkout the Contributing document!](CONTRIBUTING.md)
* Please let us know by [filing a new issue](https://github.com/NVIDIA/nvidia-container-toolkit/issues/new)
* You can contribute by creating a [pull request](https://github.com/NVIDIA/nvidia-container-toolkit/compare) to our public GitHub repository
The NVIDIA Container Toolkit consists of the following artifacts:
- The NVIDIA Container Toolkit container
- Packages for debian-based systems
- Packages for rpm-based systems
# Release Process Checklist:
- [ ] Create a release PR:
- [ ] Run the `./hack/prepare-release.sh` script to update the version in all the needed files. This also creates a [release issue](https://github.com/NVIDIA/cloud-native-team/issues?q=is%3Aissue+is%3Aopen+label%3Arelease)
- [ ] Run the `./hack/generate-changelog.sh` script to generate a draft changelog and update `CHANGELOG.md` with the changes.
- [ ] Create a PR from the created `bump-release-{{ .VERSION }}` branch.
- [ ] Merge the release PR
- [ ] Tag the release and push the tag to the `internal` mirror:
NVIDIA is dedicated to the security and trust of our software products and services, including all source code repositories managed through our organization.
If you need to report a security issue, please use the appropriate contact points outlined below. **Please do not report security vulnerabilities through GitHub.**
## Reporting Potential Security Vulnerability in an NVIDIA Product
To report a potential security vulnerability in any NVIDIA product:
- We encourage you to use the following PGP key for secure email communication: [NVIDIA public PGP Key for communication](https://www.nvidia.com/en-us/security/pgp-key)
- Please include the following information:
- Product/Driver name and version/branch that contains the vulnerability
- Type of vulnerability (code execution, denial of service, buffer overflow, etc.)
- Instructions to reproduce the vulnerability
- Proof-of-concept or exploit code
- Potential impact of the vulnerability, including how an attacker could exploit the vulnerability
While NVIDIA currently does not have a bug bounty program, we do offer acknowledgement when an externally reported security issue is addressed under our coordinated vulnerability disclosure policy. Please visit our [Product Security Incident Response Team (PSIRT)](https://www.nvidia.com/en-us/security/psirt-policies/) policies page for more information.
## NVIDIA Product Security
For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
RUN GOPATH=/artifacts go install github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-container-runtime.experimental@${NVIDIA_CONTAINER_RUNTIME_EXPERIMENTAL_VERSION}
WORKDIR /build
COPY . .
# NOTE: Until the config utilities are properly integrated into the
# nvidia-container-toolkit repository, these are built from the `tools` folder
# and not `cmd`.
RUN GOPATH=/artifacts go install -ldflags="-s -w -X 'main.Version=${VERSION}'" ./tools/...
FROM nvidia/cuda:${CUDA_VERSION}-base-${BASE_DIST}
ARG BASE_DIST
# See https://www.centos.org/centos-linux-eol/
# and https://stackoverflow.com/a/70930049 for move to vault.centos.org
# and https://serverfault.com/questions/1093922/failing-to-run-yum-update-in-centos-8 for move to vault.epel.cloud
RUN [[ "${BASE_DIST}" != "centos8" ]] || \
    ( \
        sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-Linux-* && \
        sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.epel.cloud|g' /etc/yum.repos.d/CentOS-Linux-* \
RUN GOPATH=/artifacts go install github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-container-runtime.experimental@${NVIDIA_CONTAINER_RUNTIME_EXPERIMENTAL_VERSION}
WORKDIR /build
COPY . .
# NOTE: Until the config utilities are properly integrated into the
# nvidia-container-toolkit repository, these are built from the `tools` folder
# and not `cmd`.
RUN GOPATH=/artifacts go install -ldflags="-s -w -X 'main.Version=${VERSION}'" ./tools/...
Usage: "This hook ensures that the folder containing the CUDA compat libraries is added to the ldconfig search path if required.",
Before: func(c *cli.Context) error {
	return m.validateFlags(c, &cfg)
},
Action: func(c *cli.Context) error {
	return m.run(c, &cfg)
},
}
c.Flags = []cli.Flag{
	&cli.StringFlag{
		Name:        "host-driver-version",
		Usage:       "Specify the host driver version. If the CUDA compat libraries detected in the container do not have a higher MAJOR version, the hook is a no-op.",
		Destination: &cfg.hostDriverVersion,
	},
	&cli.StringFlag{
		Name:     "container-spec",
		Hidden:   true,
		Category: "testing-only",
		Usage:    "Specify the path to the OCI container spec. If empty or '-' the spec will be read from STDIN",
log.Panicf("Invalid value for config option '%v'; %v (supported: %v)\n", configName, config.SupportedDriverCapabilities, allSupportedDriverCapabilities.String())
}
return config, nil
}
// getConfigOption returns the toml config option associated with the
return fmt.Errorf("invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead")
log.Panicln("invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime instead.")
The NVIDIA Container Runtime is a shim for OCI-compliant low-level runtimes such as [runc](https://github.com/opencontainers/runc). When a `create` command is detected, the incoming [OCI runtime specification](https://github.com/opencontainers/runtime-spec) is modified in place and the command is forwarded to the low-level runtime.
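As a rough illustration only, the following Go sketch shows the general shape of such a shim: inspect the arguments for a `create` command and forward the invocation to a low-level runtime (hard-coded to `runc` here as an assumption). This is not the NVIDIA Container Runtime's implementation.
```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	for _, arg := range os.Args[1:] {
		if arg == "create" {
			// A real shim would locate the container's config.json (e.g. via the
			// --bundle argument) and modify the OCI spec in place at this point.
			fmt.Fprintln(os.Stderr, "create detected: OCI spec would be modified in place")
			break
		}
	}

	// Forward the original arguments, unchanged, to the low-level runtime.
	cmd := exec.Command("runc", os.Args[1:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```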
## Standard Mode
In the standard mode configuration, the NVIDIA Container Runtime adds a [`prestart` hook](https://github.com/opencontainers/runtime-spec/blob/master/config.md#prestart) to the incoming OCI specification that invokes the NVIDIA Container Runtime Hook for all containers created. This hook checks whether NVIDIA devices are requested and ensures GPU access is configured using the `nvidia-container-cli` from project [libnvidia-container](https://github.com/NVIDIA/libnvidia-container).
## Experimental Mode
The NVIDIA Container Runtime can be configured in an experimental mode by setting the corresponding options in the runtime's `config.toml` file.
## Configuration
The NVIDIA Container Runtime uses file-based configuration, with the config stored in `/etc/nvidia-container-runtime/config.toml`. The `/etc` path can be overridden using the `XDG_CONFIG_HOME` environment variable, in which case the `${XDG_CONFIG_HOME}/nvidia-container-runtime/config.toml` file is used instead.
This config file may contain options for other components of the NVIDIA container stack; for the NVIDIA Container Runtime, the relevant config section is `nvidia-container-runtime`.
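As a rough illustration of this lookup, here is a minimal sketch assuming only the default `/etc` path and the `XDG_CONFIG_HOME` override described above; it is not the actual implementation.
```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveConfigFilePath sketches the lookup described above: use
// ${XDG_CONFIG_HOME}/nvidia-container-runtime/config.toml when
// XDG_CONFIG_HOME is set, and fall back to /etc otherwise.
func resolveConfigFilePath() string {
	configRoot := "/etc"
	if xdg := os.Getenv("XDG_CONFIG_HOME"); xdg != "" {
		configRoot = xdg
	}
	return filepath.Join(configRoot, "nvidia-container-runtime", "config.toml")
}

func main() {
	fmt.Println(resolveConfigFilePath())
}
```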
### Logging
The `log-level` config option (default: `"info"`) specifies the log level to use, and the `debug` option, if set, specifies a log file to which the NVIDIA Container Runtime logs are written.
In addition to this, the NVIDIA Container Runtime considers the values of the `--log` and `--log-format` flags that may be passed to it by a container runtime such as docker or containerd. If the `--debug` flag is present, the log level specified in the config file is overridden to `"debug"`.
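A minimal sketch of the precedence described above; the function and its inputs are illustrative, not the actual implementation.
```go
package main

import "fmt"

// effectiveLogLevel returns the log level from the config file unless the
// invoking runtime passed a --debug flag, in which case "debug" wins.
func effectiveLogLevel(configLogLevel string, debugFlagSet bool) string {
	if debugFlagSet {
		return "debug"
	}
	if configLogLevel == "" {
		return "info" // the documented default
	}
	return configLogLevel
}

func main() {
	fmt.Println(effectiveLogLevel("info", true))  // debug
	fmt.Println(effectiveLogLevel("warn", false)) // warn
}
```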
### Low-level Runtime Path
The `runtimes` config option allows the low-level runtime to be specified. The first entry in this list that resolves to an existing executable file is used as the low-level runtime: if the entry is a path, that path is checked directly; otherwise the `PATH` is searched for a matching executable. A sketch of this resolution logic follows the examples below.
The default value for this setting is:
```toml
runtimes = [
    "runc",
    "crun",
]
```
If, for example, `crun` is to be used instead, this can be changed to:
```toml
runtimes = [
    "crun",
]
```
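A minimal sketch of the candidate resolution described above; `resolveLowLevelRuntime` is an illustrative helper, not part of the toolkit's API.
```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// resolveLowLevelRuntime returns the first candidate that resolves to an
// existing executable: paths are checked directly, bare names are looked up
// in the PATH.
func resolveLowLevelRuntime(candidates []string) (string, error) {
	for _, candidate := range candidates {
		if strings.ContainsRune(candidate, os.PathSeparator) {
			// The candidate is a path; check that it exists and is not a directory.
			if info, err := os.Stat(candidate); err == nil && !info.IsDir() {
				return candidate, nil
			}
			continue
		}
		// The candidate is a name; search the PATH for a matching executable.
		if path, err := exec.LookPath(candidate); err == nil {
			return path, nil
		}
	}
	return "", fmt.Errorf("no candidate runtime found in %v", candidates)
}

func main() {
	runtime, err := resolveLowLevelRuntime([]string{"runc", "crun"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(runtime)
}
```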
### Runtime Mode
The `mode` config option (default `"auto"`) controls the high-level behaviour of the runtime.
#### Auto Mode
When `mode` is set to `"auto"`, the runtime employs heuristics to determine which mode to use based on, for example, the platform where the runtime is being run.
#### Legacy Mode
When `mode` is set to `"legacy"`, the NVIDIA Container Runtime adds a [`prestart` hook](https://github.com/opencontainers/runtime-spec/blob/master/config.md#prestart) to the incomming OCI specification that invokes the NVIDIA Container Runtime Hook for all containers created. This hook checks whether NVIDIA devices are requested and ensures GPU access is configured using the `nvidia-container-cli` from the [libnvidia-container](https://github.com/NVIDIA/libnvidia-container) project.
#### CSV Mode
When `mode` is set to `"csv"`, CSV files at `/etc/nvidia-container-runtime/host-files-for-container.d` define the devices and mounts that are to be injected into a container when it is created. The search path for the files can be overridden by modifying the `nvidia-container-runtime.modes.csv.mount-spec-path` in the config as below:
When this setting is enabled, the modifications made to the OCI specification are controlled by the `nvidia-container-runtime.discover-mode` option, with the following mode supported:
*`"legacy"`: This mode mirrors the behaviour of the standard mode, inserting the NVIDIA Container Runtime Hook as a `prestart` hook into the container's OCI specification.
*`"csv"`: This mode uses CSV files at `/etc/nvidia-container-runtime/host-files-for-container.d` to define the devices and mounts that are to be injected into a container when it is created.
This mode is primarily targeted at Tegra-based systems without NVML available.
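Tying the mode descriptions above together, here is a simplified sketch of how the `"auto"` resolution might work. The `platform` struct and the exact heuristic are illustrative assumptions; the real heuristics consider more system properties.
```go
package main

import "fmt"

// platform captures the two properties mentioned above as inputs to the
// auto-mode heuristic.
type platform struct {
	isTegra       bool
	nvmlAvailable bool
}

// resolveMode returns the requested mode unless it is "auto", in which case
// Tegra-based systems without NVML resolve to "csv" and everything else
// falls back to "legacy".
func resolveMode(mode string, p platform) string {
	if mode != "auto" {
		return mode
	}
	if p.isTegra && !p.nvmlAvailable {
		return "csv"
	}
	return "legacy"
}

func main() {
	fmt.Println(resolveMode("auto", platform{isTegra: true, nvmlAvailable: false})) // csv
	fmt.Println(resolveMode("auto", platform{isTegra: false, nvmlAvailable: true})) // legacy
}
```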
### Notes on using the docker CLI
The `docker` CLI supports the `--gpus` flag to select GPUs for inclusion in a container. Since specifying this flag inserts the same NVIDIA Container Runtime Hook into the OCI runtime specification, the NVIDIA Container Runtime detects the presence of the hook when experimental mode is activated and raises an error. This requirement will be relaxed in the near future.
Note that only the `"legacy"` NVIDIA Container Runtime mode is directly compatible with the `--gpus` flag implemented by the `docker` CLI (assuming the NVIDIA Container Runtime is not used). The reason for this is that `docker` inserts the same NVIDIA Container Runtime Hook into the OCI runtime specification.
If a different mode is explicitly set or detected, the NVIDIA Container Runtime Hook will raise the following error when `--gpus` is set:
```
$ docker run --rm --gpus all ubuntu:18.04
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'csv'
invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime instead.: unknown.
```
Here the NVIDIA Container Runtime must be used explicitly. The recommended way to do this is to specify the `--runtime=nvidia` command line argument as part of the `docker run` command as follows:
```
$ docker run --rm --gpus all --runtime=nvidia ubuntu:18.04
```
Alternatively the NVIDIA Container Runtime can be set as the default runtime for docker. This can be done by modifying the `/etc/docker/daemon.json` file as follows:
```json
{
"default-runtime":"nvidia",
"runtimes":{
"nvidia":{
"path":"nvidia-container-runtime",
"runtimeArgs":[]
}
}
}
```
## Environment variables (OCI spec)
Each environment variable maps to a command-line argument for `nvidia-container-cli` from [libnvidia-container](https://github.com/NVIDIA/libnvidia-container).
These variables are already set in our [official CUDA images](https://hub.docker.com/r/nvidia/cuda/).
### `NVIDIA_VISIBLE_DEVICES`
This variable controls which GPUs will be made accessible inside the container.
#### Possible values
* `0,1,2`, `GPU-fef8089b` …: a comma-separated list of GPU UUID(s) or index(es).
* `all`: all GPUs will be accessible; this is the default value in our container images.
* `none`: no GPU will be accessible, but driver capabilities will be enabled.
* `void` or *empty* or *unset*: `nvidia-container-runtime` will have the same behavior as `runc`.
**Note**: When running on a MIG capable device, the following values will also be available:
* `0:0,0:1,1:0`, `MIG-GPU-fef8089b/0/1` …: a comma-separated list of MIG Device UUID(s) or index(es).
Here the MIG device indices have the form `<GPU Device Index>:<MIG Device Index>`. A sketch of how these values might be interpreted is shown below.
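The following is a simplified sketch of how the values listed above might be interpreted; `interpretVisibleDevices` is a hypothetical helper, not the toolkit's implementation.
```go
package main

import (
	"fmt"
	"strings"
)

// interpretVisibleDevices maps an NVIDIA_VISIBLE_DEVICES value to a device
// list and a flag indicating runc-like passthrough behaviour.
func interpretVisibleDevices(value string) (devices []string, passthrough bool) {
	switch value {
	case "", "void":
		// Behave like runc: no modifications are made.
		return nil, true
	case "none":
		// No devices, but driver capabilities are still enabled.
		return nil, false
	case "all":
		return []string{"all"}, false
	default:
		// A comma-separated list of indices, UUIDs, or MIG identifiers.
		return strings.Split(value, ","), false
	}
}

func main() {
	devices, passthrough := interpretVisibleDevices("0,1,GPU-fef8089b")
	fmt.Println(devices, passthrough)
}
```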
Configure the `nvidia-container-runtime` as a docker runtime named `NAME`. If the `--runtime-name` flag is not specified, this runtime would be called `nvidia`. A runtime named `nvidia-experimental` will also be configured using the `nvidia-container-runtime-experimental` OCI-compliant runtime shim.
Configure the `nvidia-container-runtime` as a docker runtime named `NAME`. If the `--runtime-name` flag is not specified, this runtime would be called `nvidia`.
Since `--set-as-default` is enabled by default, the specified runtime name will also be set as the default docker runtime. This can be disabled by explicitly specifying `--set-as-default=false`.
**Note**: If `--runtime-name` is specified as `nvidia-experimental` explicitly, the `nvidia-experimental` runtime will be configured as the default runtime, with the `nvidia` runtime still configured and available for use.
The following table describes the behaviour for different `--runtime-name` and `--set-as-default` flag combinations.
These combinations also hold for the environment variables that map to the command line flags: `DOCKER_RUNTIME_NAME`, `DOCKER_SET_AS_DEFAULT`.
```bash
containerd setup \
    /run/nvidia/toolkit
```
Configure the `nvidia-container-runtime` as a runtime class named `NAME`. If the `--runtime-class` flag is not specified, this runtime would be called `nvidia`. A runtime class named `nvidia-experimental` will also be configured using the `nvidia-container-runtime-experimental` OCI-compliant runtime shim.
Configure the `nvidia-container-runtime` as a runtime class named `NAME`. If the `--runtime-class` flag is not specified, this runtime would be called `nvidia`.
Adding the `--set-as-default` flag as follows:
```bash
containerd setup \
    --set-as-default \
    /run/nvidia/toolkit
```
will set the runtime class `NAME` (or `nvidia` if not specified) as the default runtime class.
**Note**: If `--runtime-class` is specified as `nvidia-experimental` explicitly and `--set-as-default` is specified, the `nvidia-experimental` runtime will be configured as the default runtime class, with the `nvidia` runtime class still configured and available for use.
The following table describes the behaviour for different `--runtime-class` and `--set-as-default` flag combinations.
// NewApp creates the CLI app for the specified options.
func NewApp(logger logger.Interface) *cli.App {
	a := app{
		logger: logger,
	}
	return a.build()
}
func (a app) build() *cli.App {
	options := options{
		toolkitOptions: toolkit.Options{},
	}

	// Create the top-level CLI
	c := cli.NewApp()
	c.Name = "nvidia-ctk-installer"
	c.Usage = "Install the NVIDIA Container Toolkit and configure the specified runtime to use the `nvidia` runtime."
	c.Version = info.GetVersionString()
	c.Before = func(ctx *cli.Context) error {
		return a.Before(ctx, &options)
	}
	c.Action = func(ctx *cli.Context) error {
		return a.Run(ctx, &options)
	}

	// Setup flags for the CLI
	c.Flags = []cli.Flag{
		&cli.BoolFlag{
			Name:        "no-daemon",
			Aliases:     []string{"n"},
			Usage:       "terminate immediately after setting up the runtime. Note that no cleanup will be performed",
			Destination: &options.noDaemon,
			EnvVars:     []string{"NO_DAEMON"},
		},
		&cli.StringFlag{
			Name:        "runtime",
			Aliases:     []string{"r"},
			Usage:       "the runtime to setup on this node. One of {'docker', 'crio', 'containerd'}",
			Value:       defaultRuntime,
			Destination: &options.runtime,
			EnvVars:     []string{"RUNTIME"},
		},
		&cli.StringFlag{
			Name:    "toolkit-install-dir",
			Aliases: []string{"root"},
			Usage: "The directory where the NVIDIA Container Toolkit is to be installed. " +
				"The components of the toolkit will be installed to `ROOT`/toolkit. " +
				"Note that in the case of a containerized installer, this is the path in the container and it is " +
				"recommended that this match the path on the host.",
			Value:       defaultToolkitInstallDir,
			Destination: &options.toolkitInstallDir,
			EnvVars:     []string{"TOOLKIT_INSTALL_DIR", "ROOT"},
		},
		&cli.StringFlag{
			Name:        "toolkit-source-root",
			Usage:       "The folder where the required toolkit artifacts can be found. If this is not specified, the path /artifacts/{{ .ToolkitPackageType }} is used where ToolkitPackageType is the resolved package type",
			Destination: &options.sourceRoot,
			EnvVars:     []string{"TOOLKIT_SOURCE_ROOT"},
		},
		&cli.StringFlag{
			Name:        "toolkit-package-type",
			Usage:       "specify the package type to use for the toolkit. One of ['deb', 'rpm', 'auto', '']. If 'auto' or '' are used, the type is inferred automatically.",
			Value:       "auto",
			Destination: &options.packageType,
			EnvVars:     []string{"TOOLKIT_PACKAGE_TYPE"},
		},
		&cli.StringFlag{
			Name:  "pid-file",
			Value: defaultPidFile,
			Usage: "the path to a toolkit.pid file to ensure that only a single configuration instance is running",
Usage:"enable the generation of a CDI specification",
Destination:&opts.CDI.Enabled,
EnvVars:[]string{"CDI_ENABLED","ENABLE_CDI"},
},
&cli.StringFlag{
Name:"cdi-output-dir",
Usage:"the directory where the CDI output files are to be written. If this is set to '', no CDI specification is generated.",
Value:"/var/run/cdi",
Destination:&opts.CDI.outputDir,
EnvVars:[]string{"CDI_OUTPUT_DIR"},
},
&cli.StringFlag{
Name:"cdi-kind",
Usage:"the vendor string to use for the generated CDI specification",
Value:"management.nvidia.com/gpu",
Destination:&opts.CDI.kind,
EnvVars:[]string{"CDI_KIND"},
},
&cli.BoolFlag{
Name:"ignore-errors",
Usage:"ignore errors when installing the NVIDIA Container toolkit. This is used for testing purposes only.",
Hidden:true,
Destination:&opts.ignoreErrors,
},
&cli.StringSliceFlag{
Name:"create-device-nodes",
Usage:"(Only applicable with --cdi-enabled) specifies which device nodes should be created. If any one of the options is set to '' or 'none', no device nodes will be created.",
By default, all commands output to `STDOUT`, but specifying the `--output` flag writes the config to the specified file.
### Generate CDI specifications
The [Container Device Interface (CDI)](https://tags.cncf.io/container-device-interface) provides
a vendor-agnostic mechanism to make arbitrary devices accessible in containerized environments. To allow NVIDIA devices to be
used in these environments, the NVIDIA Container Toolkit CLI includes functionality to generate a CDI specification for the
available NVIDIA GPUs in a system.
In order to generate the CDI specification for the available devices, run the following command:
```bash
nvidia-ctk cdi generate
```
By default, the specification is printed to STDOUT; a file can be specified using the `--output` flag.
The specification will contain device entries as follows (where applicable):
* An `nvidia.com/gpu=gpu{INDEX}` device for each non-MIG-enabled full GPU in the system
* An `nvidia.com/gpu=mig{GPU_INDEX}:{MIG_INDEX}` device for each MIG device in the system
* A special device called `nvidia.com/gpu=all` which represents all available devices.
For example, the CDI specification can be generated in the default location where CDI-enabled tools such as `podman`, `containerd`, `cri-o`, or the NVIDIA Container Runtime can be configured to load it by passing an appropriate `--output` path.
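As a rough illustration of the naming convention in the bullets above, here is a small sketch; the helper functions are hypothetical and the device names simply follow the formats listed.
```go
package main

import "fmt"

// fullGPUDevice builds the qualified name for a non-MIG-enabled full GPU.
func fullGPUDevice(index int) string {
	return fmt.Sprintf("nvidia.com/gpu=gpu%d", index)
}

// migDevice builds the qualified name for a MIG device on a given GPU.
func migDevice(gpuIndex, migIndex int) string {
	return fmt.Sprintf("nvidia.com/gpu=mig%d:%d", gpuIndex, migIndex)
}

func main() {
	fmt.Println(fullGPUDevice(0))     // nvidia.com/gpu=gpu0
	fmt.Println(migDevice(0, 1))      // nvidia.com/gpu=mig0:1
	fmt.Println("nvidia.com/gpu=all") // the special device representing all devices
}
```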
Usage:"The mode to use when discovering the available entities. "+
"One of ["+strings.Join(nvcdi.AllModes[string]()," | ")+"]. "+
"If mode is set to 'auto' the mode will be determined based on the system configuration.",
Value:string(nvcdi.ModeAuto),
Destination:&opts.mode,
EnvVars:[]string{"NVIDIA_CTK_CDI_GENERATE_MODE"},
},
&cli.StringFlag{
Name:"dev-root",
Usage:"Specify the root where `/dev` is located. If this is not specified, the driver-root is assumed.",
Destination:&opts.devRoot,
EnvVars:[]string{"NVIDIA_CTK_DEV_ROOT"},
},
&cli.StringSliceFlag{
Name:"device-name-strategy",
Usage:"Specify the strategy for generating device names. If this is specified multiple times, the devices will be duplicated for each strategy. One of [index | uuid | type-index]",
Usage:"Specify the NVIDIA GPU driver root to use when discovering the entities that should be included in the CDI specification.",
Destination:&opts.driverRoot,
EnvVars:[]string{"NVIDIA_CTK_DRIVER_ROOT"},
},
&cli.StringSliceFlag{
Name:"library-search-path",
Usage:"Specify the path to search for libraries when discovering the entities that should be included in the CDI specification.\n\tNote: This option only applies to CSV mode.",