Commit Graph

449 Commits

Author SHA1 Message Date
Evan Lezar
e5b690e200
Resolve to legacy by default in nvidia-container-runtime-hook
Some checks are pending
CI Pipeline / code-scanning (push) Waiting to run
CI Pipeline / variables (push) Waiting to run
CI Pipeline / golang (push) Waiting to run
CI Pipeline / image (push) Blocked by required conditions
CI Pipeline / e2e-test (push) Blocked by required conditions
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-16 11:25:16 +02:00
Evan Lezar
a39d147cbb
Default to jit-cdi mode in the nvidia runtime
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-16 11:25:13 +02:00
Evan Lezar
5d4da6200a
[no-relnote] Add RuntimeMode type
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-16 11:05:54 +02:00
Evan Lezar
7f5ad9c5b2
Ensure consistent sorting of annotation devices
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-16 10:52:06 +02:00
Evan Lezar
8bfce9488d
Use functional options to construct runtime mode resolver
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-16 10:52:06 +02:00
Evan Lezar
6359cc9919
Remove docker-run as default runtime candidate
This change removes docker-runc as the highest priority
default candidate for the low-level runtimes supported by
the nvidia-container-runtime.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-13 17:26:19 +02:00
Evan Lezar
4f9c860a37
Merge pull request #927 from elezar/disable-device-node-creation
Disable device node creation in CDI mode
2025-06-13 16:44:18 +02:00
Evan Lezar
cc7812470f
Merge pull request #1143 from elezar/add-device-ids-to-getspec
Add device IDs to nvcdi.GetSpec API
2025-06-13 16:41:43 +02:00
Evan Lezar
8be03cfc41
[no-relnote] Ignore annotation devices for non-CDI modes
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-13 15:26:48 +02:00
Evan Lezar
8650ca6533
[no-relnote] Move hookCreator initialisation for readability
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-13 15:00:51 +02:00
Evan Lezar
1bc2a9fee3
Return annotation devices from VisibleDevices
This change includes annotation devices in CUDA.VisibleDevices
with the highest priority. This allows for the CDI device
request extraction to be consistent across all request mechanisms.

Note that this does change behaviour in the following ways:
1. Annotations are considered when resolving the runtime mode.
2. Incorrectly formed device names in annotations are no longer treated as an error.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-13 14:56:08 +02:00
Evan Lezar
dc87dcf786
Make CDI device requests consistent with other methods
Following the refactoring of device request extraction, we can
now make CDI device requests consistent with other methods.

This change moves to using image.VisibleDevices instead of
separate calls to CDIDevicesFromMounts and VisibleDevicesFromEnvVar.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-13 14:34:02 +02:00
Evan Lezar
f17d424248
Construct container info once
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-13 14:05:51 +02:00
Evan Lezar
426186c992
Add logic to extract annotation device requests to image type
This change updates the image.CUDA type to also extract CDI
device requests. These are only relevant IF CDI prefixes are
specifically set.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-13 14:05:51 +02:00
Evan Lezar
6849ebd621
Add IsPrivileged function to CUDA container type
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-13 14:05:51 +02:00
Evan Lezar
0134ba4250
Add device IDs to nvcdi.GetSpec API
This change allows device IDs to the specified in the GetSpec API.
This simplifies cases where CDI specs are being generated for specific
devices by ID.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-12 16:23:11 +02:00
Evan Lezar
27f5ec83de
Merge pull request #1125 from elezar/vulkan-target-cpu
Some checks failed
CI Pipeline / code-scanning (push) Has been cancelled
CI Pipeline / variables (push) Has been cancelled
CI Pipeline / golang (push) Has been cancelled
CI Pipeline / image (push) Has been cancelled
CI Pipeline / e2e-test (push) Has been cancelled
Add discovery of arch-specific vulkan ICD
2025-06-05 14:09:50 +02:00
Carlos Eduardo Arango Gutierrez
dede03f322
Refactor extracting requested devices from the container image
This change consolidates the logic for determining requested devices
from the container image. The logic for this has been integrated into
the image.CUDA type so that multiple implementations are not required.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
2025-06-05 12:38:45 +02:00
Evan Lezar
2de997e25b
Add discovery of arch-specific vulkan ICD
On some RPM-based platforms, the path of the Vulkan ICD file
include an architecture-specific infix to distinguish it from
other architectures. (Most notably x86_64 vs i686). This change
attempts to discover the arch-specific ICD file in addition to
the standard nvidia_icd.json.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-04 23:06:16 +02:00
Evan Lezar
e046d6ae79
Add disabled-device-node-modification hook to CDI spec
Some checks failed
CI Pipeline / code-scanning (push) Has been cancelled
CI Pipeline / variables (push) Has been cancelled
CI Pipeline / golang (push) Has been cancelled
CI Pipeline / image (push) Has been cancelled
CI Pipeline / e2e-test (push) Has been cancelled
This hook is not added to management specs.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-04 19:08:48 +02:00
Carlos Eduardo Arango Gutierrez
6cf0248321
Added ability to disable specific (or all) CDI hooks
This change adds the ability to disabled specific (or all) CDI hooks to
both the nvidia-ctk cdi generate command and the nvcdi API.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-06-03 16:01:20 +02:00
Carlos Eduardo Arango Gutierrez
b4787511d2
Consolidate HookName functionality on internal/discover pkg
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2025-06-03 15:24:43 +02:00
Evan Lezar
479df7134a
Add envvar to control debug logging in CDI hooks
This change allows hooks to be configured with debug logging. This
is currently only enabled for the hooks generated from the runtime.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-05-30 15:27:52 +02:00
Carlos Eduardo Arango Gutierrez
aaaa3c6275
Edit discover.mounts to have a deterministic output
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-05-22 15:25:03 +02:00
Carlos Eduardo Arango Gutierrez
cf3b9317ef
Refactor the way we create CDI Hooks
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2025-05-21 10:19:47 +02:00
Evan Lezar
6dfd63f4a8
Merge pull request #980 from elezar/add-rprivate-to-mount-options
Some checks are pending
CI Pipeline / code-scanning (push) Waiting to run
CI Pipeline / variables (push) Waiting to run
CI Pipeline / golang (push) Waiting to run
CI Pipeline / image (push) Blocked by required conditions
CI Pipeline / e2e-test (push) Blocked by required conditions
Add rprivate to CDI mount options
2025-05-16 07:53:39 +02:00
Evan Lezar
f4981f0876
Add cuda-compat-mode config option
This change adds an nvidia-container-runtime.modes.legacy.cuda-compat-mode
config option. This can be set to one of four values:

* ldconfig (default): the --cuda-compat-mode=ldconfig flag is passed to the nvidia-container-cli
* mount: the --cuda-compat-mode=mount flag is passed to the nvidia-conainer-cli
* disabled: the --cuda-compat-mode=disabled flag is passed to the nvidia-container-cli
* hook: the --cuda-compat-mode=disabled flag is passed to the nvidia-container-cli AND the
  enable-cuda-compat hook is used to provide forward compatibility.

Note that the disable-cuda-compat-lib-hook feature flag will prevent the enable-cuda-compat
hook from being used. This change also means that the allow-cuda-compat-libs-from-container
feature flag no longer has any effect.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-05-13 21:49:53 +02:00
Evan Lezar
a4dc28bb3f
Fix mode detection on Thor-based systems
This change updates github.com/NVIDIA/go-nvlib from v0.7.1 to v0.7.2
to allow Thor systems to be detected as Tegra-based. This allows fixes
automatic mode detection to work on these systems.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-05-13 21:25:11 +02:00
Evan Lezar
d0103aa6a3
Add rprivate to CDI mount options
Some checks failed
CI Pipeline / code-scanning (push) Has been cancelled
CI Pipeline / variables (push) Has been cancelled
CI Pipeline / golang (push) Has been cancelled
CI Pipeline / image (push) Has been cancelled
CI Pipeline / e2e-test (push) Has been cancelled
This ensures that mount propagation is set to rprivate for
mounts from the host into the container. This aligns with the
default in docker.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-05-09 15:16:13 +02:00
Evan Lezar
adb5e6719d
Merge pull request #1046 from elezar/resolve-ldcache-libs-on-arm64
Some checks failed
CI Pipeline / code-scanning (push) Has been cancelled
CI Pipeline / variables (push) Has been cancelled
CI Pipeline / golang (push) Has been cancelled
CI Pipeline / image (push) Has been cancelled
CI Pipeline / e2e-test (push) Has been cancelled
Fix resolution of libs in LDCache on ARM
2025-05-09 15:04:00 +02:00
Evan Lezar
0c765c6536
Skip nil discoverers in merge
When constructing a list of discoverers using discover.Merge we
explicitly skip `nil` discoverers to simplify usage as we don't
have to explicitly check validity when processing the discoverers
in the list.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-05-07 12:51:38 +02:00
Evan Lezar
e6cd7a3b53
Fix resolution of libs in LDCache on ARM
Since we explicitly check for the architecture of the
libraries in the ldcache, we need to also check the architecture
flag against the ARM constants.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-04-23 14:28:28 +02:00
Evan Lezar
986f3db971 Fix race condition in mounts cache
This change switches to using the WithCache decorator for
mounts instead of keeping track of a cache locally.

This addresses a race condition when using the mounts structure.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-04-02 16:29:49 +02:00
Evan Lezar
be9d7b6db1
[no-relnote] Fix QF1002: could use tagged switch on info.SubType lint errors
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-04-02 14:18:32 +02:00
Evan Lezar
c1c6534b1f
[no-relnote] Fix QF1008: could remove embedded field lint errors
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-04-02 14:18:32 +02:00
Evan Lezar
2a9bae8e80
[no-relnotes] Update moq to use rm and goimports
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-03-12 13:12:13 +02:00
Evan Lezar
bc9ec77fdd
Merge pull request #943 from elezar/add-disable-imex-channels-feature
Some checks failed
CI Pipeline / code-scanning (push) Has been cancelled
CI Pipeline / variables (push) Has been cancelled
CI Pipeline / golang (push) Has been cancelled
CI Pipeline / image (push) Has been cancelled
CI Pipeline / e2e-test (push) Has been cancelled
Add ignore-imex-channel-requests feature flag
2025-02-28 17:53:28 +02:00
Evan Lezar
aff9301f2e
Add disable-cuda-compat-lib-hook feature flag
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-02-27 15:58:15 +02:00
Evan Lezar
2adef9903e
Ensure that mode hook is executed last
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-02-27 15:58:15 +02:00
Evan Lezar
b7fbd56f7e
Add ldconfig hook in legacy mode
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-02-27 15:58:15 +02:00
Evan Lezar
bd87c009ba
Add enable-cuda-compat hook if required
This change adds the enable-cuda-compat hook to the incomming OCI runtime spec
if the allow-cuda-compat-libs-from-container feature flag is not enabled.

An update-ldcache hook is also injected to ensure that the required folders
are processed.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-02-27 15:58:15 +02:00
Evan Lezar
352b55c8ce
Add ignore-imex-channel-requests feature flag
This allows the NVIDIA Container Toolkit to ignore IMEX channel requests
through the NVIDIA_IMEX_CHANNELS envvar or volume mounts and ensures that
the NVIDIA Container Toolkit cannot be used to provide out-of-band access
to an IMEX channel by simply specifying an environment variable, possibly
bypassing other checks by an orchestration system such as kubernetes.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-02-26 17:46:36 +02:00
Evan Lezar
03152dba8d
Allow cdi mode to work with --gpus flag
This changes ensures that the cdi modifier also removes the NVIDIA
Container Runtime Hook from the incoming spec. This aligns with what is
done for CSV modifications and prevents an error when starting the
container.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-02-05 19:01:43 +01:00
Evan Lezar
cf026dce9a
[no-relnote] Remove duplicate test case
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-02-05 19:01:43 +01:00
Carlos Eduardo Arango Gutierrez
bf9d618ff2
Rename test folder to tests
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-01-23 11:46:14 +01:00
Evan Lezar
ed3b52eb8d
Add allow-cuda-compat-libs-from-container feature flag
This change adds an allow-cuda-compat-libs-from-container feature flag
to the NVIDIA Container Toolkit config. This allows a user to opt-in
to the previous default behaviour of overriding certain driver
libraries with CUDA compat libraries from the container.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-01-22 17:34:20 +01:00
Evan Lezar
991b9c222f
Skip graphics modifier in CSV mode
In CSV mode the CSV files at /etc/nvidia-container-runtime/host-files-for-container.d/
should be the source of truth for container modifications. This change skips graphics
modifications to a container. This prevents conflicts when handling files such as
vulkan icd files which are already defined in the CSV file.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-01-22 13:58:31 +01:00
Evan Lezar
fdad3927b4
[no-relnote] Refactor oci spec modifier list
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-01-22 13:58:31 +01:00
Evan Lezar
d584f9a40b
[no-relnote] Sort feature flags
Signed-off-by: Evan Lezar <elezar@nvidia.com>
2025-01-15 13:27:14 +01:00
Evan Lezar
2529aebd6c
Fix create-device-node test when devices exist
This changes fixes the TestCreateControlDevices test on systems
where device nodes exist.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2024-12-06 14:05:51 +01:00