Merge pull request #201 from youkaichao/no_gdrcopy

Remove the dependency on GDRCopy
Shangyan Zhou
2025-06-10 16:11:32 +08:00
committed by GitHub
2 changed files with 64 additions and 69 deletions

third-party/README.md (96 lines changed)

@@ -8,66 +8,19 @@
## Prerequisites
1. [GDRCopy](https://github.com/NVIDIA/gdrcopy) (v2.4 and above recommended) is a low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology, and *it requires kernel module installation with root privileges.*
2. Hardware requirements
- GPUDirect RDMA capable devices, see [GPUDirect RDMA Documentation](https://docs.nvidia.com/cuda/gpudirect-rdma/)
Hardware requirements:
- GPUs within a node need to be connected by NVLink
- GPUs on different nodes need to be connected by RDMA devices, see [GPUDirect RDMA Documentation](https://docs.nvidia.com/cuda/gpudirect-rdma/)
- InfiniBand GPUDirect Async (IBGDA) support, see [IBGDA Overview](https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async/)
- For more detailed requirements, see [NVSHMEM Hardware Specifications](https://docs.nvidia.com/nvshmem/release-notes-install-guide/install-guide/abstract.html#hardware-requirements)
## Installation procedure
### 1. Install GDRCopy
GDRCopy requires kernel module installation on the host system. Complete these steps on the bare-metal host before container deployment:
#### Build and installation
```bash
wget https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.4.tar.gz
tar -xzf v2.4.4.tar.gz  # unpack the release archive before entering it
cd gdrcopy-2.4.4/
make -j$(nproc)
sudo make prefix=/opt/gdrcopy install
```
#### Kernel module installation
After compiling the software, install the packages that match your Linux distribution.
For example, on Ubuntu 22.04 with CUDA 12.3:
```bash
pushd packages
CUDA=/path/to/cuda ./build-deb-packages.sh
sudo dpkg -i gdrdrv-dkms_2.4.4_amd64.Ubuntu22_04.deb \
libgdrapi_2.4.4_amd64.Ubuntu22_04.deb \
gdrcopy-tests_2.4.4_amd64.Ubuntu22_04+cuda12.3.deb \
gdrcopy_2.4.4_amd64.Ubuntu22_04.deb
popd
sudo ./insmod.sh # Load kernel modules on the bare-metal system
```
#### Container environment notes
For containerized environments:
1. Host: keep kernel modules loaded (`gdrdrv`)
2. Container: install DEB packages *without* rebuilding modules:
```bash
sudo dpkg -i gdrcopy_2.4.4_amd64.Ubuntu22_04.deb \
libgdrapi_2.4.4_amd64.Ubuntu22_04.deb \
gdrcopy-tests_2.4.4_amd64.Ubuntu22_04+cuda12.3.deb
```
#### Verification
```bash
gdrcopy_copybw # Should show bandwidth test results
```
### 2. Acquiring NVSHMEM source code
### 1. Acquiring NVSHMEM source code
Download NVSHMEM v3.2.5 from the [NVIDIA NVSHMEM OPEN SOURCE PACKAGES](https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_3.2.5-1.txz).
### 3. Apply our custom patch
### 2. Apply our custom patch
Navigate to your NVSHMEM source directory and apply our provided patch:
@@ -75,7 +28,7 @@ Navigate to your NVSHMEM source directory and apply our provided patch:
git apply /path/to/deep_ep/dir/third-party/nvshmem.patch
```
### 4. Configure NVIDIA driver
### 3. Configure NVIDIA driver (required for inter-node communication)
Enable IBGDA by modifying `/etc/modprobe.d/nvidia.conf`:
@@ -92,26 +45,31 @@ sudo reboot
For more detailed configurations, please refer to the [NVSHMEM Installation Guide](https://docs.nvidia.com/nvshmem/release-notes-install-guide/install-guide/abstract.html).
### 5. Build and installation
### 4. Build and installation
The following example demonstrates building NVSHMEM with IBGDA support:
DeepEP uses NVLink for intra-node communication and IBGDA for inter-node communication. All other features are disabled to reduce dependencies.
```bash
CUDA_HOME=/path/to/cuda \
GDRCOPY_HOME=/path/to/gdrcopy \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/path/to/your/dir/to/install
export CUDA_HOME=/path/to/cuda
# disable all features except IBGDA
export NVSHMEM_IBGDA_SUPPORT=1
cd build
make -j$(nproc)
make install
export NVSHMEM_SHMEM_SUPPORT=0
export NVSHMEM_UCX_SUPPORT=0
export NVSHMEM_USE_NCCL=0
export NVSHMEM_PMIX_SUPPORT=0
export NVSHMEM_TIMEOUT_DEVICE_POLLING=0
export NVSHMEM_USE_GDRCOPY=0
export NVSHMEM_IBRC_SUPPORT=0
export NVSHMEM_BUILD_TESTS=0
export NVSHMEM_BUILD_EXAMPLES=0
export NVSHMEM_MPI_SUPPORT=0
export NVSHMEM_BUILD_HYDRA_LAUNCHER=0
export NVSHMEM_BUILD_TXZ_PACKAGE=0
cmake -G Ninja -S . -B build -DCMAKE_INSTALL_PREFIX=/path/to/your/dir/to/install
cmake --build build/ --target install
```
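Because the build above is driven by a long list of environment variables, a small pre-flight check can catch a stale value (for example, `NVSHMEM_USE_GDRCOPY=1` left over from an earlier build) before CMake runs. The following is only a minimal sketch and is not part of DeepEP or NVSHMEM: `check_nvshmem_env` is a hypothetical helper name, and it inspects just the feature switches from the build step, not the `NVSHMEM_BUILD_*` options.

```bash
# Hypothetical helper (assumption, not shipped with DeepEP or NVSHMEM).
# Returns 0 when the environment matches the IBGDA-only configuration,
# and 1 otherwise, printing one message per mismatch to stderr.
# Note: it uses the plain shell variables status, var, and value.
check_nvshmem_env() {
    status=0
    # IBGDA is the only feature that should be enabled.
    if [ "${NVSHMEM_IBGDA_SUPPORT:-}" != "1" ]; then
        echo "NVSHMEM_IBGDA_SUPPORT should be 1" >&2
        status=1
    fi
    # Every other feature switch from the build step should be 0.
    for var in NVSHMEM_SHMEM_SUPPORT NVSHMEM_UCX_SUPPORT NVSHMEM_USE_NCCL \
               NVSHMEM_PMIX_SUPPORT NVSHMEM_USE_GDRCOPY NVSHMEM_IBRC_SUPPORT \
               NVSHMEM_MPI_SUPPORT; do
        # Look up the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ "$value" != "0" ]; then
            echo "$var should be 0 (current value: '$value')" >&2
            status=1
        fi
    done
    return "$status"
}
```

One way to use it is to call `check_nvshmem_env` after the `export` lines and run `cmake` only if it returns success.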
## Post-installation configuration


@@ -435,3 +435,40 @@ index c89f408..f99018a 100644
return NVSHMEMX_ERROR_INTERNAL;
}
From 099f608fcd9a1d34c866ad75d0af5d02d2020374 Mon Sep 17 00:00:00 2001
From: Kaichao You <youkaichao@gmail.com>
Date: Tue, 10 Jun 2025 00:35:03 -0700
Subject: [PATCH] remove gdrcopy dependency
---
src/modules/transport/ibgda/ibgda.cpp | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/src/modules/transport/ibgda/ibgda.cpp b/src/modules/transport/ibgda/ibgda.cpp
index ef325cd..16ee09c 100644
--- a/src/modules/transport/ibgda/ibgda.cpp
+++ b/src/modules/transport/ibgda/ibgda.cpp
@@ -406,6 +406,7 @@ static size_t ibgda_get_host_page_size() {
return host_page_size;
}
+#ifdef NVSHMEM_USE_GDRCOPY
int nvshmemt_ibgda_progress(nvshmem_transport_t t) {
nvshmemt_ibgda_state_t *ibgda_state = (nvshmemt_ibgda_state_t *)t->state;
int n_devs_selected = ibgda_state->n_devs_selected;
@@ -459,6 +460,11 @@ int nvshmemt_ibgda_progress(nvshmem_transport_t t) {
}
return 0;
}
+#else
+int nvshmemt_ibgda_progress(nvshmem_transport_t t) {
+ return NVSHMEMX_ERROR_NOT_SUPPORTED;
+}
+#endif
int nvshmemt_ibgda_show_info(struct nvshmem_transport *transport, int style) {
NVSHMEMI_ERROR_PRINT("ibgda show info not implemented");
--
2.34.1
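
To confirm that the guard introduced by this patch is present in an NVSHMEM source tree, a plain text search is enough. This is a minimal sketch under stated assumptions: `has_gdrcopy_guard` is a hypothetical helper name, and it only checks that the `#ifdef` line exists in the given file, not that the build actually honors it.

```bash
# Hypothetical helper (assumption, not part of DeepEP or NVSHMEM).
# Succeeds when the file passed as $1 contains a line starting with
# the NVSHMEM_USE_GDRCOPY guard that the patch adds.
has_gdrcopy_guard() {
    grep -q '^#ifdef NVSHMEM_USE_GDRCOPY' "$1"
}
```

From the NVSHMEM source directory, `has_gdrcopy_guard src/modules/transport/ibgda/ibgda.cpp && echo "guard present"` would report whether the patched file carries the guard.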