DeepEP/csrc/kernels
Chenggang Zhao c8dceba110
Use TMA instead of LD/ST for intra-node normal kernels (#191)
* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
2025-06-06 15:40:17 +08:00
File              Last commit                                                    Date
api.cuh           Support zero-copy for low-latency combine                      2025-03-18 15:41:50 +08:00
buffer.cuh        Initial commit                                                 2025-02-25 09:07:53 +08:00
CMakeLists.txt    Initial commit                                                 2025-02-25 09:07:53 +08:00
configs.cuh       Use TMA instead of LD/ST for intra-node normal kernels (#191)  2025-06-06 15:40:17 +08:00
exception.cuh     Initial commit                                                 2025-02-25 09:07:53 +08:00
ibgda_device.cuh  Use IBGDA only (#177)                                          2025-05-28 16:40:14 +08:00
internode_ll.cu   Code cleanup and bug fixed                                     2025-05-23 11:14:16 +08:00
internode.cu      Fix notify_dispatch: using warp 0 to issue send                2025-06-03 20:20:02 +08:00
intranode.cu      Use TMA instead of LD/ST for intra-node normal kernels (#191)  2025-06-06 15:40:17 +08:00
launch.cuh        support hidden size 4096                                       2025-05-12 16:41:21 +08:00
runtime.cu        Use IBGDA only (#177)                                          2025-05-28 16:40:14 +08:00
utils.cuh         Use TMA instead of LD/ST for intra-node normal kernels (#191)  2025-06-06 15:40:17 +08:00