DeepEP/csrc/kernels
Shangyan Zhou 9eb2f84b3e
Optimize intranode combine. (#247)
* Increase the test round.

* Add warp synchronization.

* Shuffle the send warps.

* Add time elapsed into bench result.
2025-06-24 09:10:23 +08:00
api.cuh Remove the low-latency usage flag (#214) 2025-06-16 13:30:14 +08:00
buffer.cuh Initial commit 2025-02-25 09:07:53 +08:00
CMakeLists.txt Support UE8M0 data format. (#206) 2025-06-12 09:38:19 +08:00
configs.cuh Support Ampere architecture (#204) 2025-06-11 15:48:18 +08:00
exception.cuh Initial commit 2025-02-25 09:07:53 +08:00
ibgda_device.cuh Fully remove barrier FIFO designs (#200) 2025-06-10 16:23:20 +08:00
internode_ll.cu Update internode_ll.cu (#246) 2025-06-23 15:18:10 +08:00
internode.cu Fix the tail loading issue. (#219) 2025-06-18 09:23:25 +08:00
intranode.cu Optimize intranode combine. (#247) 2025-06-24 09:10:23 +08:00
launch.cuh Support more hidden size 2025-06-20 16:37:28 +08:00
layout.cu Support Ampere architecture (#204) 2025-06-11 15:48:18 +08:00
runtime.cu Support Ampere architecture (#204) 2025-06-11 15:48:18 +08:00
utils.cuh Add automatic warp count control for low-latency kernels (#213) 2025-06-16 11:56:43 +08:00