Use TMA instead of LD/ST for intra-node normal kernels (#191)

* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
2025-06-26 18:28:11 +00:00 · 2025-06-06 15:40:17 +08:00
parent df4debe30c
commit c8dceba110
6 changed files with 230 additions and 87 deletions
--- a/csrc/kernels/configs.cuh
+++ b/csrc/kernels/configs.cuh
@@ -19,8 +19,6 @@
 #ifdef __CLION_IDE__
 #define __CUDA_ARCH__ 900 // NOLINT(*-reserved-identifier)
 #define __CUDACC_RDC__ // NOLINT(*-reserved-identifier)
-__host__ __device__ __forceinline__ void host_device_printf(const char* format, ...) { asm volatile("trap;"); }
-#define printf host_device_printf
 #endif

 // Remove Torch restrictions