Mirror of https://github.com/deepseek-ai/DeepEP (synced 2025-06-26 18:28:11 +00:00)
Use TMA instead of LD/ST for intra-node normal kernels (#191)
* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
@@ -19,8 +19,6 @@
#ifdef __CLION_IDE__
#define __CUDA_ARCH__ 900 // NOLINT(*-reserved-identifier)
#define __CUDACC_RDC__ // NOLINT(*-reserved-identifier)
__host__ __device__ __forceinline__ void host_device_printf(const char* format, ...) { asm volatile("trap;"); }
#define printf host_device_printf
#endif
// Remove Torch restrictions