Commit Graph

  • ed3444bf9b Fix transcation window. (#260) main Shangyan Zhou 2025-06-26 17:58:49 +08:00
  • 899ba80009 Fix transcation window. fix-trans-window Shangyan Zhou 2025-06-26 17:31:58 +08:00
  • 3071a2cd37 Move __syncthread and fence into barrier. fix-intra-notify Shangyan Zhou 2025-06-26 10:03:37 +08:00
  • f2c2ad5565 Modify the default num qps. revert-trans-window Shangyan Zhou 2025-06-25 18:22:16 +08:00
  • fe72093203 Use env. Shangyan Zhou 2025-06-25 18:18:12 +08:00
  • 4931324861 Support bias. (#257) Shangyan Zhou 2025-06-25 13:04:20 +08:00
  • 85adc566e2 Add get_comm_stream. (#256) Shangyan Zhou 2025-06-25 13:02:13 +08:00
  • bd429ffefc Support bias. (#257) Shangyan Zhou 2025-06-25 13:04:20 +08:00
  • b80e55e21f Add get_comm_stream. (#256) Shangyan Zhou 2025-06-25 13:02:13 +08:00
  • f1d7a7c89f Remove memory fence in NVLink barrier. Shangyan Zhou 2025-06-24 17:51:28 +08:00
  • d7d13878e0 Add transaction windows internode-tma Chenggang Zhao 2025-06-24 10:12:23 +08:00
  • 185ecf5c4a Merge remote-tracking branch 'origin/main' into internode-tma Chenggang Zhao 2025-06-24 09:29:07 +08:00
  • a15faa9ff0 Remove useless assertion Chenggang Zhao 2025-06-24 09:21:35 +08:00
  • bc118b248a Add the transaction window data structure for RDMA senders (#245) Chenggang Zhao 2025-06-24 09:12:40 +08:00
  • 9eb2f84b3e Optimize intranode combine. (#247) Shangyan Zhou 2025-06-24 09:10:23 +08:00
  • fdb41efbd3 Fix bugs trans-window Chenggang Zhao 2025-06-23 17:58:44 +08:00
  • 901cdf79be Fix bugs Chenggang Zhao 2025-06-23 17:54:57 +08:00
  • b3e39fcbbb Fix stuck Chenggang Zhao 2025-06-23 16:34:41 +08:00
  • c0da8eaba5 Add sender timeout checks Chenggang Zhao 2025-06-23 15:41:49 +08:00
  • fbcf430006 Update internode_ll.cu (#246) fzyzcjy 2025-06-23 15:18:10 +08:00
  • be96674e94 Fix several bugs Chenggang Zhao 2025-06-23 15:17:19 +08:00
  • 11053474b7 Merge remote-tracking branch 'origin/main' into trans-window Chenggang Zhao 2025-06-23 11:49:55 +08:00
  • 6a708ec14c Add fast-debugging flags Chenggang Zhao 2025-06-23 11:47:51 +08:00
  • 1c277c303e Add draft Chenggang Zhao 2025-06-23 11:45:05 +08:00
  • c95997f8c4 Update deep_ep.cpp (#242) fzyzcjy 2025-06-23 11:44:06 +08:00
  • 7b0c25f864 Support more hidden size Chenggang Zhao 2025-06-20 16:37:28 +08:00
  • a086ac5536 Use correct buffer pointers Chenggang Zhao 2025-06-20 16:25:49 +08:00
  • 782b40a8ff Add ENABLE_FAST_DEBUG Chenggang Zhao 2025-06-20 14:44:53 +08:00
  • 47dd77ab5f Add retired flag Chenggang Zhao 2025-06-20 14:35:15 +08:00
  • 74afd75df2 Fix bugs Chenggang Zhao 2025-06-20 14:27:54 +08:00
  • 371df2da52 Fix bugs Chenggang Zhao 2025-06-20 13:44:49 +08:00
  • 8da790e3f3 Fix the shifted buffer pointer Chenggang Zhao 2025-06-20 11:31:57 +08:00
  • cd5c57fb2a Fix compilation Chenggang Zhao 2025-06-20 11:15:03 +08:00
  • 49b9084268 Fix several bugs Chenggang Zhao 2025-06-20 10:57:56 +08:00
  • 177e491e92 Fix send heads Chenggang Zhao 2025-06-19 18:05:59 +08:00
  • 55bbd8caaf Add impl Chenggang Zhao 2025-06-19 17:15:43 +08:00
  • a0a6e22eff Fully remove forwarders' and NVL receivers' code Chenggang Zhao 2025-06-19 13:48:07 +08:00
  • 3a3398f686 Minor fix Chenggang Zhao 2025-06-19 10:38:42 +08:00
  • 9d4f7ef8ee Surpass type checks Chenggang Zhao 2025-06-18 16:04:42 +08:00
  • b56f7c2c8c Adjust import order Chenggang Zhao 2025-06-18 15:50:06 +08:00
  • a2d2354e1d Merge pull request #222 from deepseek-ai/set_dev_id Shangyan Zhou 2025-06-18 14:53:26 +08:00
  • cd371d31fc Move import. Shangyan Zhou 2025-06-18 14:52:04 +08:00
  • bf4a4a21d2 Set device_id to suppress pytorch warning. Shangyan Zhou 2025-06-18 14:43:38 +08:00
  • 24453275e3 Add EP_TEST_LL_COMPATIBILITY Chenggang Zhao 2025-06-18 10:59:44 +08:00
  • 77f97f79bd Fix the tail loading issue. (#219) Shangyan Zhou 2025-06-18 09:23:25 +08:00
  • dd133d39bc Fix warp synchronization. (#215) Shangyan Zhou 2025-06-16 17:05:11 +08:00
  • 8aaddf76ae Remove the low-latency usage flag (#214) Chenggang Zhao 2025-06-16 13:30:14 +08:00
  • 74f4ef7b22 Remove the low-latency usage flag remove-usage-flag Chenggang Zhao 2025-06-16 13:28:24 +08:00
  • 1b92be8a71 Add automatic warp count control for low-latency kernels (#213) Chenggang Zhao 2025-06-16 11:56:43 +08:00
  • b09308b731 More assertions maximum-sms Chenggang Zhao 2025-06-16 11:53:57 +08:00
  • 72beb15827 Add automatic warp count control for low-latency combine Chenggang Zhao 2025-06-16 11:42:04 +08:00
  • 632c81f1d7 Add automatic warp count control for low-latency dispatch Chenggang Zhao 2025-06-16 11:31:38 +08:00
  • 4e923188f7 Update intranode.cu (#210) fzyzcjy 2025-06-16 11:03:58 +08:00
  • 483f00af84 Update assertion of num_rc_per_pe. Shangyan Zhou 2025-06-13 15:16:23 +08:00
  • 05df5554ff Use one qp per sm for internode normal kernels (#181) Zhicheng Wu 2025-06-13 14:37:59 +08:00
  • 21efbe9b48 Support UE8M0 data format. (#206) Shifang Xu 2025-06-12 09:38:19 +08:00
  • 9ec061204e Use pynvml to detect NVLink connections (#205) Chenggang Zhao 2025-06-11 17:29:00 +08:00
  • b8d90fb753 Support Ampere architecture (#204) Chenggang Zhao 2025-06-11 15:48:18 +08:00
  • dd13c7145c Check the empty list Chenggang Zhao 2025-06-11 11:14:30 +08:00
  • a8299ca7c2 Support CUDA graph for intranode normal kernels (#203) Chenggang Zhao 2025-06-11 11:08:54 +08:00
  • 8da2d7b38d Fully remove barrier FIFO designs (#200) Chenggang Zhao 2025-06-10 16:23:20 +08:00
  • a16af40531 Merge pull request #201 from youkaichao/no_gdrcopy Shangyan Zhou 2025-06-10 16:11:32 +08:00
  • b9b7ce348b update readme youkaichao 2025-06-10 15:49:50 +08:00
  • 97be5a3873 update the patch youkaichao 2025-06-10 15:39:44 +08:00
  • 1157693c0c Remove useless comments Chenggang Zhao 2025-06-09 17:14:25 +08:00
  • 5a2e37fa28 Support statistics tensor for low-latency kernels (#196) Chenggang Zhao 2025-06-09 15:50:56 +08:00
  • 0d1a855d81 Add low-latency kernel PCIe usage flag (#195) Chenggang Zhao 2025-06-09 14:37:13 +08:00
  • 564e375234 Fix < PTX ISA 8.6 compatibility (#194) Chenggang Zhao 2025-06-09 10:48:42 +08:00
  • 11a0b0e1a3 Merge pull request #193 from fzyzcjy/feat/fix_mnnvl Shangyan Zhou 2025-06-08 13:05:08 +08:00
  • 4cd951700e more fzyzcjy 2025-06-07 21:39:00 +08:00
  • c8dceba110 Use TMA instead of LD/ST for intra-node normal kernels (#191) Chenggang Zhao 2025-06-06 15:40:17 +08:00
  • df4debe30c Reduce NVSHMEM gpu memory usage and disable MNNVL. (#190) Shangyan Zhou 2025-06-06 13:25:43 +08:00
  • d8dd185c68 Update README Chenggang Zhao 2025-06-05 14:41:51 +08:00
  • de8cfca3cf Update readme. Shangyan Zhou 2025-06-05 09:59:58 +08:00
  • fc48a467a7 Merge pull request #182 from wzc-wuzhicheng/fix-notify-dispatch Shangyan Zhou 2025-06-05 09:29:16 +08:00
  • d0225df27d Fix notify_dispatch: using warp 0 to issue send wzc.wuzhicheng 2025-06-03 20:20:02 +08:00
  • 9fe9021f29 Use IBGDA only (#177) Shangyan Zhou 2025-05-28 16:40:14 +08:00
  • aae9fa9a6d Allow NVLink traffic for low-latency kernels by default Chenggang Zhao 2025-05-23 20:14:50 +08:00
  • 8da1b1f81e Merge pull request #174 from deepseek-ai/p2p-refactor Shangyan Zhou 2025-05-23 11:25:38 +08:00
  • 92405ddf30 Code cleanup and bug fixed Chenggang Zhao 2025-05-23 11:14:16 +08:00
  • 68ae8b3d07 Feature: LL nvlink p2p (#173) cywork121 2025-05-23 10:37:45 +08:00
  • d5ca4495c0 Make TORCH_CUDA_ARCH_LIST as an environment variable (#167) guyueh1 2025-05-18 18:43:48 -07:00
  • bb393e7760 Merge pull request #154 from sleepcoo/support-more-hidden Chenggang Zhao 2025-05-12 16:55:09 +08:00
  • a107266a4e support hidden size 4096 sleepcoo 2025-05-12 16:32:32 +08:00
  • 05104029fd Merge pull request #151 from vicoooo26/feat/nvidia-peer-mem-detection Shangyan Zhou 2025-05-12 09:32:47 +08:00
  • f0a9f10629 Merge pull request #153 from wangfakang/opt-shuffled_dst Chenggang Zhao 2025-05-12 09:25:21 +08:00
  • 63c29d06a0 To mitigate incast congestion, shuffle the starting index of target rank for different ranks and channels wangfakang 2025-05-09 17:43:01 +08:00
  • c6051f3880 Feat: enhance nvidia peer memory detection Vico Chu 2025-05-09 17:12:07 +08:00
  • 9056a6db95 Merge pull request #142 from fzyzcjy/patch-3 Chenggang Zhao 2025-05-08 16:04:54 +08:00
  • adc6e24cb0 Update deep_ep.cpp fzyzcjy 2025-05-08 16:01:47 +08:00
  • 23ded3bd8d Update deep_ep.cpp fzyzcjy 2025-04-29 09:58:31 +08:00
  • 65e2a700f0 Merge pull request #135 from deepseek-ai/add-iw-fork Shangyan Zhou 2025-04-27 10:51:18 +08:00
  • 1a0c8f6425 Add Infrawaves' fork to README. Shangyan Zhou 2025-04-27 10:37:30 +08:00
  • 007fcfcf97 Merge pull request #130 from deepseek-ai/trmt/internode_multi_qp Chenggang Zhao 2025-04-22 13:04:42 +08:00
  • e255d57bef Use put_nbi_warp. Shangyan Zhou 2025-04-22 12:29:46 +08:00
  • 3b1045db43 Fix the performance data. Shangyan Zhou 2025-04-22 11:23:42 +08:00
  • edbb1bc3ff Several code lints Chenggang Zhao 2025-04-22 10:52:10 +08:00
  • 3e54b78fd7 Normal kernels always use IBGDA mode. Shangyan Zhou 2025-04-22 10:36:24 +08:00
  • 20b2aaaf9e Refactor some code. Shangyan Zhou 2025-04-22 10:22:30 +08:00
  • c07fdd197c Merge branch 'trmt/internode_multi_qp' of github.com:deepseek-ai/DeepEP into trmt/internode_multi_qp moningchen 2025-04-21 21:31:49 +08:00