Add some docs

Chenggang Zhao 2025-03-04 10:19:42 +08:00
parent c5b4040502
commit 2a3cac903a
2 changed files with 4 additions and 1 deletions


@ -282,6 +282,10 @@ For two micro-batch overlapping, you can refer to the following figure. With our
## Notices
#### Easier potential overall design
Current DeepEP implementation uses queues for communication buffers which saves memory but introduces complexity and potential deadlocks. If you're implementing your own version based on DeepEP, consider using fixed-size buffers allocated to maximum capacity for simplicity and better performance. For a detailed discussion of this alternative approach, see https://github.com/deepseek-ai/DeepEP/issues/39.
#### Undefined-behavior PTX usage
- For extreme performance, we discover and use an undefined-behavior PTX usage: using read-only PTX `ld.global.nc.L1::no_allocate.L2::256B` to **read volatile data**. The PTX modifier `.nc` indicates that a non-coherent cache is used. But the correctness is tested to be guaranteed with `.L1::no_allocate` on Hopper architectures, and performance will be much better. The reason we guess may be: the non-coherent cache is unified with L1, and the L1 modifier is not just a hint but a strong option, so that the correctness can be guaranteed by no dirty data in L1.
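
The fixed-size buffer alternative mentioned in the "Easier potential overall design" note can be illustrated with a minimal sketch. This is not DeepEP code: the class `FixedBufferLayout`, its fields, and the example sizes are hypothetical names and values chosen for illustration. It only shows the sizing arithmetic, assuming one region per source rank sized for the worst case, so that every token has a fixed offset and no queue head/tail bookkeeping is needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedBufferLayout:
    """Hypothetical layout: one fixed-size region per source rank."""

    num_ranks: int            # peers that may send to this rank (assumption)
    max_tokens_per_rank: int  # worst-case tokens any single peer may send
    token_bytes: int          # bytes per token (hidden size * element size)

    @property
    def region_bytes(self) -> int:
        # Space reserved for one source rank, sized for the worst case.
        return self.max_tokens_per_rank * self.token_bytes

    @property
    def total_bytes(self) -> int:
        # Whole buffer: every rank gets a full region up front.
        return self.num_ranks * self.region_bytes

    def offset(self, src_rank: int, token_idx: int) -> int:
        # Fixed byte offset of a token; no queue state is consulted.
        if not 0 <= src_rank < self.num_ranks:
            raise IndexError("src_rank out of range")
        if not 0 <= token_idx < self.max_tokens_per_rank:
            raise IndexError("token_idx out of range")
        return src_rank * self.region_bytes + token_idx * self.token_bytes


# Example values are illustrative only.
layout = FixedBufferLayout(num_ranks=8, max_tokens_per_rank=128, token_bytes=7168 * 2)
print(layout.total_bytes, layout.offset(1, 0))
```

The trade-off is the one the note describes: memory is reserved for the maximum capacity whether or not it is used, in exchange for simpler addressing.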


@ -44,7 +44,6 @@ class Buffer:
to the number of local experts.
"""
# TODO: argument docs
# Initialize the CPP runtime
self.rank = group.rank()
self.group_size = group.size()