mirror of
https://github.com/deepseek-ai/DeepEP
synced 2025-05-06 21:14:24 +00:00
Add some docs
parent c5b4040502
commit 2a3cac903a
@@ -282,6 +282,10 @@ For two micro-batch overlapping, you can refer to the following figure. With our
## Notices
#### Easier potential overall design
The current DeepEP implementation uses queues for its communication buffers, which saves memory but introduces complexity and potential deadlocks. If you are implementing your own version based on DeepEP, consider using fixed-size buffers allocated to maximum capacity instead, for simplicity and better performance. For a detailed discussion of this alternative approach, see https://github.com/deepseek-ai/DeepEP/issues/39.
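
As a rough illustration of the fixed-size alternative, the sketch below sizes one receive region per source rank for the worst case and computes slot offsets directly, so no queue head/tail bookkeeping is needed. It is not DeepEP code: the class `FixedBufferLayout`, its fields, and the example sizes are all hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedBufferLayout:
    """Hypothetical layout: one fixed region per source rank, sized for the worst case."""

    num_ranks: int            # assumed number of peers sending to this rank
    max_tokens_per_rank: int  # assumed upper bound on tokens from any one peer
    bytes_per_token: int      # assumed size of one token's payload

    @property
    def region_bytes(self) -> int:
        # Every source rank gets the same worst-case region.
        return self.max_tokens_per_rank * self.bytes_per_token

    @property
    def total_bytes(self) -> int:
        # Memory cost of allocating to maximum capacity up front.
        return self.num_ranks * self.region_bytes

    def slot_offset(self, src_rank: int, token_idx: int) -> int:
        # Offsets are a pure function of (src_rank, token_idx):
        # a sender never waits for the receiver to free queue space.
        if not 0 <= src_rank < self.num_ranks:
            raise IndexError("src_rank out of range")
        if not 0 <= token_idx < self.max_tokens_per_rank:
            raise IndexError("token_idx exceeds the fixed capacity")
        return src_rank * self.region_bytes + token_idx * self.bytes_per_token


# Illustrative numbers only.
layout = FixedBufferLayout(num_ranks=8, max_tokens_per_rank=128, bytes_per_token=7168)
```

The trade-off is visible in `total_bytes`: memory grows with the worst case rather than with actual traffic, which is the cost paid for removing the queue logic.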
#### Undefined-behavior PTX usage
- For extreme performance, we discovered and use a PTX pattern whose behavior is undefined: the read-only instruction `ld.global.nc.L1::no_allocate.L2::256B` is used to **read volatile data**. The PTX modifier `.nc` indicates that a non-coherent cache is used. However, our tests show that results are correct with `.L1::no_allocate` on Hopper architectures, and performance is much better. Our guess at the reason: the non-coherent cache is unified with L1, and the L1 modifier is not just a hint but a strong option, so correctness is guaranteed because no dirty data remains in L1.
@@ -44,7 +44,6 @@ class Buffer:
to the number of local experts.
"""
# TODO: argument docs
# Initialize the CPP runtime
self.rank = group.rank()
self.group_size = group.size()