Add some docs

2025-06-26 18:28:11 +00:00 · 2025-03-04 10:19:42 +08:00 · 2a3cac903a
commit 2a3cac903a
parent c5b4040502
2 changed files with 4 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -282,6 +282,10 @@ For two micro-batch overlapping, you can refer to the following figure. With our
 ## Notices
 #### Easier potential overall design
 The current DeepEP implementation uses queues for communication buffers, which saves memory but introduces complexity and potential deadlocks. If you're implementing your own version based on DeepEP, consider using fixed-size buffers allocated to maximum capacity for simplicity and better performance. For a detailed discussion of this alternative approach, see https://github.com/deepseek-ai/DeepEP/issues/39.
 #### Undefined-behavior PTX usage
 - For extreme performance, we discovered and use a PTX pattern whose behavior is undefined: using the read-only PTX instruction `ld.global.nc.L1::no_allocate.L2::256B` to **read volatile data**. The PTX modifier `.nc` indicates that a non-coherent cache is used. However, in our tests on Hopper architectures the results are correct when `.L1::no_allocate` is used, and performance is much better. Our guess at the reason: the non-coherent cache is unified with L1, and the L1 modifier is not just a hint but a strong option, so correctness is guaranteed because no dirty data remains in L1.
--- a/deep_ep/buffer.py
+++ b/deep_ep/buffer.py
@@ -44,7 +44,6 @@ class Buffer:
                to the number of local experts.
        """
        # TODO: argument docs
        # Initialize the CPP runtime
        self.rank = group.rank()
        self.group_size = group.size()
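
Note on the README hunk above: the suggestion to use "fixed-size buffers allocated to maximum capacity" instead of queues can be illustrated with a small sizing sketch. This is not DeepEP code. `FixedBufferPlan`, its fields, and the example numbers are all hypothetical, chosen only to show why a statically partitioned buffer needs no head/tail bookkeeping: every (source rank, token) pair maps to one reserved slot, so a sender never waits for space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedBufferPlan:
    """Hypothetical worst-case layout for a receive buffer (not a DeepEP API)."""

    num_ranks: int            # number of peers that may send to this rank
    max_tokens_per_rank: int  # assumed upper bound on tokens from one peer
    hidden: int               # elements per token
    bytes_per_element: int = 2  # e.g. a 16-bit element type (assumption)

    @property
    def slot_bytes(self) -> int:
        # One slot holds exactly one token.
        return self.hidden * self.bytes_per_element

    @property
    def total_bytes(self) -> int:
        # Every peer gets room for its worst case, so the buffer can never fill up.
        return self.num_ranks * self.max_tokens_per_rank * self.slot_bytes

    def slot_offset(self, src_rank: int, token_idx: int) -> int:
        # The offset depends only on (src_rank, token_idx): no queue pointers,
        # and no waiting on the receiver to free space.
        if not 0 <= src_rank < self.num_ranks:
            raise IndexError("src_rank out of range")
        if not 0 <= token_idx < self.max_tokens_per_rank:
            raise IndexError("token_idx exceeds the reserved capacity")
        return (src_rank * self.max_tokens_per_rank + token_idx) * self.slot_bytes


# Illustrative numbers only.
plan = FixedBufferPlan(num_ranks=8, max_tokens_per_rank=128, hidden=7168)
```

The trade-off matches the one the README text names: `total_bytes` grows with the worst case rather than the typical load, so memory is spent to buy simplicity.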