Make comment clearer about Memory Consistency and Barrier Visibility

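Memory Consistency and Barrier Visibility: Both __syncthreads() and cute::cluster_sync() serve as synchronization points. No thread proceeds past the barrier until every thread in the synchronization scope has reached it. This guarantees that all prior memory operations, including the barrier initialization, are visible to every thread in that scope. The scope is the thread block for __syncthreads() and the whole thread block cluster for cute::cluster_sync(). In the multicast case (kNumTMAMulticast > 1), cutlass::arch::fence_barrier_init() is issued before cute::cluster_sync() so that the barrier initialization is ordered ahead of the cluster-wide sync.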
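A minimal sketch of the single-CTA branch (the __syncthreads() path), assuming an sm_70 or newer GPU. It uses libcu++ `cuda::barrier<cuda::thread_scope_block>` rather than the CUTLASS/CuTe mbarrier helpers used in the kernel. This mirrors the bootstrapping pattern the CUDA C++ Programming Guide describes for `cuda::barrier`: one thread calls `init()`, and a CTA-wide sync publishes that initialization before any thread arrives on the barrier. The kernel `neighbor_exchange`, the host helper `run_exchange`, `kMaxThreads` and `CUDA_CHECK` are hypothetical names for illustration only.

```cuda
#include <cuda/barrier>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical error-check helper for this sketch.
#define CUDA_CHECK(call)                                                   \
  do {                                                                     \
    cudaError_t err_ = (call);                                             \
    if (err_ != cudaSuccess) {                                             \
      std::fprintf(stderr, "%s failed: %s\n", #call,                       \
                   cudaGetErrorString(err_));                              \
      std::abort();                                                        \
    }                                                                      \
  } while (0)

using block_barrier = cuda::barrier<cuda::thread_scope_block>;

constexpr int kMaxThreads = 1024;  // assumption: single block, one slot per thread

__global__ void neighbor_exchange(int* out) {
  __shared__ block_barrier bar;        // barrier object lives in shared memory
  __shared__ int slots[kMaxThreads];

  // Only one thread initializes the barrier with the expected arrival count.
  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);
  }
  // CTA-wide sync: without it, other threads could arrive on a barrier
  // whose initialization they cannot yet see.
  __syncthreads();

  slots[threadIdx.x] = static_cast<int>(threadIdx.x) * 10;
  bar.arrive_and_wait();  // every thread's write to slots[] is visible after this

  unsigned neighbor = (threadIdx.x + 1) % blockDim.x;
  out[threadIdx.x] = slots[neighbor];
}

// Launches one block of n threads and returns what each thread read.
std::vector<int> run_exchange(int n) {
  assert(n > 0 && n <= kMaxThreads);
  int* d_out = nullptr;
  CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(int)));
  neighbor_exchange<<<1, n>>>(d_out);
  CUDA_CHECK(cudaGetLastError());
  std::vector<int> h_out(n);
  CUDA_CHECK(cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                        cudaMemcpyDeviceToHost));
  CUDA_CHECK(cudaFree(d_out));
  return h_out;
}
```

Calling `run_exchange(256)` should give element `i` equal to `((i + 1) % 256) * 10`. The multicast branch applies the same idea at cluster scope, with the fence plus cute::cluster_sync() taking the place of __syncthreads(). This sketch does not attempt to reproduce that path.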
A-transformer 2025-02-27 22:01:50 +04:00 committed by GitHub
parent a2e0d68eed
commit 92521df34d
GPG Key ID: B5690EEEBB952194


@@ -122,7 +122,7 @@ fp8_gemm_kernel(__nv_bfloat16* gmem_d, float* scales_b, int* grouped_layout,
 (kNumTMAMulticast > 1) ? cutlass::arch::fence_barrier_init() : void();
 }
-// Synchronize all threads to make barrier visible in normal memory model
+// Synchronize threads to ensure barrier initialization is visible to all participating threads.
 (kNumTMAMulticast > 1) ? cute::cluster_sync() : __syncthreads();
 // For pipeline unrolling