BUGS & FIXES
The MoE shuffle example fails to run because of the bugs below. Applying the fixes makes the code work.
Incorrect indexing
static __forceinline__ __device__ void _exchange_offsets(int64_t *local_expert_counts,
                                                         int64_t *symmetric_expert_counts,
                                                         int64_t *accumulated_expert_positions,
                                                         int npes) {
    const int src_rank = threadIdx.x / npes; // <- wrong, should be threadIdx.x % npes
    const int expert = threadIdx.x % npes;   // <- wrong, should be threadIdx.x / npes
Summary
Because the kernel launches npes * num_experts threads, the incorrect code above puts expert in the range [0, npes) and src_rank in [0, num_experts). The two expressions must be swapped: expert should range over [0, num_experts) and src_rank over [0, npes).
USE_CUDA_MALLOC Should always be True
This cannot be false! expandedSrcRow_gpu is the destination pointer of a host-to-device (H2D) memcpy, so it must point to memory allocated with cudaMalloc, not standard malloc.
Running the code as-is fails with a CUDA error: "an illegal memory access was encountered".
Best to remove this config altogether.
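The required allocation looks roughly like this. Only expandedSrcRow_gpu comes from the example; the element type (int), the count num_rows, and the host buffer expandedSrcRow_host are assumptions for illustration.

```cuda
// Allocate the H2D destination on the device, never with std malloc.
int *expandedSrcRow_gpu = nullptr;
cudaError_t err = cudaMalloc(&expandedSrcRow_gpu, num_rows * sizeof(int));
assert(err == cudaSuccess);

// The H2D copy now targets device memory, so no illegal memory access occurs.
err = cudaMemcpy(expandedSrcRow_gpu, expandedSrcRow_host,
                 num_rows * sizeof(int), cudaMemcpyHostToDevice);
assert(err == cudaSuccess);

// ... use expandedSrcRow_gpu in the kernel ...

cudaFree(expandedSrcRow_gpu);
```

Hardcoding the cudaMalloc path this way is what makes removing the USE_CUDA_MALLOC config safe: there is no valid configuration in which the host-allocated branch can work.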