[BUGS] MoE Shuffle

BUGS & FIXES

The MoE shuffle example fails to run due to the bugs below. Applying the fixes described here makes the code work.

Incorrect indexing

static __forceinline__ __device__ void _exchange_offsets(int64_t *local_expert_counts,
                                                         int64_t *symmetric_expert_counts,
                                                         int64_t *accumulated_expert_positions,
                                                         int npes) {
    const int src_rank = threadIdx.x / npes; // <- wrong, should be threadIdx.x % npes
    const int expert = threadIdx.x % npes;   // <- wrong, should be threadIdx.x / npes

Summary

With the incorrect code above, expert ends up in the range [0, npes) while src_rank ends up in [0, num_experts), because the number of threads is npes * num_experts.

The two computations should be swapped: expert should be in [0, num_experts), while src_rank should be in [0, npes).

USE_CUDA_MALLOC Should always be True

This cannot be false! Note that expandedSrcRow_gpu is the destination pointer of a host-to-device (H2D) memcpy, so it must point to memory allocated by cudaMalloc, not standard malloc.

Running the code as-is yields a CUDA failed with "an illegal memory access was encountered" error.

It would be best to remove this config option altogether.
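A minimal sketch of the required allocation, assuming a hypothetical element count n (the pointer name mirrors the example; everything else is illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdlib>

int main() {
    const size_t n = 1024; // hypothetical element count

    // Destination of the H2D copy: must be device memory from cudaMalloc.
    // With a plain malloc'd pointer here, the example reports
    // "an illegal memory access was encountered".
    int64_t *expandedSrcRow_gpu = nullptr;
    cudaMalloc(&expandedSrcRow_gpu, n * sizeof(int64_t));

    // Source of the copy: ordinary host memory is fine.
    int64_t *host_buf = static_cast<int64_t *>(std::malloc(n * sizeof(int64_t)));

    cudaMemcpy(expandedSrcRow_gpu, host_buf, n * sizeof(int64_t),
               cudaMemcpyHostToDevice);

    std::free(host_buf);
    cudaFree(expandedSrcRow_gpu);
    return 0;
}
```

Since the destination is device-resident by construction, there is nothing for a USE_CUDA_MALLOC flag to toggle, which is why dropping the option entirely is the cleaner fix.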
