Hello!
I have noticed an anomaly in the broadcast algorithm selection code shipped in NVSHMEM 2.11.0-5. The section of interest, from nvshmem_src_2.11.0-5/src/include/device/coll/broadcast.cuh, is shown below:
//...
int bcast_algo = nvshmemi_device_state_d.gpu_coll_env_params_var.bcast_algo;
switch (bcast_algo) {
    case 0:
        if (NVSHMEMI_BCAST_SYNC_SIZE * sizeof(long) >= (nelems * sizeof(T) * 2) &&
            sizeof(T) >= sizeof(uint32_t) && nelems % 2 == 0 &&
            nelems * sizeof(T) <= 16348) { /* LL algos */
            if (nvshmem_team_n_pes(team) > 32 &&
                nvshmemi_device_state_d.pe_dist ==
                    NVSHMEMI_PE_DIST_BLOCK) { /* hierarchical topo-aware */
                bcast_algo = 2;
            } else
                bcast_algo = 3;
        } else /* non-LL algorithm */
            bcast_algo = 4;
        break;
    case 1: /* Brutefoce algorithm: send one to all followed by barrier */
        break;
//...
Questions
- Is there a reason why the code checks <= 16348 and not <= 16384, since 16k = 16384? I ask because, as the plots below show, performance degrades noticeably at "16k", presumably because the less-performant algorithm is being used.
- Across the board, BCAST_ALGO=3 performs well for <= 64k. Why, then, do we not use 64k as the cutoff instead?
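To make the first question concrete, here is a minimal host-side sketch of the `case 0` branch quoted above. It is not NVSHMEM code: `select_bcast_algo`, `sync_bytes`, and `ll_cutoff_bytes` are hypothetical names I introduced, and `sync_bytes` stands in for `NVSHMEMI_BCAST_SYNC_SIZE * sizeof(long)`, whose actual value I have not checked.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side mirror of the `case 0` selection logic.
// sync_bytes      : assumed stand-in for NVSHMEMI_BCAST_SYNC_SIZE * sizeof(long)
// ll_cutoff_bytes : the byte threshold under discussion (16348 in the shipped code)
constexpr int select_bcast_algo(std::size_t nelems, std::size_t elem_size,
                                std::size_t sync_bytes, int n_pes,
                                bool block_dist, std::size_t ll_cutoff_bytes) {
    const std::size_t bytes = nelems * elem_size;
    const bool ll_eligible = sync_bytes >= bytes * 2 &&
                             elem_size >= sizeof(std::uint32_t) &&
                             nelems % 2 == 0 &&
                             bytes <= ll_cutoff_bytes;
    if (!ll_eligible) {
        return 4;  // non-LL algorithm
    }
    // LL algorithms: hierarchical topo-aware (2) for large block-distributed teams, else 3
    return (n_pes > 32 && block_dist) ? 2 : 3;
}
```

Assuming `sync_bytes` is not the limiting factor (at least 32768), a 16384-byte broadcast of 4-byte elements (`nelems = 4096`) on 8 PEs maps to algorithm 4 with the cutoff at 16348, but to algorithm 3 with the cutoff at 16384. A 16344-byte message (`nelems = 4086`) maps to algorithm 3 in both cases, which would be consistent with the drop I see exactly at "16k".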
Benchmarks
I ran all of the benchmarks below using perf_test on Perlmutter, both across GPU nodes and within a single GPU node. yx4 means y GPU nodes, each comprising 4 A100 80 GB GPUs, so the world size is y x 4 GPUs.