Broadcast Algorithm Detection Anomaly

Hello!

I have noticed an anomaly in the broadcast algorithm detection code shipped in NVSHMEM 2.11.0-5. The relevant section is shown below and can be found in nvshmem_src_2.11.0-5/src/include/device/coll/broadcast.cuh:

//...
    int bcast_algo = nvshmemi_device_state_d.gpu_coll_env_params_var.bcast_algo;
    switch (bcast_algo) {
        case 0:
            if (NVSHMEMI_BCAST_SYNC_SIZE * sizeof(long) >= (nelems * sizeof(T) * 2) &&
                sizeof(T) >= sizeof(uint32_t) && nelems % 2 == 0 &&
                nelems * sizeof(T) <= 16348) { /* LL algos */
                if (nvshmem_team_n_pes(team) > 32 &&
                    nvshmemi_device_state_d.pe_dist ==
                        NVSHMEMI_PE_DIST_BLOCK) { /* hierarchical topo-aware */
                    bcast_algo = 2;
                } else
                    bcast_algo = 3;
            } else /* non-LL algorithm */
                bcast_algo = 4;
            break;
        case 1: /* Brutefoce algorithm: send one to all followed by barrier */
            break;
//...

Questions

  1. Is there a reason why the code checks <=16348 and not <=16384, since 16k = 16384? I ask because, as the plots below show, performance degrades noticeably at “16k”, presumably because the less performant algorithm is being used.
  2. Across the board, BCAST_ALGO=3 performs well for message sizes <= 64k. Why not use 64k as the cutoff instead?
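
To make question 1 concrete, here is a minimal host-side sketch of the case 0 selection logic quoted above. It is not NVSHMEM code: select_bcast_algo, sync_bytes, block_dist and ll_threshold_bytes are hypothetical names I introduced for illustration, and sync_bytes simply stands in for NVSHMEMI_BCAST_SYNC_SIZE * sizeof(long), whose actual value I am not assuming here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side mirror of the "case 0" heuristic, for illustration only.
// sync_bytes         : stand-in for NVSHMEMI_BCAST_SYNC_SIZE * sizeof(long)
// block_dist         : stand-in for pe_dist == NVSHMEMI_PE_DIST_BLOCK
// ll_threshold_bytes : the size cutoff under discussion (16348 as shipped)
inline int select_bcast_algo(std::size_t nelems, std::size_t elem_size,
                             std::size_t sync_bytes, int n_pes,
                             bool block_dist, std::size_t ll_threshold_bytes) {
    const std::size_t bytes = nelems * elem_size;

    // Same four conditions as the quoted snippet, with the cutoff parameterized.
    const bool ll_ok = sync_bytes >= bytes * 2 &&
                       elem_size >= sizeof(std::uint32_t) &&
                       nelems % 2 == 0 &&
                       bytes <= ll_threshold_bytes;

    if (!ll_ok) {
        return 4;  // non-LL algorithm
    }
    // LL algorithms: hierarchical topo-aware (2) for large block-distributed teams, else 3.
    return (n_pes > 32 && block_dist) ? 2 : 3;
}
```

Under these assumptions, a broadcast of 4096 four-byte elements (exactly 16384 bytes) with a sufficiently large sync buffer selects algorithm 4 when the cutoff is 16348, and algorithm 3 when the cutoff is 16384, which would match the drop I see at “16k”.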

Benchmarks

I ran all of the benchmarks below using perf_test on Perlmutter, both across GPU nodes and within a single GPU node.
yx4 means y GPU nodes, each with 4 A100 80 GB GPUs, so the world size is y*4 GPUs.

Broadcast with 1x4 A100 80 GB

Broadcast with 2x4 A100 80 GB



  1. That is a typo. We will fix it. Thank you for pointing it out.
  2. The 16K value was a heuristic based on some of our runs. For now, the user is expected to adapt it to the platform they are running on. The threshold will vary a lot from one network to another because the bandwidth varies. Hopefully, in the near future, we will have a way to automate that.

Thanks @alanger. Currently, NVSHMEM does not expose an environment variable to change the threshold, only the algorithm, so the only option is for the user to change the threshold in the source code. Looking forward to future work on automating this!
