I am running NVSHMEM 3.3.9's perftest shmem_g_bw in an intranode scenario with all default parameters. The maximum bandwidth is about 50 GB/s; however, if I add -d double, it reaches about 90 GB/s. Why?
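For reference, my understanding is that the g test issues one element-sized get per thread per loop iteration, so the element type sets the bytes moved per remote operation. Below is a rough sketch of that pattern (not the actual perftest source; DATATYPE/NVSHMEM_G are illustrative stand-ins for how -d selects the typed nvshmem_*_g call):

```cuda
#include <nvshmem.h>

// Illustrative stand-ins: with "-d int" think
//   #define DATATYPE  int
//   #define NVSHMEM_G nvshmem_int_g
// and with "-d double"
//   #define DATATYPE  double
//   #define NVSHMEM_G nvshmem_double_g
#define DATATYPE double
#define NVSHMEM_G nvshmem_double_g

__global__ void g_bw_kernel(DATATYPE *local, DATATYPE *remote, size_t nelems, int peer) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < nelems; i += stride) {
        // Each iteration issues one element-sized get: sizeof(DATATYPE) bytes per
        // remote operation, so the same total transfer takes half as many get
        // operations with double (8 B) as with int (4 B).
        local[i] = NVSHMEM_G(remote + i, peer);
    }
}
```

If that picture is right, the bandwidth difference would come from the per-operation payload rather than from the total message size.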
$ srun --partition=debug --nodes=1 --ntasks-per-node=2 --gpus-per-task=1 --mpi=pmix --environment=cscs-nv-hpc-bench -A a-g200 bash -c 'NVSHMEM_BOOTSTRAP_PMI=PMIX /nvshmem_src/build/perftest/device/pt-to-pt/shmem_g_bw -e 32M'
Runtime options after parsing command line arguments
min_size: 4, max_size: 33554432, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, tpb_sweep: 0, cta_sweep: 0, size_per_thread: 4096, mem_handle_type: 0, use_egm: 0
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 1 mype_node: 1 device name: NVIDIA GH200 120GB bus id: 1
Runtime options after parsing command line arguments
min_size: 4, max_size: 33554432, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, tpb_sweep: 0, cta_sweep: 0, size_per_thread: 4096, mem_handle_type: 0, use_egm: 0
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA GH200 120GB bus id: 1
#shmem_g_bw
size (Bytes) scope BW (GB/sec) CTAs thread/CTA
4 None 0.001476 32 256
8 None 0.003243 32 256
16 None 0.006849 32 256
32 None 0.013532 32 256
64 None 0.027322 32 256
128 None 0.054348 32 256
256 None 0.109589 32 256
512 None 0.216802 32 256
1024 None 0.435374 32 256
2048 None 0.881543 32 256
4096 None 1.780250 32 256
8192 None 3.473541 32 256
16384 None 6.580977 32 256
32768 None 12.784020 32 256
65536 None 20.500500 32 256
131072 None 28.804501 32 256
262144 None 36.555107 32 256
524288 None 42.270382 32 256
1048576 None 46.459663 32 256
2097152 None 48.534397 32 256
4194304 None 49.744579 32 256
8388608 None 50.327141 32 256
16777216 None 50.745079 32 256
33554432 None 50.870136 32 256
[clariden][yijunma@clariden-ln001 build]$ srun --partition=debug --nodes=1 --ntasks-per-node=2 --gpus-per-task=1 --mpi=pmix --environment=cscs-nv-hpc-bench -A a-g200 bash -c 'NVSHMEM_BOOTSTRAP_PMI=PMIX /nvshmem_src/build/perftest/device/pt-to-pt/shmem_g_bw -d double -e 32M'
Runtime options after parsing command line arguments
min_size: 4, max_size: 33554432, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: double, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, tpb_sweep: 0, cta_sweep: 0, size_per_thread: 4096, mem_handle_type: 0, use_egm: 0
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 1 mype_node: 1 device name: NVIDIA GH200 120GB bus id: 1
Runtime options after parsing command line arguments
min_size: 4, max_size: 33554432, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: double, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, tpb_sweep: 0, cta_sweep: 0, size_per_thread: 4096, mem_handle_type: 0, use_egm: 0
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA GH200 120GB bus id: 1
#shmem_g_bw
size (Bytes) scope BW (GB/sec) CTAs thread/CTA
4 None 0.002013 32 256
8 None 0.003342 32 256
16 None 0.006435 32 256
32 None 0.014225 32 256
64 None 0.028409 32 256
128 None 0.056577 32 256
256 None 0.113960 32 256
512 None 0.224090 32 256
1024 None 0.451977 32 256
2048 None 0.883978 32 256
4096 None 1.777778 32 256
8192 None 3.482993 32 256
16384 None 7.042641 32 256
32768 None 12.816021 32 256
65536 None 23.167421 32 256
131072 None 37.509155 32 256
262144 None 53.160290 32 256
524288 None 67.451622 32 256
1048576 None 80.294044 32 256
2097152 None 86.710770 32 256
4194304 None 90.682159 32 256
8388608 None 92.981941 32 256
16777216 None 93.923073 32 256
33554432 None 94.402519 32 256