I am attempting to study and profile nvc++ performance using Nvidia’s port of LULESH as described in this Nvidia blog post. One of my tests was to understand how nvc++ could be leveraged for multi-GPU applications. I noticed that the baseline LULESH includes MPI functionality, but in Nvidia’s nvc++ port that MPI functionality was explicitly removed. As a quick-and-dirty workaround, I threw together a bash script that launches N processes, all simultaneously running the GPU-accelerated LULESH executable produced by nvc++:
#!/bin/bash
# Launch N copies of the stdpar LULESH executable against the same GPU, then wait for all of them.
NUM_JOBS=$1
JOB="LULESH-2.0.2-dev/stdpar/build/lulesh2.0 -s 10"

echo "Launching $NUM_JOBS jobs.."
for (( i=1; i<=NUM_JOBS; i++ ))
do
    echo "Launching LULESH process # $i.."
    $JOB > "log_$i.txt" &        # run in the background, one log file per process
    pids[i]=$!                   # remember the PID so we can wait on it below
done

echo "Waiting for processes to complete..."
for pid in "${pids[@]}"
do
    echo "Waiting on LULESH pid $pid.."
    wait "$pid"
done
echo "Done."
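I invoke it like this ("launch_lulesh.sh" is just what I happened to call the script locally):

./launch_lulesh.sh 4        # launches 4 concurrent LULESH processes on the same GPU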
Being familiar with CUDA and how concurrency works on a GPU, I expected the work from each process sharing the same GPU to be serialized, since each process has its own context. That should produce fairly predictable execution times: if we use P processes and a single process takes T seconds, it is safe to assume we would wait roughly P * T seconds for all processes to complete (disregarding smaller expected overheads such as context switching). So, after measuring 3.24 seconds for a single process, here is what I would expect:
1 process: 3.24 seconds
2 processes: 6.48 seconds (3.24 seconds x 2 serialized processes)
3 processes: 9.72 seconds (3.24 seconds x 3 serialized processes)
4 processes: 12.96 seconds (3.24 seconds x 4 serialized processes)
16 processes: 51.84 seconds (3.24 seconds x 16 serialized processes)
64 processes: 207.36 seconds (3.24 seconds x 64 serialized processes)
However, here are the execution times I observed (measured with the simple wrapper sketched after this list):
1 process: 3.24 seconds
2 processes: 52.37 seconds (~8.08 times slower than expected)
3 processes: 83.74 seconds (~8.62 times slower than expected)
4 processes: 109.51 seconds (~8.45 times slower than expected)
16 processes: 442.68 seconds (~8.54 times slower than expected)
64 processes: 1864.25 seconds (~8.99 times slower than expected)
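For reference, here is roughly how I collect the wall time and slowdown factor for each run. This is only a sketch; the wrapper itself, the launch_lulesh.sh name, and the hard-coded 3.24-second single-process time are my own local choices, not part of Nvidia’s port:

#!/bin/bash
# Sketch of my timing wrapper: time one full run of the launcher and compare it
# against the "perfectly serialized" expectation of N * single-process time.
N=$1
SINGLE=3.24                                   # measured single-process time in seconds
start=$(date +%s.%N)
./launch_lulesh.sh "$N"                       # the launcher script shown above
end=$(date +%s.%N)
elapsed=$(echo "$end - $start" | bc -l)
expected=$(echo "$N * $SINGLE" | bc -l)
slowdown=$(echo "$elapsed / $expected" | bc -l)
printf "%d processes: %.2f s observed, %.2f s expected, %.2fx slower than expected\n" \
       "$N" "$elapsed" "$expected" "$slowdown"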
To me, this looks like some kind of ~8x overhead that we incur the moment multiple processes run the nvc++-built executable, and that I have never observed before with CUDA and nvcc. The overhead appears as soon as we go to two processes and stays relatively constant at around ~8x as the process (and context) count increases. I have observed this behavior on multiple servers, and I am not convinced it is a hardware or configuration issue, because when I run the same test with my own toy programs I do not see the slowdown occur. Nvidia’s port of LULESH also includes a CUDA/nvcc version that does not show this ~8x slowdown when run through the same multiprocess test with the same script (the one-line change I make for that comparison is shown after this paragraph). It is important for me to understand exactly what is causing this before we integrate nvc++ into our own large and complex codebases. Please advise.
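In case it matters, the CUDA/nvcc comparison uses the exact same launcher and timing wrapper; the only change is the binary that JOB points at. The path below is from my local checkout and may differ on other systems:

# Swap the stdpar build for the CUDA/nvcc build in the launcher script.
# (Path reflects my local build layout.)
JOB="LULESH-2.0.2-dev/cuda/build/lulesh2.0 -s 10"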
System specs under test (server #1):
RHEL 7.9
NV HPC SDK 2022_2211
A100
System specs under test (server #2):
RHEL 8.3
NV HPC SDK 2022_2211
V100
Nvidia’s LULESH code referenced in the original blog post: LULESH/stdpar at 2.0.2-dev · LLNL/LULESH · GitHub