Nvc++ 8x slower using multiple GPUs via multiple processes with LULESH

I am attempting to study and profile nvc++ performance using Nvidia’s port of LULESH as described in this Nvidia blog post. One of my tests was to understand how nvc++ could be leveraged for multi-GPU applications. I noticed that the baseline LULESH includes MPI functionality, but in Nvidia’s nvc++ port that MPI functionality was explicitly removed. As a quick and dirty workaround, I threw together a bash script that launches N processes, each simultaneously running the GPU-accelerated LULESH executable produced by nvc++:

#!/bin/bash
# Launch N concurrent instances of the stdpar LULESH binary and wait for them all.

NUM_JOBS=$1
JOB="LULESH-2.0.2-dev/stdpar/build/lulesh2.0 -s 10"

echo "Launching $NUM_JOBS jobs.."
for (( i=1; i<=NUM_JOBS; i++ ))
do
    echo "Launching LULESH process # $i.."
    $JOB > "log_$i.txt" &   # run each instance in the background, logging to its own file
    pids[i]=$!              # remember the PID so we can wait on it later
done

echo "Waiting for processes to complete..."
for pid in "${pids[@]}"
do
    echo "Waiting on LULESH pid $pid.."
    wait "$pid"
done

echo "Done."

Being familiar with CUDA and how concurrency works across processes on a GPU, I expected the work from each process sharing the GPU to be serialized, since each process has its own context. That should produce fairly predictable execution times: if we use P processes and the execution time of one process is T, it’s safe to assume we’d wait about P * T seconds for all processes to complete (disregarding smaller expected overhead from things like context switching). So, after measuring the execution time of a single process at 3.24 seconds, here’s what I would expect:

1 process: 3.24 seconds
2 processes: 6.48 seconds (3.24 seconds x 2 serialized processes)
3 processes: 9.72 seconds (3.24 seconds x 3 serialized processes)
4 processes: 12.96 seconds (3.24 seconds x 4 serialized processes)
16 processes: 51.84 seconds  (3.24 seconds x 16 serialized processes)
64 processes: 207.36 seconds (3.24 seconds x 64 serialized processes)

However, here are the execution times I observed:

1 process    : 3.24 seconds
2 processes: 52.37 seconds (~8.08 times slower than expected)
3 processes: 83.74 seconds (~8.62 times slower than expected)
4 processes: 109.51 seconds (~8.45 times slower than expected)
16 processes: 442.68 seconds (~8.54 times slower than expected)
64 processes: 1864.25 seconds (~8.99 times slower than expected)

To me, this looks like some ~8x overhead that we incur the moment we use multiple processes with nvc++, something I have never observed before with CUDA and nvcc. The overhead appears as soon as we go to two processes and stays relatively constant at around ~8x as the process (and context) count increases. I have observed this behavior on multiple servers, so I am not convinced it is a hardware or configuration issue; I have also run this same test with my own toy programs and do not see the slowdown there. Nvidia’s port of LULESH also includes a CUDA/nvcc version, and it does not show this 8x slowdown when run through the same multiprocess script. It is important for me to understand exactly what is causing this before we integrate nvc++ into our own large and complex codebases. Please advise.

System specs under test (server #1):
RHEL 7.9
NV HPC SDK 2022_2211
A100

System specs under test (server #2):
RHEL 8.3
NV HPC SDK 2022_2211
V100

Nvidia’s LULESH code referenced in the original blog post: LULESH/stdpar at 2.0.2-dev · LLNL/LULESH · GitHub

I don’t believe this is correct. When processes share the GPU, their scheduling resources must be swapped on and off the GPU, which causes overhead.

Consider using MPS to help with time slicing the GPU and reducing this overhead. See: Multi-Process Service :: GPU Deployment and Management Documentation
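
A minimal sketch of enabling MPS around the launcher, assuming default pipe/log locations and reusing run_lulesh.sh as a placeholder name for the script above:

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps     # assumed location
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log  # assumed location
nvidia-cuda-mps-control -d                         # start the MPS control daemon

./run_lulesh.sh 4                                  # placeholder name for the launcher script

echo quit | nvidia-cuda-mps-control                # shut MPS down when finished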

-Mat

Hi Mat,

Thanks for the suggestion. I believe you’re referring to the overhead of context switching, but I don’t think context switching could account for an 800% slowdown. More concretely, as I mentioned in my post, I ran this same test with the CUDA version of LULESH. If this inefficiency were related to context switching, shouldn’t we have seen it there as well? The CUDA version’s performance scaled more or less exactly as I expected, with minimal overhead as I increased the process count (slowdowns of 1.06x - 1.10x, compared with the ~8x for nvc++ mentioned above).

I took a deeper look. The profiles suggest the issue is with CUDA Unified Memory: oversubscription is causing an increase in data movement and overhead. If we ported the CUDA version to use UM, we’d likely see the same issue there.
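
For anyone wanting to reproduce this, a minimal sketch of collecting such a profile with Nsight Systems (the report name is arbitrary, and the exact summary sections vary by nsys version):

nsys profile --stats=true -o lulesh_um \
    LULESH-2.0.2-dev/stdpar/build/lulesh2.0 -s 10
# Inspect the CUDA Unified Memory / memory-operation summaries in the stats output.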

The workaround would be to add manual data management, either via CUDA or via OpenACC directives, and to add the “-gpu=nomanaged” flag so UM is not used.
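
A rough sketch of what a compile line with UM disabled might look like; the file list, C++ standard, and optimization level here are assumptions rather than the repo’s actual build settings, and the source itself would still need explicit data directives (e.g. acc enter data / exit data) around the arrays the parallel algorithms touch:

# Hypothetical build line: stdpar on the GPU, OpenACC directives enabled,
# CUDA Unified Memory disabled via -gpu=nomanaged. Adjust file names and flags
# to match the actual LULESH build.
nvc++ -O3 -std=c++17 -stdpar=gpu -acc -gpu=nomanaged -o lulesh2.0 lulesh*.cc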
