Multiple GPUs with nvc++ -stdpar

How can I use multiple GPUs installed on a system with nvc++ -stdpar?

C++17 stdpar itself only targets a single GPU. For multiple GPUs, you’ll want to use MPI, with each rank using a different GPU.

Is it possible to select a GPU per thread and not per process to benefit from the shared memory?

I assume you mean using OpenMP threads and shared memory. I have not tried using OpenMP with C++ stdpar, so I don’t know. In theory you should be able to get something to work, but in my experience with OpenACC and CUDA, using MPI to manage multiple GPUs is much easier and less error prone.

I didn’t mean OpenMP specifically, but since you’ve raised this topic: there is a documented way to select a GPU for OpenMP offload, and another for CUDA. How does cudaSetDevice interplay with other programming models that do not use CUDA directly?

I was thinking of a model that does not use compiler directives, std::thread being the most standardised, and I would like to keep hardware-specific code as small as possible. Do you think I’d better try SYCL with such requirements?

How does cudaSetDevice interplay with other programming models that do not use CUDA directly?

C++17 itself does not have a way to select a particular device, so you’d want to use cudaSetDevice in this case. Since our stdpar implementation is built on top of CUDA, the call should set the correct device.

For OpenMP with Target Offload, you’ll want to use omp_set_default_device. In our implementation, this in turn calls cudaSetDevice. Here’s an example of how I set the device with MPI+OpenMP:

  int num_devices;
  int gpuId;
  MPI_Comm shmcomm;
  int local_rank;

  // ... later, after calling MPI_Init

  // Get the local rank number for the ranks on this system
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &shmcomm);
  MPI_Comm_rank(shmcomm, &local_rank);
  // Get the number of devices on this system
  num_devices = omp_get_num_devices();
  // When targeting the host, num_devices will be 0, so add a guard
  if (num_devices > 0) {
    // Round-robin the device assignment
    gpuId = local_rank % num_devices;
    // Set the device number
    omp_set_default_device(gpuId);
  }
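
To make the round-robin assignment concrete, here is a small sketch of the mapping on its own. The helper name device_for_rank is hypothetical and is not part of MPI or OpenMP; it only mirrors the modulo and the guard used above. With 4 GPUs and 6 ranks on a node, for instance, local ranks 0 through 5 would be assigned devices 0, 1, 2, 3, 0, 1.

```cpp
#include <cassert>

// Hypothetical helper (not an MPI or OpenMP call): maps a node-local rank
// to a device index using round-robin assignment.
// Returns -1 when no devices are visible, e.g. when targeting the host.
int device_for_rank(int local_rank, int num_devices) {
    assert(local_rank >= 0);          // local ranks start at 0
    if (num_devices <= 0) return -1;  // guard: nothing to select
    return local_rank % num_devices;  // wrap around when ranks > devices
}
```

A caller would pass the result to omp_set_default_device (or cudaSetDevice) only when it is not -1.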

Do you think I’d better try SYCL with such requirements?

I’ve never used SYCL myself so can’t give any recommendations here.

I tried cudaSetDevice, but ran into a problem.
It works for a small allocation, and the right GPUs are used.
However, if the size is increased (to twice that of the attached example in my case, which is still far below what works on a single GPU), I get:

malloc: cuMemMallocManaged returns error code 709 for new pool allocation
(null): call to cuMemAllocManaged returned error 709: Context is destroyed or not yet created
malloc: cuMemMallocManaged returns error code 709 in pool allocation
new: call to cuMemAllocManaged returned error 709: Context is destroyed or not yet created

(highlights of the differences are mine).

What could be the reason and how to debug such errors?

The reproducing example is below. Using non-default devices (2 and 3 instead of 0) reproduces the problem more reliably.

two_stdpar.cpp (835 Bytes)
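
The attachment is not reproduced in this thread. As a rough illustration of the pattern being discussed (one std::thread per GPU, each calling cudaSetDevice before a parallel algorithm), a minimal sketch might look like the following. It is not the original two_stdpar.cpp. The function names and the USE_CUDA macro are assumptions made for this example: define USE_CUDA when building with nvc++ -stdpar, and leave it undefined to get a host-only fallback. Note that this is the approach that produced error 709 above for larger allocations, so treat it as a description of the experiment, not a recommended solution.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

#ifdef USE_CUDA               // hypothetical macro for this sketch
#include <cuda_runtime.h>
#include <execution>
#endif

// Sort one vector after selecting a device. cudaSetDevice applies to the
// calling host thread only, so each worker selects its own GPU.
void sort_on_device(int device, std::vector<int>& data) {
#ifdef USE_CUDA
    cudaSetDevice(device);    // return code ignored in this sketch
    std::sort(std::execution::par, data.begin(), data.end());
#else
    (void)device;             // host fallback: no device to select
    std::sort(data.begin(), data.end());
#endif
}

// Launch one worker thread per (device, chunk) pair and wait for all.
void sort_on_devices(const std::vector<int>& devices,
                     std::vector<std::vector<int>>& chunks) {
    assert(devices.size() == chunks.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        workers.emplace_back(sort_on_device, devices[i], std::ref(chunks[i]));
    }
    for (auto& w : workers) w.join();
}
```

Calling sort_on_devices({2, 3}, chunks) with two vectors in chunks would correspond to the non-default device choice mentioned above.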

Thanks SD57, this is an interesting example. It appears to me that the CUDA context doesn’t get inherited by the std::thread. I’ve sent the example to our C++ team to see if they have any suggestions or think it’s something we could support in the future, though I’m not sure how much control we’d have over the C++ thread creation, so it may or may not be possible.

The small case happens to work because we include a pool allocator for managed memory, so those allocations use the same context. Once the code needs more memory than the pool holds and cuMemAllocManaged is called directly, the context error occurs. You can get the larger size to work by increasing the amount of memory the pool uses, but it fails once the pool allocator is disabled.

Note that I still recommend using MPI for multiple GPUs, as each rank would have its own context, thus avoiding this issue.

% setenv NVCOMPILER_ACC_POOL_SIZE 8GB   << increase the pool allocator's memory size
% nvc++ -stdpar -fast two_stdpar.cpp ; a.out
Sorted
Sorted
% setenv NVCOMPILER_ACC_POOL_ALLOC 0   << disable the pool allocator
% a.out
new: call to cuMemAllocManaged returned error 709: Context is destroyed or not yet created
new: call to cuMemAllocManaged returned error 709: Context is destroyed or not yet created
Segmentation fault

Note that documentation about our Unified Memory Pool Allocator environment variables can be found in the HPC Compilers User's Guide Version 21.1 for ARM, OpenPower, x86.

FYI, I wrote a problem report, TPR #29620, with your code to see if we can get your example working as expected. It’s a good use case and we appreciate your efforts in testing.

Thank you. How can one track the progress on a TPR?
Please inform me if a non-MPI workaround is possible.

Sorry, but the TPR system is not publicly available. We do, however, update posts once an issue has been closed and the fix is available in a public release.