Multiple GPUs with nvc++ -stdpar

SD57 · February 15, 2021, 4:42pm

How can I use multiple GPUs installed on a system with nvc++ -stdpar?

MatColgrove · February 15, 2021, 9:52pm

C++17 stdpar itself only targets a single GPU. For multiple GPUs you’ll want to use MPI where each rank uses a different GPU.

SD57 · February 16, 2021, 8:21am

Is it possible to select a GPU per thread and not per process to benefit from the shared memory?

MatColgrove · February 16, 2021, 3:25pm

I assume you mean using OpenMP threads and shared memory. I have not tried using OpenMP with C++ stdpar, so I don’t know. In theory you should be able to get something to work, but in my experience with OpenACC and CUDA, using MPI to manage multiple GPUs is much easier and less error prone.

SD57 · February 17, 2021, 10:02am

I didn’t mean OpenMP specifically, but since you’ve raised this topic: there is a documented way to select the GPU for OpenMP offload, and another for CUDA. How does cudaSetDevice interplay with other programming models that do not use CUDA directly?

I was thinking of a model that does not use compiler directives, std::thread being the most standardised, and I would like to keep hardware-specific code as small as possible. Do you think I’d better try SYCL with such requirements?

MatColgrove · February 17, 2021, 7:12pm

How does cudaSetDevice interplay with other programming models that do not use CUDA directly?

C++17 itself does not have a way to select a particular device so you’d want to use cudaSetDevice in this case. Since our implementation is built on top of CUDA, it should set the correct device.

For OpenMP with Target Offload, you’ll want to use omp_set_default_device. In our implementation, this in turn calls cudaSetDevice. Here’s an example of how I set the device with MPI+OpenMP:

  int num_devices;
  int gpuId;
  MPI_Comm shmcomm;
  int local_rank;
  // ... later, after calling MPI_Init
  // Get the local rank number for the ranks on this system
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &shmcomm);
  MPI_Comm_rank(shmcomm, &local_rank);
  // Get the number of devices on this system
  num_devices = omp_get_num_devices();
  // When targeting the host num_devices will be 0, so add a guard
  if (num_devices > 0) {
    // Round robin the device assignment
    gpuId = local_rank % num_devices;
    // Set the device number
    omp_set_default_device(gpuId);
  }

Do you think I’d better try SYCL with such requirements?

I’ve never used SYCL myself so can’t give any recommendations here.
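As a minimal sketch, the round-robin assignment in the MPI+OpenMP example above can be isolated as a small pure function, which makes the zero-device guard easy to check on its own. The names pick_device and kNoDevice are hypothetical and are not part of MPI, OpenMP, or CUDA.

```cpp
// Hypothetical sentinel meaning "no device available, stay on the host".
constexpr int kNoDevice = -1;

// Hypothetical helper: map a node-local rank to a device index by round robin.
// Returns kNoDevice when there are no devices (for example when targeting the
// host) or when the rank is negative.
int pick_device(int local_rank, int num_devices) {
    if (num_devices <= 0 || local_rank < 0) {
        return kNoDevice;
    }
    return local_rank % num_devices;
}
```

In the snippet above, the result of pick_device(local_rank, omp_get_num_devices()) would be passed to omp_set_default_device only when it is not kNoDevice.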

SD57 · February 18, 2021, 9:04am

I tried cudaSetDevice, but ran into a problem.
It works for a small allocation, and the right GPUs are used.
However, if the size is increased (to twice that of the attached example in my case, which is still far below what works on a single GPU), I get:

malloc: cuMemMallocManaged returns error code 709 for new pool allocation
(null): call to cuMemAllocManaged returned error 709: Context is destroyed or not yet created
malloc: cuMemMallocManaged returns error code 709 in pool allocation
new: call to cuMemAllocManaged returned error 709: Context is destroyed or not yet created

(highlights of the differences are mine).

What could be the reason and how to debug such errors?

The reproducing example is attached below. Using non-default devices (2 and 3 instead of 0) makes the problem easier to reproduce.

two_stdpar.cpp (835 Bytes)

MatColgrove · February 18, 2021, 5:39pm

Thanks SD57, this is an interesting example. It appears to me that the CUDA context doesn’t get inherited by the std::thread. I’ve sent the example to our C++ team to see if they have any suggestions or think it’s something we could support in the future. I’m not sure how much control we’d have over the C++ thread creation, though, so it may or may not be possible.

This code happens to work because we include a pool allocator for the managed memory and so are using the same context. Once the code needs more memory than the pool provides and cuMemAllocManaged is called directly, the context error occurs. You can get the larger size to work by increasing the amount of memory the pool uses, but the code fails once the pool allocator is disabled.

Note that I still recommend using MPI for multiple GPUs, as each rank would have its own context, thus avoiding this issue.

% setenv NVCOMPILER_ACC_POOL_SIZE 8GB   << increase the pool allocator's memory size
% nvc++ -stdpar -fast two_stdpar.cpp ; a.out
Sorted
Sorted
% setenv NVCOMPILER_ACC_POOL_ALLOC 0   << disable the pool allocator
% a.out
new: call to cuMemAllocManaged returned error 709: Context is destroyed or not yet created
new: call to cuMemAllocManaged returned error 709: Context is destroyed or not yet created
Segmentation fault

Note that documentation about our Unified Memory Pool Allocator environment variables can be found at: HPC Compilers User's Guide Version 22.7 for ARM, OpenPower, x86

MatColgrove · February 18, 2021, 6:34pm

FYI, I wrote a problem report, TPR #29620, with your code to see if we can get your example working as expected. It’s a good use case and we appreciate your efforts in testing.

SD57 · February 19, 2021, 10:20am

Thank you. How can one track the progress on a TPR?
Please inform me if a non-MPI workaround is possible.

MatColgrove · February 19, 2021, 3:25pm

Sorry, but the TPR system is not publicly available. We do, however, update posts once the issue has been closed and the fix is available in a public release.

MatColgrove · January 2, 2024, 5:09pm

Hi SD57,

Sorry for the late notice, but engineering just let me know that TPR #29620 was fixed in our 23.3 release.

-Mat
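For readers who want to try the per-thread approach discussed in this thread, here is a minimal sketch of its structure: one std::thread per GPU, with the device selected on that thread before any parallel algorithm runs. It is not the attached two_stdpar.cpp. The helper name run_on_devices is hypothetical, and the device selection and the work are passed in as callables so the skeleton also builds without the CUDA toolkit.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: start one std::thread per device id. Each thread first
// calls select_device(id) and only then runs work(id), so the device is chosen
// on the same thread that later issues the parallel algorithms.
void run_on_devices(const std::vector<int>& device_ids,
                    const std::function<void(int)>& select_device,
                    const std::function<void(int)>& work) {
    std::vector<std::thread> threads;
    threads.reserve(device_ids.size());
    for (int id : device_ids) {
        threads.emplace_back([id, &select_device, &work] {
            select_device(id);  // e.g. a wrapper around cudaSetDevice(id)
            work(id);           // e.g. std::sort(std::execution::par, ...)
        });
    }
    // Wait for every per-device thread to finish.
    for (std::thread& t : threads) {
        t.join();
    }
}
```

When building with nvc++ -stdpar, select_device would wrap cudaSetDevice from the CUDA runtime (checking its return value) and work would call the standard parallel algorithms. As reported earlier in the thread, this pattern hit error 709 once allocations bypassed the pool allocator, so it assumes a release in which TPR #29620 is fixed (23.3 or later).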
