I played around a bit with using std::deque as the container in GPU-accelerated std-parallel code samples. It seems to work in general, and profiling with nsys shows the kernels being “only” ~30% slower than when using std::vector (probably due to the internal array of std::deque in libstdc++ being only 8 elements big instead of e.g. 32, plus the additional complexity of the iterator). The problem is that allocating big deques (multiple GB) takes enormously long (many minutes) due to the sheer number of calls to cudaMallocManaged, which is probably not optimized for this kind of usage.
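For context, the kind of code I mean looks roughly like this (a minimal sketch; the element type, size and the transform-based saxpy are just illustrative, the attached files are the real thing):

```cpp
#include <algorithm>
#include <deque>
#include <execution>

int main() {
    const std::size_t n = 1'000'000;   // illustrative size only
    const float a = 2.0f;
    std::deque<float> x(n, 1.0f), y(n, 2.0f);

    // saxpy: y = a * x + y; with -stdpar=gpu the parallel algorithm is offloaded to the GPU,
    // so the deque's memory has to end up in managed memory.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}
```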
I tried to use std::pmr memory resource adaptors (monotonic_buffer_resource and unsynchronized_pool_resource) to allow for bigger allocations. The first problem was that heap allocations through std::pmr::new_delete_resource (the default pmr memory resource adapted by the pool) don’t seem to get intercepted by nvc++ and exchanged for cudaMallocManaged. The explanation is probably that the definition of this resource is precompiled (no templates) and nvc++ only intercepts calls to new/delete in code that is actually compiled with it. This was easy to fix by writing my own new_delete_resource.
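My replacement is essentially just a pmr resource that forwards to operator new/delete from within nvc++-compiled code; a minimal sketch of the idea (the class name is made up, and whether the aligned new/delete overloads are also intercepted is an assumption on my part):

```cpp
#include <cstddef>
#include <memory_resource>
#include <new>

// A new/delete-backed memory_resource defined in nvc++-compiled code, so the
// calls to operator new/delete can be replaced with managed allocations by -stdpar.
class my_new_delete_resource final : public std::pmr::memory_resource {
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        return ::operator new(bytes, std::align_val_t{alignment});
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        ::operator delete(p, bytes, std::align_val_t{alignment});
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};
```

This then serves as the upstream of the pool or monotonic buffer resource, whose pointer in turn is handed to the pmr container.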
Now the whole thing seems to work (without taking many minutes), but it prints varying numbers of
free: cuMemFree returns error code 1
messages to the console, depending on the type of memory resource adaptor.
When running with compute-sanitizer --leak-check full, it looks like a lot of memory is being leaked, although I don’t trust that output completely, as it also shows a
Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not found" on CUDA API call to cuGetProcAddress.
before that (which might be related to my Pascal GPU not being supported by compute-sanitizer?).
After the free: cuMemFree returns error code 1 it shows
Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuMemFree_v2.
which seems like a helpful description of the problem.
For comparison I also tried using
thrust::mr::unsynchronized_pool_resource<thrust::mr::new_delete_resource>
instead, which works but gives the same free: cuMemFree returns error code 1 when the memory is freed. (Interestingly, compute-sanitizer has nothing to say about these frees in this case, though it still reports a lot of memory being leaked.) Something seems to be fundamentally different between normal heap memory and managed memory, which manifests when it is used through memory resources (alignment?).
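The Thrust variant is wired up roughly like this (a sketch from memory, so the exact headers, constructor defaults and sizes may be slightly off):

```cpp
#include <deque>
#include <thrust/mr/allocator.h>
#include <thrust/mr/new.h>
#include <thrust/mr/pool.h>

int main() {
    // Pool resource that draws larger blocks from new/delete, which nvc++
    // should again turn into managed allocations in -stdpar=gpu code.
    thrust::mr::new_delete_resource upstream;
    thrust::mr::unsynchronized_pool_resource<thrust::mr::new_delete_resource> pool(&upstream);

    // Old-school (non-polymorphic) allocator handle onto the pool, used as the deque's allocator.
    thrust::mr::allocator<float, decltype(pool)> alloc(&pool);
    std::deque<float, decltype(alloc)> d(1'000'000, 1.0f, alloc);
}
```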
When implementing an old-school non-polymorphic allocator that behaves like a pool or monotonic buffer resource, I would somewhat expect the same errors to pop up, but I haven’t tried it yet as it is a bit more work. Does anyone have an idea why this happens and how to fix it (either on my side or on nvc++'s, without using Thrust directly)?
I’m on “nvc++ 22.3-0 64-bit target on x86-64 Linux” with “gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0” on a GTX 1070. I compile with
nvc++ -O3 -std=c++20 -stdpar=gpu -gpu=cc61,cuda11.6
saxpy_deque_thrust_mr.cpp (1.2 KB)
saxpy_deque_pmr.cpp (1.8 KB)