Std parallelism, std::deque and std::pmr

I played around a bit with using std::deque as the container in GPU-accelerated std-parallelized code samples. It seems to work in general, and profiling with nsys shows kernels being “only” ~30% slower than when using std::vector (probably due to the internal array of std::deque in libstdc++ holding only 8 elements instead of e.g. 32, and the additional complexity of the iterator). The problem is that allocating big deques (multiple GB) takes enormously long (many minutes) due to the sheer number of calls to cudaMallocManaged, which is probably not optimized for this kind of usage.
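
For context, a minimal sketch of the kind of code I mean (not the attached files verbatim, just the general pattern): a saxpy over std::deque run with a parallel execution policy, which nvc++ -stdpar=gpu offloads to the GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <execution>

int main() {
    const std::size_t n = 1 << 26;
    const float a = 2.0f;

    // With -stdpar=gpu, the many chunk allocations done by std::deque
    // go through cudaMallocManaged, which is what makes construction slow.
    std::deque<float> x(n, 1.0f);
    std::deque<float> y(n, 2.0f);

    // y = a * x + y, offloaded to the GPU by nvc++ -stdpar=gpu
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}
```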

I tried to use std::pmr memory resource adaptors (monotonic_buffer_resource and unsynchronized_pool_resource) to get bigger, less frequent allocations. The first problem was that heap allocations through std::pmr::new_delete_resource (the default pmr memory resource adapted by the pool) don’t seem to get intercepted by nvc++ and exchanged for cudaMallocManaged. The explanation is probably that the definition of this resource is precompiled (no templates involved), and nvc++ only intercepts calls to new/delete in code that is actually compiled with it. This was easy to fix by writing my own new_delete_resource.
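
My replacement is essentially just a thin wrapper around the global operator new/delete, compiled by nvc++ itself so the calls do get intercepted (a sketch; the real code may differ slightly):

```cpp
#include <cstddef>
#include <memory_resource>
#include <new>

// Because this translation unit is compiled by nvc++ with -stdpar=gpu,
// the operator new/delete calls below should be replaced with managed
// allocations, unlike the precompiled std::pmr::new_delete_resource.
class stdpar_new_delete_resource final : public std::pmr::memory_resource {
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        return ::operator new(bytes, std::align_val_t{alignment});
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        ::operator delete(p, bytes, std::align_val_t{alignment});
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};
```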

Now the whole thing seems to work (without taking many minutes), but it prints out varying numbers (depending on the type of memory resource adaptor) of

free: cuMemFree returns error code 1

to the console.
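
The setup looks roughly like this (a sketch; stdpar_new_delete_resource is the hand-written resource from the sketch above):

```cpp
#include <deque>
#include <memory_resource>

int main() {
    stdpar_new_delete_resource upstream;                     // managed allocations
    std::pmr::unsynchronized_pool_resource pool(&upstream);  // fewer, larger upstream calls
    // or: std::pmr::monotonic_buffer_resource pool(&upstream);

    std::pmr::deque<float> x(1 << 26, 1.0f, &pool);
    // ... same saxpy as before; the cuMemFree errors show up when the
    // pool and deque are destroyed at the end of scope.
}
```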

When running with compute-sanitizer --leak-check full, it looks like a lot of memory is being leaked, although I don’t trust that output completely, as it also shows a

Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not found" on CUDA API call to cuGetProcAddress.

before that (might be related to my Pascal GPU not being supported by compute-sanitizer?).

After the free: cuMemFree returns error code 1, it shows

Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuMemFree_v2.

which seems like a helpful description of the problem.

For comparison I also tried using

thrust::mr::unsynchronized_pool_resource<thrust::mr::new_delete_resource>

instead, which works but also gives the same free: cuMemFree returns error code 1 on freeing the memory. (Interestingly, compute-sanitizer has nothing to say about these frees in this case; it still reports a lot of memory being leaked, though.) Something seems fundamentally different between normal heap memory and managed memory that only manifests when it is used through memory resources (alignment?).
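
The Thrust variant was wired up roughly like this (a sketch, assuming Thrust’s thrust::mr::allocator adaptor; the attached file may differ in details):

```cpp
#include <deque>

#include <thrust/mr/allocator.h>
#include <thrust/mr/new.h>
#include <thrust/mr/pool.h>

int main() {
    thrust::mr::new_delete_resource upstream;
    thrust::mr::unsynchronized_pool_resource<thrust::mr::new_delete_resource>
        pool(&upstream);

    // std-compatible allocator handle pointing at the pool
    thrust::mr::allocator<float, decltype(pool)> alloc(&pool);

    std::deque<float, decltype(alloc)> x(1 << 26, 1.0f, alloc);
    // works, but freeing still triggers "cuMemFree returns error code 1"
}
```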

When implementing an old-school non-polymorphic allocator behaving like a pool or monotonic buffer resource (roughly as sketched below), I would somewhat expect the same errors to pop up, but I haven’t tried it yet as it is a bit more work. Does anyone have an idea why this happens and how to fix it (on my side or on nvc++'s side, without using Thrust directly)?
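
For clarity, by an old-school non-polymorphic allocator I mean something along these lines (an untested sketch: one big block via operator new, i.e. managed memory under -stdpar=gpu, handed out monotonically):

```cpp
#include <cstddef>
#include <new>

// Untested sketch: one big upfront allocation, sliced up monotonically.
// deallocate is a no-op; the block is released when the arena dies.
struct arena {
    std::byte* base;
    std::size_t capacity;
    std::size_t used = 0;

    explicit arena(std::size_t bytes)
        : base(static_cast<std::byte*>(::operator new(bytes))), capacity(bytes) {}
    ~arena() { ::operator delete(base); }

    void* allocate(std::size_t bytes, std::size_t align) {
        std::size_t offset = (used + align - 1) / align * align;  // no overflow check
        used = offset + bytes;
        return base + offset;
    }
};

template <typename T>
struct arena_allocator {
    using value_type = T;
    arena* a;

    explicit arena_allocator(arena* a) : a(a) {}
    template <typename U>
    arena_allocator(const arena_allocator<U>& other) : a(other.a) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(a->allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}  // monotonic: no-op

    template <typename U>
    bool operator==(const arena_allocator<U>& o) const { return a == o.a; }
    template <typename U>
    bool operator!=(const arena_allocator<U>& o) const { return a != o.a; }
};
```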

I’m on “nvc++ 22.3-0 64-bit target on x86-64 Linux” with “gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0” on a GTX 1070. I compile with

`nvc++ -O3 -std=c++20 -stdpar=gpu -gpu=cc61,cuda11.6`

saxpy_deque_thrust_mr.cpp (1.2 KB)
saxpy_deque_pmr.cpp (1.8 KB)


Thanks paleonix. This one I’ll need to pass on to our C++ compiler team to get a better understanding of what’s going on. The engineer best suited to look into it isn’t feeling well today, but he will take a look as soon as he’s able.

-Mat


Hi paleonix,

Apologies for the late update. Engineering was able to fix this issue in our 22.7 release.

-Mat
