I played around a bit with using std::deque as the container in GPU-accelerated std-parallel code samples. It seems to work in general, and profiling with nsys shows the kernels being “only” ~30% slower than when using std::vector (probably due to the internal array of std::deque in libstdc++ being only 8 elements big instead of e.g. 32, plus the additional complexity of the iterator). The problem is that allocating big deques (multiple GB) takes enormously long (many minutes) due to the sheer number of calls to cudaMallocManaged, which is probably not optimized for this kind of usage.
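For context, the kind of code I mean looks roughly like this (a minimal sketch; the element type, size and the transform-based saxpy are just illustrative, the attached files are the real thing):

```cpp
#include <algorithm>
#include <deque>
#include <execution>

int main() {
    const std::size_t n = 1'000'000;   // illustrative size only
    const float a = 2.0f;
    std::deque<float> x(n, 1.0f), y(n, 2.0f);

    // saxpy: y = a * x + y; with -stdpar=gpu the parallel algorithm is offloaded to the GPU,
    // so the deque's memory has to end up in managed memory.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}
```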
I tried to use std::pmr memory resource adaptors (monotonic_buffer_resource and unsynchronized_pool_resource) to allow for bigger allocations. The first problem was that heap allocations through std::pmr::new_delete_resource (the default pmr memory resource adapted by the pool) don’t seem to get intercepted by nvc++ and exchanged for cudaMallocManaged. The explanation is probably that the definition of this resource is precompiled (no templates) and nvc++ only intercepts calls to new/delete in code that is actually compiled with it. This was easy to fix by writing my own new_delete_resource.
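My replacement is essentially just a pmr resource that forwards to operator new/delete from within nvc++-compiled code; a minimal sketch of the idea (the class name is made up, and whether the aligned new/delete overloads are also intercepted is an assumption on my part):

```cpp
#include <cstddef>
#include <memory_resource>
#include <new>

// A new/delete-backed memory_resource defined in nvc++-compiled code, so the
// calls to operator new/delete can be replaced with managed allocations by -stdpar.
class my_new_delete_resource final : public std::pmr::memory_resource {
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        return ::operator new(bytes, std::align_val_t{alignment});
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        ::operator delete(p, bytes, std::align_val_t{alignment});
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};
```

This then serves as the upstream of the pool or monotonic buffer resource, whose pointer in turn is handed to the pmr container.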
Now the whole thing seems to work (without taking many minutes), but it prints varying numbers of
free: cuMemFree returns error code 1
messages to the console, depending on the type of memory resource adaptor.
When running with compute-sanitizer --leak-check full, it looks like a lot of memory is being leaked, although I don’t trust that output completely, as it also shows a
Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not found" on CUDA API call to cuGetProcAddress.
before that (which might be related to my Pascal GPU not being supported by compute-sanitizer?).
After the free: cuMemFree returns error code 1 it shows
Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuMemFree_v2.
which seems like a helpful description of the problem.
For comparison I also tried using
thrust::mr::unsynchronized_pool_resource<thrust::mr::new_delete_resource>
instead, which works but gives the same free: cuMemFree returns error code 1 when the memory is freed. (Interestingly, compute-sanitizer has nothing to say about these frees in this case, though it still reports a lot of memory being leaked.) Something seems to be fundamentally different between normal heap memory and managed memory, which manifests when it is used through memory resources (alignment?).
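The Thrust variant is wired up roughly like this (a sketch from memory, so the exact headers, constructor defaults and sizes may be slightly off):

```cpp
#include <deque>
#include <thrust/mr/allocator.h>
#include <thrust/mr/new.h>
#include <thrust/mr/pool.h>

int main() {
    // Pool resource that draws larger blocks from new/delete, which nvc++
    // should again turn into managed allocations in -stdpar=gpu code.
    thrust::mr::new_delete_resource upstream;
    thrust::mr::unsynchronized_pool_resource<thrust::mr::new_delete_resource> pool(&upstream);

    // Old-school (non-polymorphic) allocator handle onto the pool, used as the deque's allocator.
    thrust::mr::allocator<float, decltype(pool)> alloc(&pool);
    std::deque<float, decltype(alloc)> d(1'000'000, 1.0f, alloc);
}
```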
When implementing an old-school non-polymorphic allocator that behaves like a pool or monotonic buffer resource, I would somewhat expect the same errors to pop up, but I haven’t tried it yet as it is a bit more work. Does anyone have an idea why this happens and how to fix it (either on my side or on nvc++'s, without using Thrust directly)?
I’m on “nvc++ 22.3-0 64-bit target on x86-64 Linux” with “gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0” on a GTX 1070. I compile with
nvc++ -O3 -std=c++20 -stdpar=gpu -gpu=cc61,cuda11.6
saxpy_deque_thrust_mr.cpp (1.2 KB)
saxpy_deque_pmr.cpp (1.8 KB)