I played around a bit with using `std::deque` as the container in GPU-accelerated std-parallelized code samples. It seems to work in general, and profiling with `nsys` shows the kernels being “only” ~30% slower than when using `std::vector` (probably due to the internal block array of `std::deque` in libstdc++ being only 8 elements big instead of e.g. 32, plus the additional complexity of the iterator). The problem is that allocating big deques (multiple GB) takes enormously long (many minutes) due to the sheer number of calls to `cudaMallocManaged`, which is probably not optimized for this kind of usage.
I tried to use `std::pmr` memory resource adaptors (`unsynchronized_pool_resource`) to allow for bigger allocations. The first problem was that heap allocations through `std::pmr::new_delete_resource` (the default pmr memory resource being adapted by the pool) don’t seem to get intercepted by nvc++ and exchanged for `cudaMallocManaged`. The explanation for this is probably that the definition of this resource is precompiled (no templates), while nvc++ only intercepts calls to `new`/`delete` in code that is actually compiled with it. This was easy to fix by writing my own upstream memory resource in nvc++-compiled code.
Now the whole thing seems to work (without taking many minutes), but it prints out varying amounts (depending on the type of memory resource adaptor) of
`free: cuMemFree returns error code 1`
to the console.
When running with `compute-sanitizer --leak-check full`, it looks like a lot of memory is being leaked, although I don’t trust that output completely, as it also shows a
`Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not found" on CUDA API call to cuGetProcAddress.`
before that (might be related to my Pascal GPU not being supported by compute-sanitizer?).
For each `free: cuMemFree returns error code 1` it also shows
`Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuMemFree_v2.`
which seems like a helpful description of the problem.
For comparison I also tried using
instead, which works but also gives the same `free: cuMemFree returns error code 1` on freeing the memory. (Interestingly, `compute-sanitizer` has nothing to say about these frees in this case; it still reports a lot of memory being leaked, though.) Something seems fundamentally different between normal heap memory and managed memory, which manifests when it is used through memory resources (alignment?).
When implementing an old-school non-polymorphic allocator that behaves like a pool or monotonic buffer resource, I would somewhat expect the same errors to pop up, but I haven’t tried it yet, as it is a bit more work. Does anyone have an idea why this happens and how to fix it (on my side or on nvc++'s, without using Thrust directly)?
I’m on “nvc++ 22.3-0 64-bit target on x86-64 Linux” with “gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0” on a GTX 1070. I compile with
`nvc++ -O3 -std=c++20 -stdpar=gpu -gpu=cc61,cuda11.6`