Hello NV team,
I wanted to use the std::execution::par_unseq execution policy to sort a std::vector<size_t> with nvc++ -stdpar, but I encountered an issue: as I gradually increase the size of the range being sorted, GPU memory usage keeps growing, eventually far exceeding the memory required to sort the entire std::vector.
Here is a minimal example:
#include <iostream>
#include <vector>
#include <algorithm>
#include <execution>
int main(int argc, char const *argv[]) {
    size_t mode = 0;
    if (argc > 1) {
        mode = std::stoi(argv[1]);
    }

    size_t nbSteps = 100000;
    size_t deltaStep = 5;
    std::vector<size_t> target(nbSteps * deltaStep, 1);

    for (size_t i = 0; i < nbSteps; ++i) {
        std::cerr << "Steps " << i << "\r";
        if (mode == 0) {
            std::sort(std::execution::par_unseq, target.begin(), target.end());
        } else {
            std::sort(std::execution::par_unseq, target.begin(), target.begin() + (deltaStep * (i + 1)));
        }
    }
    return 0;
}
With mode == 0, the entire vector is sorted at each iteration, whereas with mode == 1, only a growing portion of the vector is sorted.
I monitored the GPU memory usage for both cases:
- The first case (mode == 0) uses approximately 500 MB.
- The second case (mode == 1) consumes 77 GB.
I understand that the temporary sorting buffers need to grow in the second case, but it appears that previously allocated buffers are neither reused nor freed.
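As a rough back-of-the-envelope check of that hypothesis (assuming every call leaks one temporary buffer roughly the size of the sorted sub-range): iteration i sorts deltaStep * (i + 1) elements of 8 bytes, so after N iterations the leaked total would be on the order of 8 * 5 * N^2 / 2 = 20 * N^2 bytes, which reaches ~77 GB around iteration 62000. That is at least consistent in order of magnitude with what I observe.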
Since std::sort(std::execution::par_unseq, ...) is built on top of Thrust, I tested an equivalent scenario using Thrust directly:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sort.h>
int main(int argc, char const *argv[]) {
    size_t mode = 0;
    if (argc > 1) {
        mode = std::stoi(argv[1]);
    }

    size_t nbSteps = 100000;
    size_t deltaStep = 5;
    thrust::device_vector<size_t> target(nbSteps * deltaStep, 1);

    for (size_t i = 0; i < nbSteps; ++i) {
        std::cerr << "Steps " << i << "\r";
        if (mode == 0) {
            thrust::sort(thrust::device, target.begin(), target.end());
        } else {
            thrust::sort(thrust::device, target.begin(), target.begin() + (deltaStep * (i + 1)));
        }
    }
    return 0;
}
In this case, GPU memory usage remains stable for both branches, staying around 500 MB.
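A possible workaround sketch (not something I have benchmarked; it assumes that under -stdpar=gpu the std::vector storage is allocated from CUDA managed memory, so its raw pointer is device-accessible) would be to keep the data in a std::vector but sort the growing prefix through thrust::sort directly:
#include <iostream>
#include <vector>
#include <thrust/execution_policy.h>
#include <thrust/sort.h>

int main() {
    size_t nbSteps = 100000;
    size_t deltaStep = 5;
    // Under -stdpar=gpu this allocation should come from CUDA managed memory,
    // so the raw pointer below is assumed to be device-accessible.
    std::vector<size_t> target(nbSteps * deltaStep, 1);

    for (size_t i = 0; i < nbSteps; ++i) {
        std::cerr << "Steps " << i << "\r";
        // Sort only the growing prefix, but through thrust::sort, which kept
        // memory usage stable in the experiment above.
        thrust::sort(thrust::device, target.data(), target.data() + deltaStep * (i + 1));
    }
    return 0;
}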
I also performed a similar test using std::sort(std::execution::par_unseq, ...) to sort a std::vector<std::array<size_t, 2>>, which should not use a radix sort. With mode == 1, memory usage grew to roughly 80 GB before reaching the 100k iterations, while it remained around 500 MB with mode == 0.
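A minimal sketch of that variant (the initial values are illustrative; std::array<size_t, 2> provides a lexicographic operator<, so no custom comparator is needed):
#include <algorithm>
#include <array>
#include <execution>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char const *argv[]) {
    size_t mode = 0;
    if (argc > 1) {
        mode = std::stoi(argv[1]);
    }

    size_t nbSteps = 100000;
    size_t deltaStep = 5;
    // A comparison-based sort is expected here instead of a radix sort.
    std::array<size_t, 2> fill{1, 1};
    std::vector<std::array<size_t, 2>> target(nbSteps * deltaStep, fill);

    for (size_t i = 0; i < nbSteps; ++i) {
        std::cerr << "Steps " << i << "\r";
        if (mode == 0) {
            std::sort(std::execution::par_unseq, target.begin(), target.end());
        } else {
            std::sort(std::execution::par_unseq, target.begin(), target.begin() + (deltaStep * (i + 1)));
        }
    }
    return 0;
}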
Is this known or expected behavior?
GPU memory usage was monitored using nvidia-smi --query-gpu=memory.used.
The test cases were compiled with:
nvc++ main.cpp -Wall -Wextra -fast -O3 -DNDEBUG -march=native -std=c++20 -Minfo=accel -stdpar=gpu -std=gnu++20 -o main
I tested these codes using nvc++ versions 23.3 and 24.9 (the most recent version I have access to).