Memory usage of std::sort(par_unseq, ...) on growing range with stdpar

Hello NV team,

I wanted to use the std::execution::par_unseq execution policy to sort a std::vector<size_t> with nvc++ -stdpar, but I ran into an issue: as I gradually increase the size of the range being sorted, GPU memory usage keeps growing, eventually far exceeding the memory required to sort the entire std::vector.

Here is a minimal example:

#include <iostream>
#include <vector>
#include <algorithm>
#include <execution>

int main(int argc, char const *argv[]) {
    size_t mode = 0;
    if (argc > 1) {
        mode = std::stoi(argv[1]);
    }

    size_t nbSteps = 100000;
    size_t deltaStep = 5;
    std::vector<size_t> target(nbSteps * deltaStep, 1);

    for (size_t i = 0; i < nbSteps; ++i) {
        std::cerr << "Steps " << i << "\r";

        if (mode == 0) {
            // Sort the whole vector on every iteration.
            std::sort(std::execution::par_unseq, target.begin(), target.end());
        } else {
            // Sort a prefix that grows by deltaStep elements each iteration.
            std::sort(std::execution::par_unseq, target.begin(),
                      target.begin() + (deltaStep * (i + 1)));
        }
    }

    return 0;
}

With mode == 0, the entire vector is sorted at each iteration, whereas with mode == 1, only a growing prefix of the vector is sorted.
I monitored the GPU memory usage for both cases:

  • The first case (mode == 0) uses approximately 500 MB.
  • The second case (mode == 1) consumes 77 GB.

I understand that the sorting scratch buffers need to grow in the second case, but it appears that previously allocated buffers are neither reused nor freed.
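
For reference, here is a minimal sketch of how the growth can also be observed from inside the program itself, using the CUDA runtime's cudaMemGetInfo (the helper function below is my own, not part of the tests):

#include <cuda_runtime.h>
#include <iostream>

// Print the device's current memory usage via the CUDA runtime.
// Called once per loop iteration, this should show the same growth
// that nvidia-smi reports.
void printUsedDeviceMemory(size_t step) {
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) == cudaSuccess) {
        std::cerr << "Step " << step << ": used "
                  << (totalBytes - freeBytes) / (1024.0 * 1024.0) << " MB\n";
    }
}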

Since std::sort(std::execution::par_unseq, ...) is built on top of Thrust, I tested an equivalent scenario using Thrust directly:

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main(int argc, char const *argv[]) {
    size_t mode = 0;
    if (argc > 1) {
        mode = std::stoi(argv[1]);
    }

    size_t nbSteps = 100000;
    size_t deltaStep = 5;

    thrust::device_vector<size_t> target(nbSteps * deltaStep, 1);

    for (size_t i = 0; i < nbSteps; ++i) {
        std::cerr << "Steps " << i << "\r";

        if (mode == 0) {
            thrust::sort(thrust::device, target.begin(), target.end());
        } else {
            thrust::sort(thrust::device, target.begin(),
                         target.begin() + (deltaStep * (i + 1)));
        }
    }

    return 0;
}

In this case, GPU memory usage remains stable for both branches, staying around 500 MB.

I also performed a similar test using std::sort(std::execution::par_unseq, ...) to sort a std::vector<std::array<size_t, 2>>, which would not use a radix sort. Memory usage increased to about 80 GB before reaching the 100k iterations with mode == 1, while remaining around 500 MB with mode == 0.
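
For completeness, here is a minimal sketch of that variant, assuming the same growing-prefix loop as the first example (std::array already compares lexicographically, so no custom comparator is needed):

#include <algorithm>
#include <array>
#include <execution>
#include <iostream>
#include <vector>

int main() {
    size_t nbSteps = 100000;
    size_t deltaStep = 5;
    // Pairs instead of plain integers: sorting a non-arithmetic type
    // takes the comparison-based path rather than the radix sort path.
    std::vector<std::array<size_t, 2>> target(nbSteps * deltaStep, {1, 1});

    for (size_t i = 0; i < nbSteps; ++i) {
        std::cerr << "Steps " << i << "\r";
        // Growing-prefix case (mode == 1 in the original example).
        std::sort(std::execution::par_unseq, target.begin(),
                  target.begin() + (deltaStep * (i + 1)));
    }
    return 0;
}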

Is this a known or expected behavior?

GPU memory usage was monitored using nvidia-smi --query-gpu=memory.used.
The test cases were compiled with:
nvc++ main.cpp -Wall -Wextra -fast -O3 -DNDEBUG -march=native -std=c++20 -Minfo=accel -stdpar=gpu -std=gnu++20 -o main
I tested these codes using nvc++ versions 23.3 and 24.9 (the most recent version I have access to).

Hi Maxime,

Thanks for the report and great example!

I was able to reproduce the issue but wasn’t able to determine the root cause. Hence I created a problem report, TPR #37213, and sent it to engineering for investigation.

-Mat


Hi Maxime,

Engineering investigated and let me know that the problem is with Thrust’s caching allocator: it’s not freeing or reusing old blocks. We’ve submitted a report to the Thrust team to investigate.

If you change your Thrust code to use the following, it should reproduce the error:

  thrust::sort(::thrust::cuda::par(
                  ::thrust::detail::single_device_tls_caching_allocator()),
               target.begin(), target.begin() + (deltaStep * (i + 1)));
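
As a possible interim workaround for code that calls Thrust directly, passing a plain allocator that frees its blocks immediately should sidestep the caching behavior altogether. Below is a minimal sketch patterned after Thrust's custom_temporary_allocation example; the allocator name is hypothetical and we haven't formally validated this:

#include <cuda_runtime.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/sort.h>
#include <cstddef>
#include <new>

// Hypothetical non-caching allocator for Thrust's temporary storage:
// every scratch buffer is cudaMalloc'd on demand and cudaFree'd as soon
// as the algorithm releases it, so nothing accumulates across calls.
struct no_cache_allocator {
    using value_type = char;

    char *allocate(std::ptrdiff_t num_bytes) {
        char *ptr = nullptr;
        if (cudaMalloc(&ptr, num_bytes) != cudaSuccess) {
            throw std::bad_alloc();
        }
        return ptr;
    }

    void deallocate(char *ptr, size_t) { cudaFree(ptr); }
};

// Usage inside the loop from the Thrust test above:
//   no_cache_allocator alloc;
//   thrust::sort(thrust::cuda::par(alloc),
//                target.begin(), target.begin() + (deltaStep * (i + 1)));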

-Mat

Hi Mat,

Thank you for the investigation and the quick feedback. The suggested change to the Thrust code does reproduce the error on my end as well. I appreciate the update and the report to the Thrust team. I’ll keep an eye out for any further developments.

Maxime