Memory usage of std::sort(par_unseq, ...) on growing range with stdpar

Hello NV team,

I wanted to use the std::execution::par_unseq execution policy to sort a std::vector<size_t> with nvc++ -stdpar, but I ran into an issue: as I gradually increase the size of the range being sorted, GPU memory usage keeps growing, eventually far exceeding the memory required to sort the entire std::vector.

Here is a minimal example:

#include <iostream>
#include <vector>
#include <algorithm>
#include <execution>

int main(int argc, char const *argv[]) {
    size_t mode = 0;
    if (argc > 1) {
        mode = std::stoi(argv[1]);
    }

    size_t nbSteps = 100000;
    size_t deltaStep = 5;
    std::vector<size_t> target(nbSteps * deltaStep, 1);

    for (size_t i = 0; i < nbSteps; ++i) {
        std::cerr << "Steps " << i << "\r";

        if (mode == 0) {
            // Sort the whole vector on every iteration.
            std::sort(std::execution::par_unseq, target.begin(), target.end());
        } else {
            // Sort a prefix that grows by deltaStep elements each iteration.
            std::sort(std::execution::par_unseq, target.begin(),
                      target.begin() + (deltaStep * (i + 1)));
        }
    }

    return 0;
}

With mode == 0, the entire vector is sorted at each iteration, whereas with mode == 1, only a growing prefix of the vector is sorted.
I monitored the GPU memory usage for both cases:

  • The first case (mode == 0) uses approximately 500 MB.
  • The second case (mode == 1) consumes 77 GB.

I understand that the sorting scratch buffers need to grow in the second case, but it appears that previously allocated buffers are neither reused nor freed.
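
For reference, here is a minimal sketch of how the growth can also be observed from inside the program itself, using the CUDA runtime's cudaMemGetInfo (the helper function below is my own, not part of the tests):

#include <cuda_runtime.h>
#include <iostream>

// Print the device's current memory usage via the CUDA runtime.
// Called once per loop iteration, this should show the same growth
// that nvidia-smi reports.
void printUsedDeviceMemory(size_t step) {
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) == cudaSuccess) {
        std::cerr << "Step " << step << ": used "
                  << (totalBytes - freeBytes) / (1024.0 * 1024.0) << " MB\n";
    }
}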

Since std::sort(std::execution::par_unseq, ...) is built on top of Thrust, I tested an equivalent scenario using Thrust directly:

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main(int argc, char const *argv[]) {
    size_t mode = 0;
    if (argc > 1) {
        mode = std::stoi(argv[1]);
    }

    size_t nbSteps = 100000;
    size_t deltaStep = 5;

    thrust::device_vector<size_t> target(nbSteps * deltaStep, 1);

    for (size_t i = 0; i < nbSteps; ++i) {
        std::cerr << "Steps " << i << "\r";

        if (mode == 0) {
            thrust::sort(thrust::device, target.begin(), target.end());
        } else {
            thrust::sort(thrust::device, target.begin(),
                         target.begin() + (deltaStep * (i + 1)));
        }
    }

    return 0;
}

In this case, GPU memory usage remains stable for both branches, staying around 500 MB.

I also performed a similar test using std::sort(std::execution::par_unseq, ...) to sort a std::vector<std::array<size_t, 2>>, which would not use a radix sort. Memory usage increased to about 80 GB before reaching the 100k iterations with mode == 1, while remaining around 500 MB with mode == 0.
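
For completeness, here is a minimal sketch of that variant, assuming the same growing-prefix loop as the first example (std::array already compares lexicographically, so no custom comparator is needed):

#include <algorithm>
#include <array>
#include <execution>
#include <iostream>
#include <vector>

int main() {
    size_t nbSteps = 100000;
    size_t deltaStep = 5;
    // Pairs instead of plain integers: sorting a non-arithmetic type
    // takes the comparison-based path rather than the radix sort path.
    std::vector<std::array<size_t, 2>> target(nbSteps * deltaStep, {1, 1});

    for (size_t i = 0; i < nbSteps; ++i) {
        std::cerr << "Steps " << i << "\r";
        // Growing-prefix case (mode == 1 in the original example).
        std::sort(std::execution::par_unseq, target.begin(),
                  target.begin() + (deltaStep * (i + 1)));
    }
    return 0;
}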

Is this a known or expected behavior?

GPU memory usage was monitored using nvidia-smi --query-gpu=memory.used.
The test cases were compiled with:
nvc++ main.cpp -Wall -Wextra -fast -O3 -DNDEBUG -march=native -std=c++20 -Minfo=accel -stdpar=gpu -std=gnu++20 -o main
I tested these codes using nvc++ versions 23.3 and 24.9 (the most recent version I have access to).

Hi Maxime,

Thanks for the report and great example!

I was able to reproduce the issue but wasn’t able to determine the root cause. Hence I created a problem report, TPR #37213, and sent it to engineering for investigation.

-Mat


Hi Maxime,

Engineering investigated and let me know that the problem is with Thrust’s caching allocator: it’s not freeing or reusing old blocks. We’ve submitted a report to the Thrust team to investigate.

If you change your Thrust code to use the following, it should reproduce the error:

  thrust::sort(::thrust::cuda::par(
                  ::thrust::detail::single_device_tls_caching_allocator()),
               target.begin(), target.begin() + (deltaStep * (i + 1)));
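
As a possible interim workaround for code that calls Thrust directly, passing a plain allocator that frees its blocks immediately should sidestep the caching behavior altogether. Below is a minimal sketch patterned after Thrust's custom_temporary_allocation example; the allocator name is hypothetical and we haven't formally validated this:

#include <cuda_runtime.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/sort.h>
#include <cstddef>
#include <new>

// Hypothetical non-caching allocator for Thrust's temporary storage:
// every scratch buffer is cudaMalloc'd on demand and cudaFree'd as soon
// as the algorithm releases it, so nothing accumulates across calls.
struct no_cache_allocator {
    using value_type = char;

    char *allocate(std::ptrdiff_t num_bytes) {
        char *ptr = nullptr;
        if (cudaMalloc(&ptr, num_bytes) != cudaSuccess) {
            throw std::bad_alloc();
        }
        return ptr;
    }

    void deallocate(char *ptr, size_t) { cudaFree(ptr); }
};

// Usage inside the loop from the Thrust test above:
//   no_cache_allocator alloc;
//   thrust::sort(thrust::cuda::par(alloc),
//                target.begin(), target.begin() + (deltaStep * (i + 1)));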

-Mat

Hi Mat,

Thank you for the investigation and the quick feedback. The suggested change to the Thrust code does reproduce the error on my end as well. I appreciate the update and the report to the Thrust team. I’ll keep an eye out for any further developments.

Maxime