When using constant-length data, I can cache the allocation for CUB’s sorting algorithms, but with Thrust’s radix sort (used by sort_by_key for integer keys), the profiler shows the running time is dominated by the cudaMalloc/cudaFree calls that Thrust makes internally.
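For reference, this is the caching pattern that already works for me with CUB: the first SortPairs call with a null temp pointer only reports the required temp-storage size, so the buffer can be allocated once and reused for every subsequent same-length sort (a minimal sketch; the function name and the 1000-iteration loop are just for illustration):

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// d_keys_in/out and d_vals_in/out are device buffers of length n.
void sort_many_times(int* d_keys_in, int* d_keys_out,
                     int* d_vals_in, int* d_vals_out, int n)
{
  void*  d_temp     = nullptr;
  size_t temp_bytes = 0;

  // With d_temp == nullptr, this call only computes temp_bytes.
  cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                  d_keys_in, d_keys_out,
                                  d_vals_in, d_vals_out, n);
  cudaMalloc(&d_temp, temp_bytes);  // one allocation up front

  for (int i = 0; i < 1000; ++i)    // no cudaMalloc/cudaFree in here
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out, n);

  cudaFree(d_temp);                 // one free at the end
}
```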
I guess that’s because radix sort requires a lot of temporary memory for binning.
Is there a way to somehow “cache” the allocations/deallocations for sort_by_key? (I also tried merge sort via the regular Thrust sort function, but it’s slower than sort_by_key’s ultra-fast radix sort even when the allocation/cudaFree overhead is excluded.) It would be nice to hand it preallocated bins when sorting same-length arrays 1000 times in a row. The allocation itself is fast, but somehow the cudaFree after each sort takes a lot of time.
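From what I can tell, the intended hook for this in Thrust is a custom temporary allocator: the algorithms accept an execution policy, and thrust::cuda::par(alloc) routes temporary allocations through alloc’s allocate/deallocate. Modeled on Thrust’s custom_temporary_allocation example, a caching allocator would look roughly like this (a sketch, not production code):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <map>

// Caching allocator modeled on Thrust's custom_temporary_allocation
// example: deallocate() parks blocks in a free list instead of calling
// cudaFree, and allocate() reuses a parked block of the same size.
struct cached_allocator
{
  typedef char value_type;  // required by thrust::cuda::par(alloc)

  std::multimap<std::ptrdiff_t, char*> free_blocks;      // size -> block
  std::map<char*, std::ptrdiff_t>      allocated_blocks; // block -> size

  char* allocate(std::ptrdiff_t num_bytes)
  {
    char* ptr = nullptr;
    auto it = free_blocks.find(num_bytes);
    if (it != free_blocks.end())
    {
      ptr = it->second;             // hot path: reuse a cached block
      free_blocks.erase(it);
    }
    else
    {
      cudaMalloc(&ptr, num_bytes);  // cold path: real allocation
    }
    allocated_blocks.emplace(ptr, num_bytes);
    return ptr;
  }

  void deallocate(char* ptr, size_t)
  {
    auto it = allocated_blocks.find(ptr);
    free_blocks.emplace(it->second, ptr);  // park instead of cudaFree
    allocated_blocks.erase(it);
  }

  ~cached_allocator()
  {
    for (auto& b : free_blocks)      cudaFree(b.second);
    for (auto& b : allocated_blocks) cudaFree(b.first);
  }
};

int main()
{
  const int n = 1 << 20;
  thrust::device_vector<int> keys(n), vals(n);
  cached_allocator alloc;

  // After the first iteration, sort_by_key's temporary storage is
  // served from the cache, so no cudaMalloc/cudaFree on the hot path.
  for (int i = 0; i < 1000; ++i)
    thrust::sort_by_key(thrust::cuda::par(alloc),
                        keys.begin(), keys.end(), vals.begin());
  return 0;
}
```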
I was trying to find a shorter way to overcome the extra latency; from its source code, there’s too much customization required. Before that, I tried something much simpler: I separated all 7 arrays, used a single array of gather indices, sorted only that, and then used it to gather the 7 value arrays (they’re in struct-of-arrays format; previously, too much data was copied during each sort). This reduced the ratio of cudaFree latency to total latency.
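The gather version looks roughly like this (a sketch; only 2 of the 7 value arrays are shown, and the function name is just for illustration):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/gather.h>

// Sort an index array by the keys once, then permute each value
// array through the sorted indices, instead of sorting 7 value arrays.
void sort_via_gather(thrust::device_vector<int>&   keys,
                     thrust::device_vector<float>& vals_a,
                     thrust::device_vector<float>& vals_b /*, ...5 more */)
{
  const size_t n = keys.size();
  thrust::device_vector<int> idx(n);
  thrust::sequence(idx.begin(), idx.end());  // idx = 0, 1, ..., n-1

  // Only the key/index pair goes through radix sort.
  thrust::sort_by_key(keys.begin(), keys.end(), idx.begin());

  // One gather per value array.
  thrust::device_vector<float> tmp(n);
  thrust::gather(idx.begin(), idx.end(), vals_a.begin(), tmp.begin());
  vals_a.swap(tmp);
  thrust::gather(idx.begin(), idx.end(), vals_b.begin(), tmp.begin());
  vals_b.swap(tmp);
  // ...repeat for the remaining value arrays
}
```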
Sorting only 1 array (the gather indices) instead of 7 value arrays made it 2x faster. If I can find a shorter version of the custom allocator, I’ll try that too. Thank you.