When using constant-length data, I can cache the allocation for CUB’s sorting algorithms, but with Thrust’s radix sort (used by sort_by_key for integer keys), the profiler shows the running time is dominated by the cudaMalloc/cudaFree calls that Thrust makes internally.
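For reference, this is the caching pattern that already works for me with CUB: the first SortPairs call with a null temp pointer only reports the required temp-storage size, so the buffer can be allocated once and reused for every subsequent same-length sort (a minimal sketch; the function name and the 1000-iteration loop are just for illustration):

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// d_keys_in/out and d_vals_in/out are device buffers of length n.
void sort_many_times(int* d_keys_in, int* d_keys_out,
                     int* d_vals_in, int* d_vals_out, int n)
{
  void*  d_temp     = nullptr;
  size_t temp_bytes = 0;

  // With d_temp == nullptr, this call only computes temp_bytes.
  cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                  d_keys_in, d_keys_out,
                                  d_vals_in, d_vals_out, n);
  cudaMalloc(&d_temp, temp_bytes);  // one allocation up front

  for (int i = 0; i < 1000; ++i)    // no cudaMalloc/cudaFree in here
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out, n);

  cudaFree(d_temp);                 // one free at the end
}
```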
I guess that’s because radix sort requires a lot of temporary memory for binning.
Is there a way to somehow “cache” the allocations/deallocations for sort_by_key? (I also tried merge sort via the regular Thrust sort function, but it’s slower than sort_by_key’s ultra-fast radix sort even when the allocation/cudaFree overhead is excluded.) It would be nice to hand it preallocated bins when sorting same-length arrays 1000 times in a row. The allocation itself is fast, but somehow the cudaFree after each sort takes a lot of time.
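From what I can tell, the intended hook for this in Thrust is a custom temporary allocator: the algorithms accept an execution policy, and thrust::cuda::par(alloc) routes temporary allocations through alloc’s allocate/deallocate. Modeled on Thrust’s custom_temporary_allocation example, a caching allocator would look roughly like this (a sketch, not production code):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <map>

// Caching allocator modeled on Thrust's custom_temporary_allocation
// example: deallocate() parks blocks in a free list instead of calling
// cudaFree, and allocate() reuses a parked block of the same size.
struct cached_allocator
{
  typedef char value_type;  // required by thrust::cuda::par(alloc)

  std::multimap<std::ptrdiff_t, char*> free_blocks;      // size -> block
  std::map<char*, std::ptrdiff_t>      allocated_blocks; // block -> size

  char* allocate(std::ptrdiff_t num_bytes)
  {
    char* ptr = nullptr;
    auto it = free_blocks.find(num_bytes);
    if (it != free_blocks.end())
    {
      ptr = it->second;             // hot path: reuse a cached block
      free_blocks.erase(it);
    }
    else
    {
      cudaMalloc(&ptr, num_bytes);  // cold path: real allocation
    }
    allocated_blocks.emplace(ptr, num_bytes);
    return ptr;
  }

  void deallocate(char* ptr, size_t)
  {
    auto it = allocated_blocks.find(ptr);
    free_blocks.emplace(it->second, ptr);  // park instead of cudaFree
    allocated_blocks.erase(it);
  }

  ~cached_allocator()
  {
    for (auto& b : free_blocks)      cudaFree(b.second);
    for (auto& b : allocated_blocks) cudaFree(b.first);
  }
};

int main()
{
  const int n = 1 << 20;
  thrust::device_vector<int> keys(n), vals(n);
  cached_allocator alloc;

  // After the first iteration, sort_by_key's temporary storage is
  // served from the cache, so no cudaMalloc/cudaFree on the hot path.
  for (int i = 0; i < 1000; ++i)
    thrust::sort_by_key(thrust::cuda::par(alloc),
                        keys.begin(), keys.end(), vals.begin());
  return 0;
}
```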
I was trying to find a shorter way to overcome the extra latency; from its source code, there’s too much customization required. Before that, I tried something much simpler: I separated all 7 arrays, used a single array of gather indices, sorted only that, and then used it to gather the 7 value arrays (they’re in struct-of-arrays format; previously, too much data was copied during each sort). This reduced the ratio of cudaFree latency to total latency.
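The gather version looks roughly like this (a sketch; only 2 of the 7 value arrays are shown, and the function name is just for illustration):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/gather.h>

// Sort an index array by the keys once, then permute each value
// array through the sorted indices, instead of sorting 7 value arrays.
void sort_via_gather(thrust::device_vector<int>&   keys,
                     thrust::device_vector<float>& vals_a,
                     thrust::device_vector<float>& vals_b /*, ...5 more */)
{
  const size_t n = keys.size();
  thrust::device_vector<int> idx(n);
  thrust::sequence(idx.begin(), idx.end());  // idx = 0, 1, ..., n-1

  // Only the key/index pair goes through radix sort.
  thrust::sort_by_key(keys.begin(), keys.end(), idx.begin());

  // One gather per value array.
  thrust::device_vector<float> tmp(n);
  thrust::gather(idx.begin(), idx.end(), vals_a.begin(), tmp.begin());
  vals_a.swap(tmp);
  thrust::gather(idx.begin(), idx.end(), vals_b.begin(), tmp.begin());
  vals_b.swap(tmp);
  // ...repeat for the remaining value arrays
}
```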
Sorting only 1 array (the gather indices) instead of 7 value arrays made it 2x faster. If I can find a shorter version of the custom allocator, I’ll try that too. Thank you.