I have 50000 float arrays, and each array has 4096 elements. How can I calculate each array's median on the GPU efficiently?
I have tried thrust::sort, but the performance is bad. I also tried writing a CUDA kernel in which each thread calculates one array's median, but the performance is worse than on the CPU.
If I calculate one median per block, the array length (4096) is larger than the block limit of 1024 threads, and I don't know how to implement that. Are there any other methods? Thanks!
Have you had a look at the following publication?
T. Ribizel and H. Anzt, "Parallel selection on GPUs." Parallel Computing, Vol. 91, Mar. 2020, 102588 (pre-print available online).
It seems NVIDIA researchers also have a new paper out on this:
J. Zhang, A. Naruse, X. Li, and Y. Wang, "Parallel Top-K Algorithms on GPU: A Comprehensive Study and New Methods." In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov. 2023, pp. 1-13 (open access).
Do you need the exact median or would a pseudo median suffice?
As a soundness check:
Probably just guessing the medians and testing candidate after candidate (success after trying half of them, on average) would be faster than reading the memory.
Reading the data: 50000 * 4096 * 4 bytes ≈ 800 MB
=> 2.2 ms on an RTX 3060
Guess-and-check: 50000 arrays * 4096/2 candidates (on average) * 4096 comparisons each ≈ 419 billion comparisons
=> 66 ms on an RTX 3060
Okay, not quite. But those were random guesses. If the guesses were improved, or the arrays were roughly pre-sorted into buckets to speed up the comparisons, you would be in the single-digit-ms range.
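To make the counting idea concrete, here is a minimal sketch of the rank test such a guess-and-check scheme needs (my own illustration, not code from the estimate above; the kernel name test_candidate and the launch shape are hypothetical). One block tests one candidate per array by counting how many elements fall strictly below it and how many fall at or below it; the candidate is the upper median exactly when N/2 lies between those two counts:

const int N = 4096;   // elements per array

// One block per array: for candidate value c, count elements strictly
// below c (lo) and at-or-below c (hi). c occupies sorted positions
// [lo, hi), so c is the upper median iff lo <= N/2 < hi. A full scheme
// would iterate, using the counts to pick the next candidate.
__global__ void test_candidate(const float *data, const float *cand, int *is_median)
{
    __shared__ int lo, hi;
    if (threadIdx.x == 0) lo = hi = 0;
    __syncthreads();
    const float *a = data + (size_t)blockIdx.x * N;
    float c = cand[blockIdx.x];
    int my_lo = 0, my_hi = 0;
    for (int i = threadIdx.x; i < N; i += blockDim.x) {
        my_lo += (a[i] <  c);
        my_hi += (a[i] <= c);
    }
    atomicAdd(&lo, my_lo);
    atomicAdd(&hi, my_hi);
    __syncthreads();
    if (threadIdx.x == 0) is_median[blockIdx.x] = (lo <= N/2) && (N/2 < hi);
}

Each such test reads one array once, which is where the 4096-comparisons-per-candidate term in the estimate comes from.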
My own personal preference is to use the simplest code that meets the performance goal. You haven't specified a performance goal, but we can try something that is approximately state-of-the-art for GPU-based sorting, and then use that as a baseline to decide whether something else is needed.
CUB offers a block-level radix sort that can serve as a point of comparison. Sorting 4096 elements is well within the range of what a block-level radix sort can easily accommodate. Here is a simple example:
# cat t268.cu
#include <cub/cub.cuh>
#include <iostream>
const int ns = 50000;
const int nTPB = 512;
const int nIPT = 8;
const int ss = nTPB*nIPT;
using mt = int;
__global__ void k(mt *d, mt *m)
{
  // Specialize BlockRadixSort for a 1D block of 512 threads owning 8 integer items each
  using BlockRadixSort = cub::BlockRadixSort<mt, nTPB, nIPT>;
  // Allocate shared memory for BlockRadixSort
  __shared__ typename BlockRadixSort::TempStorage temp_storage;
  // Load this block's segment of the data (striped across threads;
  // the input order does not matter ahead of a sort)
  mt thread_keys[nIPT];
  for (int i = 0; i < nIPT; i++)
    thread_keys[i] = d[blockIdx.x*ss+i*nTPB+threadIdx.x];
  // Collectively sort the keys
  BlockRadixSort(temp_storage).Sort(thread_keys);
  // Sort() leaves a blocked arrangement: thread t holds sorted elements
  // t*nIPT .. t*nIPT+nIPT-1, so thread nTPB/2 holds element 2048 as its
  // first item -- the upper median of the 4096 elements
  if (threadIdx.x == nTPB>>1) m[blockIdx.x] = thread_keys[0];
}
int main(){
  mt *d_d, *h_d, *d_m, *h_m;
  h_d = new mt[ss*ns];
  h_m = new mt[ns];
  cudaMalloc(&d_d, sizeof(mt)*ss*ns);
  cudaMalloc(&d_m, sizeof(mt)*ns);
  // fill each array with ss..1 in descending order
  for (int i = 0; i < ns; i++)
    for (int j = 0; j < ss; j++) h_d[i*ss+j] = ss-j;
  cudaMemcpy(d_d, h_d, sizeof(mt)*ss*ns, cudaMemcpyHostToDevice);
  k<<<ns, nTPB>>>(d_d, d_m);  // warm-up run
  cudaMemcpy(d_d, h_d, sizeof(mt)*ss*ns, cudaMemcpyHostToDevice);
  k<<<ns, nTPB>>>(d_d, d_m);  // timed run
  cudaMemcpy(h_m, d_m, sizeof(mt)*ns, cudaMemcpyDeviceToHost);
  for (int i = 0; i < 10; i++)
    std::cout << h_m[i] << std::endl;
}
# nvcc -o t268 t268.cu -arch=sm_89
# nsys nvprof --print-gpu-trace ./t268
WARNING: t268 and any of its children processes will be profiled.
2049
2049
2049
2049
2049
2049
2049
2049
2049
2049
Generating '/tmp/nsys-report-9815.qdstrm'
[1/3] [========================100%] report3.nsys-rep
[2/3] [========================100%] report3.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report
Start (ns) Duration (ns) CorrId GrdX GrdY GrdZ BlkX BlkY BlkZ Reg/Trd StcSMem (MB) DymSMem (MB) Bytes (MB) Throughput (MBps) SrcMemKd DstMemKd Device Ctx Strm Name
------------- ------------- ------ ------ ---- ---- ---- ---- ---- ------- ------------ ------------ ---------- ----------------- -------- -------- ------------- --- ---- ------------------
1,748,644,144 225,484,863 120 819.200 3,276.800 Pageable Device NVIDIA L4 (0) 1 7 [CUDA memcpy HtoD]
1,974,722,736 10,052,811 121 50,000 1 1 512 1 1 55 0.019 0.000 NVIDIA L4 (0) 1 7 k(int *, int *)
1,984,957,371 337,949,599 122 819.200 1,638.400 Pageable Device NVIDIA L4 (0) 1 7 [CUDA memcpy HtoD]
2,322,915,898 10,053,771 123 50,000 1 1 512 1 1 55 0.019 0.000 NVIDIA L4 (0) 1 7 k(int *, int *)
2,332,974,405 29,184 124 0.200 6,853.000 Device Pageable NVIDIA L4 (0) 1 7 [CUDA memcpy DtoH]
Generated:
/root/bobc/report3.nsys-rep
/root/bobc/report3.sqlite
#
So this simplistic brute-force approach (sorting all the data) takes about 10 ms on this L4 GPU.
My L4 GPU has a peak theoretical memory bandwidth of 300 GB/s; the maximum measurable bandwidth in practice is in the range of 220-250 GB/s, depending on measurement method. If we take the 250 GB/s figure and use it to estimate how many global read and write passes an algorithm could accommodate (an analysis approach suggested by one of the papers that njuffa linked), then 10 ms corresponds to approximately 2-4 global read or write operations per element of your data set:
50000 * 4096 * 4 bytes ≈ 800 MB of data
800 MB / 250 GB/s = 3.2 ms for one read or one write pass over that data set on the L4
Therefore we can lower-bound the best imaginable algorithm at 3.2 ms on my L4 GPU. (You should not expect to achieve an exact result without reading all the data at least once.) If you allow for a small number of additional passes over the data, then 10 ms is a fairly good number (my own opinion). Given that what I am doing is a sort, which must both read and write all the data (and so is not the best imaginable algorithm), we can lower-bound the sort at 2 × 3.2 = 6.4 ms. In that light, the 10 ms number looks even better.
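One adaptation note, since your data are float rather than int: cub::BlockRadixSort sorts floating-point keys directly (it applies an order-preserving bit transform internally), so only the key type and the median extraction need to change. Here is my own untested sketch (not part of the measured run above) that also returns the conventional even-length median, the mean of the two middle elements; after the blocked sort those land in different threads, so they are staged through shared memory:

#include <cub/cub.cuh>

const int nTPB = 512;
const int nIPT = 8;            // 512 * 8 = 4096 elements per array
const int ss = nTPB*nIPT;

__global__ void k_float(const float *d, float *m)
{
  using BlockRadixSort = cub::BlockRadixSort<float, nTPB, nIPT>;
  __shared__ typename BlockRadixSort::TempStorage temp_storage;
  __shared__ float mid[2];
  float thread_keys[nIPT];
  for (int i = 0; i < nIPT; i++)
    thread_keys[i] = d[blockIdx.x*ss+i*nTPB+threadIdx.x];
  BlockRadixSort(temp_storage).Sort(thread_keys);
  // blocked arrangement: thread 255 ends with sorted element 2047,
  // thread 256 starts with sorted element 2048
  if (threadIdx.x == nTPB/2 - 1) mid[0] = thread_keys[nIPT-1];
  if (threadIdx.x == nTPB/2)     mid[1] = thread_keys[0];
  __syncthreads();
  if (threadIdx.x == 0) m[blockIdx.x] = 0.5f*(mid[0]+mid[1]);
}

Launched the same way as the int version (k_float<<<ns, nTPB>>>(d_d, d_m);), the memory traffic is identical, so I would expect timing in the same ballpark as the 10 ms measured above, though I have not profiled this variant.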