CUDA: calculate the median of 4096-element arrays

I have 50000 float arrays, and each array has 4096 elements. How can I calculate each array's median efficiently on the GPU?
I have tried thrust::sort, but the performance is bad. I have also tried writing a CUDA kernel in which each thread computes one array's median, but the performance is worse than on the CPU.
If I compute one median per block, the array length (4096) exceeds the 1024-thread-per-block limit, and I don't know how to implement it. Are there any other methods? Thanks!

Have you had a look at the following publication?

T. Ribizel and H. Anzt, “Parallel selection on GPUs,” Parallel Computing, Vol. 91, Mar. 2020, 102588 (pre-print online).

It seems NVIDIA researchers also have a new paper out on this:

J. Zhang, A. Naruse, X. Li, and Y. Wang, “Parallel Top-K Algorithms on GPU: A Comprehensive Study and New Methods,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov. 2023, pp. 1-13 (open access).

Do you need the exact median or would a pseudo median suffice?
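
For reference, one common pseudo-median is the “ninther”: sample nine elements and take the median of three medians-of-three. A minimal host-side sketch of that idea (shown only to illustrate what a pseudo-median could look like; the function names and sample positions are mine):

#include <algorithm>

// median of three values, branch-free
static float median3(float a, float b, float c)
{
    return std::max(std::min(a, b), std::min(c, std::max(a, b)));
}

// "ninther" pseudo-median: nine evenly spaced samples, median of
// the three medians-of-three (n = 4096 for the arrays in question)
float ninther(const float *x, int n)
{
    float m1 = median3(x[0],     x[n/8],   x[n/4]);
    float m2 = median3(x[3*n/8], x[n/2],   x[5*n/8]);
    float m3 = median3(x[3*n/4], x[7*n/8], x[n-1]);
    return median3(m1, m2, m3);
}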

As a sanity check:
Probably just guessing the medians and trying all candidate guesses (success, on average, after half of them) would be faster than reading the memory.

50000 * 4096 * 4 bytes = 800 MB => 2.2 ms on an RTX 3060 (at roughly its 360 GB/s theoretical bandwidth)
50000 * 4096 * 4096/2 = 419 billion comparisons => 66 ms on an RTX 3060

Okay, not quite. But those were random guesses. If the guesses were improved, or the arrays were roughly pre-sorted into buckets to speed up the comparisons, you would be in the single-digit-millisecond range.
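
To illustrate what verifying a guess could look like: pick one pivot per array and count, in a single pass, how many elements fall below it and how many equal it; the pivot is the (upper) median exactly when below <= n/2 < below + equal. A hypothetical sketch (names are mine; the counters are assumed to be zeroed before launch):

// one block per array of n elements: count elements below / equal to
// that array's guessed pivot
__global__ void check_guess(const float *d, const float *guess,
                            int *below, int *equal, int n)
{
    const float *a = d + (size_t)blockIdx.x * n;
    float g = guess[blockIdx.x];
    int nb = 0, ne = 0;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        nb += (a[i] <  g);
        ne += (a[i] == g);
    }
    // accumulate per-thread counts (below/equal must start at zero)
    atomicAdd(&below[blockIdx.x], nb);
    atomicAdd(&equal[blockIdx.x], ne);
}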

My own personal preference would be to use the simplest code that meets the performance goal. You haven't specified a performance goal, but we can try something that is approximately state-of-the-art for GPU-based sorting, and then use that as a baseline to decide whether something else is needed.

CUB offers a block-level radix sort that can serve as a point of comparison. Sorting 4096 elements is well within the range of what a block-level radix sort can easily accommodate. Here is a simple example:

# cat t268.cu
#include <cub/cub.cuh>
#include <iostream>

const int ns = 50000;      // number of arrays
const int nTPB = 512;      // threads per block
const int nIPT = 8;        // items (keys) per thread
const int ss = nTPB*nIPT;  // segment (array) size: 4096
using mt = int;            // the question uses float; cub::BlockRadixSort handles float keys too
__global__ void k(mt *d, mt *m)
{
    // Specialize BlockRadixSort for a 1D block of 512 threads owning 8 integer items each
    using BlockRadixSort = cub::BlockRadixSort<mt, nTPB, nIPT>;

    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Load this block's segment of ss keys, striped across threads
    // (input order is irrelevant, since we are about to sort)
    mt thread_keys[nIPT];
    for (int i = 0; i < nIPT; i++)
      thread_keys[i] = d[blockIdx.x*ss+i*nTPB+threadIdx.x];
    // Collectively sort the keys; the result is in a blocked arrangement:
    // thread t holds ranks [t*nIPT, (t+1)*nIPT)
    BlockRadixSort(temp_storage).Sort(thread_keys);
    // rank ss/2 = 2048 is item 0 of thread nTPB>>1 = 256: the upper median
    if (threadIdx.x == nTPB>>1) m[blockIdx.x] = thread_keys[0];
}

int main(){

  mt *d_d, *h_d, *d_m, *h_m;
  h_d = new mt[ss*ns];
  h_m = new mt[ns];
  cudaMalloc(&d_d, sizeof(mt)*ss*ns);
  cudaMalloc(&d_m, sizeof(mt)*ns);
  // fill each array with the descending sequence ss..1, so the
  // (upper) median of each sorted array is ss/2 + 1 = 2049
  for (int i = 0; i < ns; i++)
    for (int j = 0; j < ss; j++) h_d[i*ss+j] = ss-j;
  cudaMemcpy(d_d, h_d, sizeof(mt)*ss*ns, cudaMemcpyHostToDevice);
  k<<<ns, nTPB>>>(d_d, d_m);
  // copy and launch a second time, so the profiler also shows a warmed-up run
  cudaMemcpy(d_d, h_d, sizeof(mt)*ss*ns, cudaMemcpyHostToDevice);
  k<<<ns, nTPB>>>(d_d, d_m);
  cudaMemcpy(h_m, d_m, sizeof(mt)*ns, cudaMemcpyDeviceToHost);
  for (int i = 0; i < 10; i++)
    std::cout << h_m[i] << std::endl;
}
# nvcc -o t268 t268.cu -arch=sm_89
# nsys nvprof --print-gpu-trace ./t268
WARNING: t268 and any of its children processes will be profiled.

2049
2049
2049
2049
2049
2049
2049
2049
2049
2049
Generating '/tmp/nsys-report-9815.qdstrm'
[1/3] [========================100%] report3.nsys-rep
[2/3] [========================100%] report3.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report

  Start (ns)    Duration (ns)  CorrId   GrdX   GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd     Device      Ctx  Strm         Name
 -------------  -------------  ------  ------  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  -------------  ---  ----  ------------------
 1,748,644,144    225,484,863     120                                                                                819.200          3,276.800  Pageable  Device    NVIDIA L4 (0)    1     7  [CUDA memcpy HtoD]
 1,974,722,736     10,052,811     121  50,000     1     1   512     1     1       55         0.019         0.000                                                     NVIDIA L4 (0)    1     7  k(int *, int *)
 1,984,957,371    337,949,599     122                                                                                819.200          1,638.400  Pageable  Device    NVIDIA L4 (0)    1     7  [CUDA memcpy HtoD]
 2,322,915,898     10,053,771     123  50,000     1     1   512     1     1       55         0.019         0.000                                                     NVIDIA L4 (0)    1     7  k(int *, int *)
 2,332,974,405         29,184     124                                                                                  0.200          6,853.000  Device    Pageable  NVIDIA L4 (0)    1     7  [CUDA memcpy DtoH]

Generated:
    /root/bobc/report3.nsys-rep
    /root/bobc/report3.sqlite
#

So this simplistic brute-force approach (sorting all the data) requires about 10 ms on this L4 GPU.

My L4 GPU has a peak theoretical memory bandwidth of 300 GB/s. In practice, the maximum measurable memory bandwidth is in the range of 220-250 GB/s, depending on the measurement method. If we take the 250 GB/s figure and use it to estimate how many global read and write passes the algorithm could afford (an analysis approach suggested by one of the papers that njuffa linked), the measured 10 ms corresponds to approximately 2-4 global read or write operations per element of your data set:

50000 * 4096 * 4 bytes = 800 MB of data
800 MB / 250 GB/s = 3.2 ms for one read or one write pass over that data set on the L4

Therefore we could lower-bound the best imaginable algorithm at 3.2 ms on my L4 GPU. (You should not expect to achieve an exact result without reading all the data at least once.) If you allow for a small number of additional operations on the data, then 10 ms is a fairly good number (my own opinion). Given that what I am doing is a full sort (not the best imaginable algorithm), and a sort must at minimum read and write the data once each, we could lower-bound the sort at 2 × 3.2 ms = 6.4 ms. In that light, the 10 ms number looks even better.
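
If you wanted to chase that 3.2 ms bound, the bucket idea mentioned earlier in the thread (and essentially what the Ribizel/Anzt selection paper develops much further) suggests a per-block radix select: 4096 floats fit in 16 KB of shared memory, so each element needs to be read from global memory only once, and the selection can then run over shared memory 8 bits at a time. A sketch of the idea (untested; whether the shared-memory atomics let it beat the CUB sort in practice would need measuring):

// order-preserving map from float to unsigned (and back), so unsigned
// comparisons of mapped values agree with float comparisons
__device__ unsigned fmap(float f)
{
    unsigned u = __float_as_uint(f);
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}
__device__ float funmap(unsigned v)
{
    return __uint_as_float((v & 0x80000000u) ? (v ^ 0x80000000u) : ~v);
}

// one block per N-element array; four 8-bit radix-select passes over
// shared memory locate the key of rank N/2 (the upper median)
template <int N>
__global__ void radix_select_median(const float *d, float *med)
{
    __shared__ unsigned keys[N];   // 16 KB for N = 4096
    __shared__ int hist[256];

    const float *in = d + (size_t)blockIdx.x * N;
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        keys[i] = fmap(in[i]);     // the only global read of the data
    __syncthreads();

    unsigned prefix = 0, mask = 0; // high-order bits pinned down so far
    int k = N / 2;                 // 0-based target rank (upper median)
    for (int shift = 24; shift >= 0; shift -= 8) {
        for (int i = threadIdx.x; i < 256; i += blockDim.x) hist[i] = 0;
        __syncthreads();
        // histogram the next 8 bits of the keys still in the running
        for (int i = threadIdx.x; i < N; i += blockDim.x)
            if ((keys[i] & mask) == prefix)
                atomicAdd(&hist[(keys[i] >> shift) & 0xff], 1);
        __syncthreads();
        // every thread scans the 256 bins identically: find the bin
        // containing rank k, then rebase k within that bin
        int run = 0, b = 0;
        while (run + hist[b] <= k) run += hist[b++];
        k -= run;
        prefix |= (unsigned)b << shift;
        mask   |= 0xffu << shift;
        __syncthreads();           // hist is reused in the next pass
    }
    // after four passes, prefix holds the full bit pattern of the rank-k key
    if (threadIdx.x == 0) med[blockIdx.x] = funmap(prefix);
}
// launch (float data): radix_select_median<4096><<<ns, 256>>>(d_d, d_m);

That is one 800 MB read plus a 200 KB write of results, i.e. close to the 3.2 ms floor, provided the shared-memory work can be hidden behind the global loads.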