I think I came up with an idea for a really fast 256-bin histogram implementation that doesn't depend on atomics speed or on the frequency distribution of the input. I believe it should work at memory speed on almost any SM 2.0+ GPU. Please review this design:
- Since histogramming allows a lot of ILP, sacrifice TLP. Allocate
shared byte freq[32*256]
per warp and give each lane its own set of 256 counters. This means 8 KB of shared memory per warp, and only 6-12 warps/SM, which is fine.
- Completely avoid bank conflicts: lane N owns freq elements only in bank N. So the code may look like
// x is the input byte
idx = (x&3) + (x/4)*128 + LANEID*4
freq[idx]++
- Now we see that most of the time is spent computing idx. Fix that by using a thread block of 4 warps, so that warp N owns byte N of each 32-bit word in freq:
shared byte freq[128*256]
base_idx = (threadIdx.x%32)*4 + threadIdx.x/32  // avoid bank conflicts inside a warp
idx = x*128 + base_idx
freq[idx]++
So, finally, we have 3-4 arithmetic operations per byte (shift, and+or, inc), plus 1.25 loads and 1 store. NVIDIA cards usually provide 5-10 IOPs and ~2 LD/ST operations per byte of memory bandwidth, so it should run at memory speed.
As noted above, we will have little or no TLP (2/1/2/3/4 warps per scheduler on SM 2.x/3.x/5.0/5.2/6.0, respectively), so the code should preload data from global memory and compute idx for, say, 16 bytes simultaneously. This should be doable even on older GPUs with the 63 regs/thread limit.
Of course, these 8-bit counters also need to be added to wider ones before they overflow (each counter wraps after 255 increments). One possible solution is to sum the 128 freq entries of each bin locally and then atomic_add them to global_freq:
for (i=threadIdx.x; i<256; i+=blockDim.x) {
  sum = 0
  for (j=base_idx; j<128+base_idx; j++)  // instead of simply j=0..127, in order to avoid bank conflicts
    sum += freq[i*128 + (j%128)]
  atomic_add(global_freq[i], sum)
}
The real code will be faster still by reading the input as 4-byte words.