I think I've come up with a really fast 256-bin histogram implementation that doesn't depend on atomics speed or on the input's frequency distribution. I believe it should work at memory speed on almost any SM 2.0+ GPU. Please review this design:

- Since histogramming allows a lot of ILP, sacrifice TLP. Allocate

shared byte freq[32*256]

per warp and give each lane its own set of 256 counters. This means 8 KB of shared memory per warp, and only 6-12 warps/SM, which is fine.

- Completely avoid bank conflicts: lane N owns freq elements only in bank N, so the code may look like

```
// x is the input byte; LANEID is the lane index within the warp
idx = (x&3) + (x/4)*128 + LANEID*4  // idx/4 % 32 == LANEID, so lane N touches only bank N
freq[idx]++
```

- Now most of the time is spent computing idx. Fix that by using a thread block of 4 warps, so that warp N owns byte N of each 32-bit word of freq:

```
shared byte freq[128*256]
base_idx = (threadIdx.x%32)*4 + threadIdx.x/32 // avoid bank conflicts inside warp
idx = x*128 + base_idx
freq[idx]++
```

So, finally, we have 3-4 arithmetic operations per byte (shift, and+or, increment), plus 1.25 loads (one shared load plus a quarter of a 4-byte global load) and 1 store. NVIDIA cards usually provide 5-10 IOPs and ~2 LD/ST operations per byte of memory bandwidth, so it should run at memory speed.

As noted above, we will have little or no TLP (2/1/2/3/4 warps/scheduler on SM 2.x/3.x/5.0/5.2/6.0, respectively), so the code should preload data from global memory and compute idx for, say, 16 bytes at a time. This should be doable even on older GPUs with their 63-registers-per-thread limit.

Of course, these 8-bit counters also need to be merged into wider ones, and this must happen before they can overflow, i.e. after each thread has processed at most 255 bytes. One possible solution is to sum, for each bin, its 128 per-thread freq entries locally and then atomicAdd the result to global_freq:

```
for (i = threadIdx.x; i < 256; i += blockDim.x) {
    sum = 0;
    for (j = base_idx; j < 128+base_idx; j++)  // start at base_idx instead of 0..127 to avoid bank conflicts
        sum += freq[i*128 + (j%128)];
    atomicAdd(&global_freq[i], sum);
}
```

The real code will be faster still by processing 4-byte words instead of single bytes.