Fast 256-bin histogram

I think i came with idea of really fast 256-bin histogram implementation that doesn’t depend on atomics speed or frequency distribution. I believe it should work at memory speed on almost any SM 2.0+ GPU. Please review this design:

  1. Since histogramming allows to make a lot of ILP, sacrifice TLP. Alloc

shared byte freq[32*256]

per warp and give to each lane its own set of 256 counters. This means 8 KB of shared memory per warp, and only 6-12 warps/SM. fine

  1. Comletely avoid bank conflicts - lane N owns freq elements only in bank N. So the code may look like
// x is input byte
idx = (x&3) + (x/4)*128 + LANEID*4
  1. Now we see that most time is spent in calculation of idx. Fix that by using thread block of 4 warps, so warp N owns byte N of each 32-bit word in the freq:
shared byte freq[128*256]
base_idx = (threadIdx.x%32)*4 + threadIdx.x/32   // avoid bank conflicts inside warp

idx = x*128 + base_idx

So, finally we have 3-4 arithmetic operations per byte (shift, and+or, inc), plus 1.25 loads and 1 store. NVidia cards usually provide 5-10 IOPs and ~2 LD/ST ops per 1 byte of memory bandwidth, so it should run at memory speeds.

As it was already said, we will have little or no TLP (2/1/2/3/4 warps/sheduler on SM 2.x/3.x/5.0/5.2/6.0, correspondingly), so the code should preload data from global memory and compute idx, say, for 16 bytes simultaneously. This should be doable even on older GPUs witth 63 regs/thread limit.

Of course, these 8-bit counters also need to be added to wider ones. One possible solution is to sum 128 freq entries locally and then atomic_add them to global_freq:

for (i=threadIdx.x; i<256; i+=threadBlk.idx)
  for (j=base_idx; j<128+base_idx; j++)  // instead of simple j=0..127 in order to avoid bank conflicts
    sum += freq[i*128 + (j%128)]
  atomic_add(global_freq[i], sum)

The real code will be faster by dealing with 4-byte words

Thanks for posting this with code.

128*256=32768 bytes of shared memory per thread block will limit occupancy, particularly on Kepler or the proposed Pascal generation with less available shared memory.

In my experience for Maxwell atomicAdd() operations on shared memory 32-bit words are very fast and I wonder how that simple implementation would compare to your approach?
If I can find the time this will be interesting to test and compare. Maybe there is some ‘middle ground’ which will require less shared memory so that more thread blocks can be launched per SM.

I wrote up a quick naive very simple un-optimized 1024 bin histogram which only uses atomics on both shared and global memory without consideration of bank conflicts(hard-coded for 512 threads and a bin of size 1024);

For an input array of size 134,217,728 with 1024 histogram bins it takes 1.7 ms for the kernel but 44 ms to copy over the host memory to the device memory.

The main issue with a kernel like this is the global memory load pattern. NVVP profiling shows that the global memory bandwidth hits 305 GBs which is over 90% of theoretical for the Titan X.

Profiling also reveals that there are bank conflicts, but not enough to really make a huge difference.

The kernel itself profiles like this;

==5412== NVPROF is profiling process 5412, command: ConsoleApplication1.exe

CPU 4.5 GHz time=64.000000

Results GPU vs. CPU match!
==5412== Profiling application: ConsoleApplication1.exe
==5412== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
1.93179s  44.086ms                    -               -         -         -         -  512.00MB  11.342GB/s  GeForce GTX TIT         1         7  [CUDA memcpy HtoD]
1.97592s  3.6480us                    -               -         -         -         -  4.0000KB  1.0457GB/s  GeForce GTX TIT         1         7  [CUDA memset]
1.97595s  1.7585ms             (96 1 1)       (512 1 1)        13  4.0039KB        0B         -           -  GeForce GTX TIT         1         7  hist_kernel(int4 const *, int*, int) [187]
1.97772s  2.0160us                    -               -         -         -         -  4.0000KB  1.8922GB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

I am sure there are much more optimized versions out there, but this implementation is very simple and easy to understand. The CPU version takes about 64 ms so if you consider the full process only slightly faster than an CPU implementation.

I wrote this up very quickly so let me know if there are bugs or if there is a much better implementation. This demo is hardcoded for an input array of a size which is a multiple of 4, but there are code examples out there which show how to handle odd array sizes while still being able to vectorize the memory loads;

What sort of utilization are you guys getting?

I happen to have some very fast histogram generation codes hidden in my drawers… So I’m curious :)

NVVP screenshot for memory utilization(which is classified as ‘Max’);

NVVP for this input set shows 10,615,808 bank conflicts with utilization classified as ‘Mid(6)’.

CudaaduC, thank you for sharing the code. you reached mem speed but it was easier goal:

  1. you processed 4-byte ints, so you need to make 4x less alu operaions per each byte read. please try 256-bin histogram of bytes

  2. random numbers makes less bank conflicts, but real data aren’t random. try your code on all-zeros. for maxwell, you can use this code to avoid bank conflicts:

shared int freq[256*32]

n = x*32 + LANEID  // x is input byte

and run 1024 threads per block

  1. well, the scheme i’ve described needs more work to implement and useful only on pre-maxwell gpus. so i just asking guys to criticize it. is it worth the work?

[Apologies if I am missing something fundamental - I haven’t been thinking about histogramming for a long while.]

I don’t think byte-sized frequencies are the best option - for every 255 values processed within a thread you will have to empty 256 bins, so this stage does not reduce the amount of data passed to the next stage.

Using 16-bit frequencies reduces TLP even more by allowing only 2 warps to run, but at least it gets useful work done.