Hey guys,
I have the following problem. I’m trying to implement a histogram on CUDA and tried to improve my naive, global memory version, to a faster one making use of shared memory. Both implementations are working but somehow the shared memory one is giving me much worse performance. I’m using CUDA 8 and a Titan X (Pascal). Ẃhat I’m trying to implement is as follows:
- An image consisting of pixels is divided into CELLS of 8x8 pixels
- Each CELL has its own histogram of 18 BINS corresponding to 18 gradient directions, these directions are sp floats
- Each pixel in a cell contributes to the histogram of the cell its in, as well as to the cell on the right (x+1), the cell beneath it (y+1) and the cell diagonally to the bottomright (x+1,y+1).
The “Global version” makes use of atomicAdds in global memory
//To calculate the cell coordinates from the pixel coordinates
float yp = (y+0.5f)/cell_size - 0.5f;
int iyp = (int)std::floor(yp);
float xp = (x+0.5f)/cell_size + 0.5f;
int ixp = int(xp);
/////////////////////////////////////////////////////////////////////
//Some calculations for v11, v01, v10 and v00; the values to be added
/////////////////////////////////////////////////////////////////////
atomicAdd(&(dev_hist[(((iyp + 1)*width_cells_hist)+ixp)*no_bins+best_o]),v11);
atomicAdd(&(dev_hist[(((iyp + 1 + 1)*width_cells_hist)+ixp)*no_bins+best_o]),v01);
atomicAdd(&(dev_hist[(((iyp + 1)*width_cells_hist)+(ixp+1))*no_bins+best_o]),v10);
atomicAdd(&(dev_hist[(((iyp + 1 + 1)*width_cells_hist)+(ixp+1))*no_bins+best_o]),v00);
My blocksize is (32,32), so this results in 1024*4 = 4096 global atomic adds per block. Total kerneltime in NVVP is 5.794ms.
Now for my (first) shared memory implementation, the neighbouring cells in a block can do much of the atomic adds in shared memory. Each (32x32) block consists of 16 (4x4) CELLS of (8x8) pixels. So for every block I allocate space for the histograms of (4+1)x(4+1) cells, +1 because the bottom and most right cells also have to contribute to the cells beneath of and right to them. Then these 5x5*NO_BINS histograms per block are “glued” together in global memory with global atomic add.
//Calculate coordinates in "local" 5x5 cell area
float yp_sm = (threadIdx.y+1+0.5f)/cell_size - 0.5f;
int iyp_sm = (int)std::floor(yp_sm);
float xp_sm = (threadIdx.x+1+0.5f)/cell_size + 0.5f;
int ixp_sm = int(xp_sm);
int t = threadIdx.x + threadIdx.y * blockDim.x;
int nt = blockDim.x * blockDim.y;
/////////////////////////////////////////////////////////////////////
//Some calculations for v11, v01, v10 and v00; the values to be added
/////////////////////////////////////////////////////////////////////
int width_sm = std::ceil(blockDim.x/cell_size+1); //=5
int height_sm = std::ceil(blockDim.y/cell_size+1); //=5
__shared__ float smem[NO_BINS*5*5];
for (int i = t; i < no_bins*width_sm*height_sm; i += nt)
{
smem[i] = 0;
}
__syncthreads();
atomicAdd(&smem[(((iyp_sm+1))*width_sm+(ixp_sm))*no_bins+best_o], v11_sm);
atomicAdd(&smem[(((iyp_sm+1+1))*width_sm+(ixp_sm))*no_bins+best_o], v01_sm);
atomicAdd(&smem[(((iyp_sm+1))*width_sm+((ixp_sm+1)))*no_bins+best_o], v10_sm);
atomicAdd(&smem[(((iyp_sm+1+1))*width_sm+((ixp_sm+1)))*no_bins+best_o], v00_sm);
/////////////////////////////////////////////////////////////////////
//Some calculations to determine "starting point" coordinates in global memory
/////////////////////////////////////////////////////////////////////
__syncthreads();
if(t<no_bins*width_sm*height_sm)
{
int row = floor((float)t/(float)(no_bins*width_sm));
int column = t%(no_bins*height_sm);
int write_location = starting_point + row*width_cells_hist*no_bins+column;
atomicAdd(&dev_hist[write_location],smem[t]);
}
This implementation results in 10244 = 4096 shard memory atomic adds per block, and reduces the amount of global atomic adds to no_binswidth_smheight_sm = 185*5 = 450.
While this works, performance is much worse: 10.988ms kerneltime in NVVP. I understand that there are many atomic “conflicts”, since 4 groups of 8x8 pixels can write to each 18 bin histogram, but this is no different for the global version. Also NVVP report a shared memory efficiency of 1.2%. The access of each thread to the shared memory is pretty much random, so I understand that there are some bank conflicts but not why the number is THAT low.
When I change the atomicAdds in shared memory to normal “+=” operations (resulting in false outcomes), the shared memory efficiency increases to 56.5%, a number that I makes more sense to me. Kernel time decreases to 2.911ms.
So what I don’t understand is why the global memory version is so much faster than the shared memory one, while they’re both dealing with the same “atomic conflicts”. And why the shared memory efficiency is this low. Is this because of the different implemenations of atomic operations, as stated in “The CUDA handbook”:
" Unlike global memory, which implements atomics using single instructions (either GATOM or GRED, depending on whether the return value is used), shared memory atomics are implemented with explicit lock/unlock semantics, and the compiler emits code that causes each thread to loop over these lock operations untilthe thread has performed its atomic operation."?
Thanks for helping me in advance. Sorry if I left anything out, this is my first CUDA forum post, so please let me know if information is missing.