Worse atomic performance in shared than global memory

Hey guys,

I have the following problem. I’m trying to implement a histogram on CUDA and tried to improve my naive, global memory version, to a faster one making use of shared memory. Both implementations are working but somehow the shared memory one is giving me much worse performance. I’m using CUDA 8 and a Titan X (Pascal). Ẃhat I’m trying to implement is as follows:

  • An image consisting of pixels is divided into CELLS of 8x8 pixels
  • Each CELL has its own histogram of 18 BINS corresponding to 18 gradient directions, these directions are sp floats
  • Each pixel in a cell contributes to the histogram of the cell its in, as well as to the cell on the right (x+1), the cell beneath it (y+1) and the cell diagonally to the bottomright (x+1,y+1).

The “Global version” makes use of atomicAdds in global memory

//To calculate the cell coordinates from the pixel coordinates
  float yp = (y+0.5f)/cell_size - 0.5f;
  int iyp = (int)std::floor(yp);

  float xp  = (x+0.5f)/cell_size + 0.5f;
  int   ixp = int(xp);

  //Some calculations for v11, v01, v10 and v00; the values to be added

  atomicAdd(&(dev_hist[(((iyp + 1)*width_cells_hist)+ixp)*no_bins+best_o]),v11);
  atomicAdd(&(dev_hist[(((iyp + 1 + 1)*width_cells_hist)+ixp)*no_bins+best_o]),v01);
  atomicAdd(&(dev_hist[(((iyp + 1)*width_cells_hist)+(ixp+1))*no_bins+best_o]),v10);
  atomicAdd(&(dev_hist[(((iyp + 1 + 1)*width_cells_hist)+(ixp+1))*no_bins+best_o]),v00);

My blocksize is (32,32), so this results in 1024*4 = 4096 global atomic adds per block. Total kerneltime in NVVP is 5.794ms.

Now for my (first) shared memory implementation, the neighbouring cells in a block can do much of the atomic adds in shared memory. Each (32x32) block consists of 16 (4x4) CELLS of (8x8) pixels. So for every block I allocate space for the histograms of (4+1)x(4+1) cells, +1 because the bottom and most right cells also have to contribute to the cells beneath of and right to them. Then these 5x5*NO_BINS histograms per block are “glued” together in global memory with global atomic add.

//Calculate coordinates in "local" 5x5 cell area
float yp_sm = (threadIdx.y+1+0.5f)/cell_size - 0.5f;
int iyp_sm = (int)std::floor(yp_sm);

float xp_sm  = (threadIdx.x+1+0.5f)/cell_size + 0.5f;
int   ixp_sm = int(xp_sm);

int t = threadIdx.x + threadIdx.y * blockDim.x;
int nt = blockDim.x * blockDim.y;

//Some calculations for v11, v01, v10 and v00; the values to be added

int width_sm = std::ceil(blockDim.x/cell_size+1); //=5
int height_sm = std::ceil(blockDim.y/cell_size+1); //=5

__shared__ float smem[NO_BINS*5*5];
for (int i = t; i < no_bins*width_sm*height_sm; i += nt)
	smem[i] = 0;

atomicAdd(&smem[(((iyp_sm+1))*width_sm+(ixp_sm))*no_bins+best_o], v11_sm);
atomicAdd(&smem[(((iyp_sm+1+1))*width_sm+(ixp_sm))*no_bins+best_o], v01_sm);
atomicAdd(&smem[(((iyp_sm+1))*width_sm+((ixp_sm+1)))*no_bins+best_o], v10_sm);
atomicAdd(&smem[(((iyp_sm+1+1))*width_sm+((ixp_sm+1)))*no_bins+best_o], v00_sm);

//Some calculations to determine "starting point" coordinates in global memory 


	int row = floor((float)t/(float)(no_bins*width_sm));
	int column = t%(no_bins*height_sm);
	int write_location = starting_point + row*width_cells_hist*no_bins+column;

This implementation results in 10244 = 4096 shard memory atomic adds per block, and reduces the amount of global atomic adds to no_binswidth_smheight_sm = 185*5 = 450.

While this works, performance is much worse: 10.988ms kerneltime in NVVP. I understand that there are many atomic “conflicts”, since 4 groups of 8x8 pixels can write to each 18 bin histogram, but this is no different for the global version. Also NVVP report a shared memory efficiency of 1.2%. The access of each thread to the shared memory is pretty much random, so I understand that there are some bank conflicts but not why the number is THAT low.

When I change the atomicAdds in shared memory to normal “+=” operations (resulting in false outcomes), the shared memory efficiency increases to 56.5%, a number that I makes more sense to me. Kernel time decreases to 2.911ms.

So what I don’t understand is why the global memory version is so much faster than the shared memory one, while they’re both dealing with the same “atomic conflicts”. And why the shared memory efficiency is this low. Is this because of the different implemenations of atomic operations, as stated in “The CUDA handbook”:

" Unlike global memory, which implements atomics using single instructions (either GATOM or GRED, depending on whether the return value is used), shared memory atomics are implemented with explicit lock/unlock semantics, and the compiler emits code that causes each thread to loop over these lock operations untilthe thread has performed its atomic operation."?

Thanks for helping me in advance. Sorry if I left anything out, this is my first CUDA forum post, so please let me know if information is missing.

After some more digging I found this thread:

Which describes the same problem, and refers to the same part of the cuda handbook as the explanation.

https://devblogs.nvidia.com/parallelforall/inside-pascal/ says the following about atomics on Pascal:

"Atomic memory operations are important in parallel programming, allowing concurrent threads to correctly perform read-modify-write operations on shared data. Kepler significantly increased the throughput of atomic operations to global memory compared to the earlier Fermi architecture; however, both Fermi and Kepler implemented shared memory atomics using an expensive lock/update/unlock pattern.

Maxwell improved this by implementing native hardware support for shared memory atomic operations for 32-bit integers, and native shared memory 32-bit and 64-bit compare-and-swap (CAS), which can be used to implement other atomic functions with reduced overhead (compared to the Fermi and Kepler methods which were implemented in software)."

This indicated that there is indeed still no support for native hardware atomicAdd for floats in shared memory. As a solution I tried an atomicAdd implementation using the supported CAS (like in https://www.sharcnet.ca/help/index.php/CUDA_tips_and_tricks):

__device__ inline void atomicAdd_sm(float *addr, float value){
    int* addr_as_int = (int*)addr;
    int old = *addr_as_int, assumed;
        assumed = old;
        old = atomicCAS((unsigned int*)addr, assumed, __float_as_int(value+__int_as_float(assumed)));
    }while( old!=assumed );

And one using atomicExch (like in https://devtalk.nvidia.com/default/topic/458062/atomicadd-float-float-atomicmul-float-float-/):

__device__ inline float atomicAdd_sm(float* address, float value)

  float old = value;  
  while ((old = atomicExch(address, atomicExch(address, 0.0f)+old))!=0.0f);

While both provided an increase in speed, they still were far from the global memory atomics performance. Does anyone has a better idea how I can speed up this histogramization process?

maxwell and pascal introduced native shared mem atomics, but only for specific operations (e.g. 32-bit int atomicAdd):


any chance you could rework your histogramming to use 32-bit integer, rather than float?

For reference, the sharcnet tip is covered in the programming guide:


You might get more help if you provide a runnable test case.

Hey txbob,

Thanks for you reply and the reference. Unfortunately I can’t really switch to 32-bit ints.

Looks like you are doing HOG feature calculation on the GPU.

You could try to make the parallelization coarser, by letting one GPU thread calculate one block histogram as recommended in [1]. Excerpt from there (chapter 3.2): " After checking and discarding several cooperative
strategies, all of them limited by atomic memory operations and thread divergence, we decided
to map one thread to the task of computing one block histogram"

We employed (with success) a similar strategy in our GPU implmentation of SIFT feature exctraction.

Disadvantage is that you will get much less GPU threads, maybe not enough for high-end GPUs and small images. One could think then of processing multiple images at once by invoking the kernel on multiple CUDA streams.

[1] GPU-based pedestrian detection for autonomous driving, Campmany et al, 2016, http://www.sciencedirect.com/science/article/pii/S1877050916309395

some additional dialog on atomics, global vs. shared, pointing out that in some scenarios global atomics may be faster:


“As we can see from the Kepler performance plots, the global atomics perform better than shared in most cases, except the images with high entropy. … In the 100% entropy case (white noise) we have perfect bin distribution and atomic conflicts do not play a significant role in performance; in this special case the shared memory version helps save bandwidth and outperforms the global atomics.”

Thanks for your responses txbob and Hannes,

You’re right about the HoG algorithm, I’ll take a look at that paper. It also contains a couple useful references that are worth looking into. The links should provide useful references for a written explanation.

Thanks for your responses txbob and Hannes,

You’re right about the HoG algorithm, I’ll take a look at that paper and later on I can try the one thread per cell approach. It also contains a couple useful references that are worth looking into. The links should provide useful references for a written explanation!