Shared memory write conflicts
Looking for a little help...

I was wondering if anyone could help me solve a problem I’ve encountered in my CUDA code with regard to shared memory write conflicts.

The basic problem is that writes to shared memory aren’t atomic, so I sometimes get unpredictable values in my “results” array. I added the “new_i” variable so that no two threads would ever write to the same location at the same time. However, that is only guaranteed to work if all threads execute in perfect lockstep, which I know isn’t actually the case.

I’m thinking I could solve this with a better understanding of warps. I’m fairly new to CUDA programming and haven’t had much experience with the concept.

Here is some pseudo-code to explain my issue. The kernel is launched with 256 threads per block.

__shared__ float results[256];

results[threadIdx.x] = 0.0f;   // zero out results[]
__syncthreads();

for (int j = 0; j < 100; j++)
{
    // Calculate valid
    if (valid)
    {
        for (int i = 0; i < 256; i++)
        {
            int new_i = (i + threadIdx.x) % 256;
            // Calculate "value"
            results[new_i] += value;   // <-- the racy read-modify-write
        }
    }
}

__syncthreads();

// Write results to global memory

Any help would be much appreciated.

At first glance, the histogram64 SDK example might be useful.

I believe the unexpected results you’re seeing are due to race conditions in your code. As you mentioned, shared memory accesses are not atomic, so another thread can modify a particular location between the read and the write of the incremented value (warps are scheduled dynamically, so threads in different warps interleave their accesses in an unpredictable order).
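To make the race concrete: the += on shared memory is really a separate load, add, and store, so two threads in different warps that target the same element can interleave those steps and lose an update. Roughly:

float tmp = results[new_i];   // (1) load the current value
tmp = tmp + value;            // (2) add this thread's contribution
results[new_i] = tmp;         // (3) store the result back
// If threads A and B (in different warps) hit the same new_i, the ordering
// A(1) B(1) A(2) B(2) A(3) B(3) ends with only B's contribution recorded.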

The best thing would be to rewrite the program so that it doesn’t have such critical sections. If that’s not viable, you’ll have to go with one of the approaches used in the non-atomic histogram samples.
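For the pseudo-code above, one way to do that rewrite (just a sketch under assumptions, not tested code): for a fixed i, new_i = (i + threadIdx.x) % 256 is different for every thread, so the writes only collide when warps are at different values of i. A barrier inside the i loop keeps every warp on the same i. This assumes the valid/value computation can be hoisted out of the divergent branch so that all 256 threads reach __syncthreads() (invalid threads simply add 0.0f):

for (int j = 0; j < 100; j++)
{
    // Calculate valid (every thread evaluates it, so there is no divergence
    // around the barrier below)
    for (int i = 0; i < 256; i++)
    {
        int new_i = (i + threadIdx.x) % 256;    // unique per thread for this i
        // Calculate "value"
        results[new_i] += valid ? value : 0.0f; // no two threads share new_i now
        __syncthreads();                        // keep every warp on the same i
    }
}

Note that this costs 256 barriers per j iteration, so it trades speed for correctness; the per-thread sub-histogram trick in histogram64 avoids the barriers at the price of extra shared memory.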

Paulius

Are there plans to add atomic writes to shared memory in future compute versions?

We don’t comment on unreleased hardware/software :)

Paulius

I guess that is the policy :( , but I have the same issue and hope you can add an atomicAdd() for floating point.

I read the histogram project. For a BIN_COUNT histogram, it requires BIN_COUNT * THREAD_PER_BLOCK words of storage, and if you run more than 1 block per multiprocessor (it’s recommended to have at least 2), that puts even more strain on the 16 KB of shared memory. For my problem, the “histogram” is 3-D and has to live in global memory. If it were a true (integer) histogram I could use atomicAdd(); however, the increment is a floating-point value, so I need a floating-point atomicAdd().
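On devices that support atomicCAS() on 32-bit words in global memory (compute capability 1.1 and up), a floating-point add can be emulated with a compare-and-swap loop. A sketch of that well-known workaround (the function name here is my own, not from the SDK):

// Emulated floating-point atomicAdd on global memory, built on atomicCAS().
// Requires a device with 32-bit global-memory atomics (compute 1.1+).
__device__ float atomicAddFloat(float* address, float val)
{
    int* address_as_int = (int*)address;
    int old = *address_as_int;
    int assumed;
    do {
        assumed = old;
        // Reinterpret the bits, add, and try to swap the new value in.
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);   // retry if another thread changed the value
    return __int_as_float(old);
}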

BTW, I understand that the 8800 and your Tesla line are currently compute capability 1.0. I assume that is a hardware limitation and can’t be changed, right?

My application has a 3D “histogram” also. I’ve got an 8800 GTX, so no atomicAdd for me. Tests showed that the scattered memory access pattern of the atomicAdd “histogram” would kill my overall performance anyway. I tried another version of the program in which each block computes the value of a single “bin” of the histogram, but that requires O(N^2) memory reads, so it is painfully slow for large datasets.

In the end, I found it faster to copy the data back from the card, histogram it on the CPU, then copy the histogram back to the card and continue the rest of the application on the GPU.
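A minimal sketch of that round-trip, with hypothetical names and sizes (d_samples, d_weights, h_hist, etc. are assumptions, not the poster’s actual code):

// Copy samples and weights back, accumulate a float "histogram" on the CPU,
// then copy the finished histogram forward again. All names are hypothetical.
cudaMemcpy(h_samples, d_samples, num_samples * sizeof(int),
           cudaMemcpyDeviceToHost);
cudaMemcpy(h_weights, d_weights, num_samples * sizeof(float),
           cudaMemcpyDeviceToHost);

for (int n = 0; n < num_samples; n++)
    h_hist[h_samples[n]] += h_weights[n];   // no race on the CPU

cudaMemcpy(d_hist, h_hist, num_bins * sizeof(float),
           cudaMemcpyHostToDevice);
// ...continue the rest of the application on the GPU using d_hist.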