Unexpected behavior of shared memory array

jacoblyles · July 1, 2009, 9:04pm

In a CUDA kernel, I have code similar to the following. I am trying to compute one numerator per thread, accumulate the numerators over the block to get a denominator, and then return the ratio. However, denom ends up equal to the numer value computed by the thread with the largest threadIdx.x in the block, rather than the sum of the numer values across all threads in the block. Does anyone know what is going on?

extern __shared__ float s_shared[];

float numer = //calculate numerator

s_shared[threadIdx.x] = numer;
s_shared[blockDim.x] += numer;
__syncthreads();

float denom = s_shared[blockDim.x];
float result = numer/denom;

"result" should always be between 0 and 1 and should sum to 1 across the block. Instead, it equals 1.0 for every thread whose threadIdx.x is the maximum, and takes some other value, not confined to that range, for the other threads in the block.

Noel_Lopes · July 1, 2009, 10:17pm


You are missing a __syncthreads();

// …

s_shared[threadIdx.x] = numer;
__syncthreads();
s_shared[blockDim.x] += numer;
__syncthreads();

// …

guernika · July 2, 2009, 7:02am

I am not sure, but I think your kernel is not correct, because blockDim.x threads try to write concurrently to s_shared[blockDim.x]. Hence, the final value of s_shared[blockDim.x] is unpredictable.

Hope this helps

Francesco
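
One common way to avoid the concurrent `+=` on a single shared slot is a tree reduction in shared memory. The following is only a minimal sketch under stated assumptions, not the original poster's code: the names `normalize_by_block_sum` and `normalize_one_block` are hypothetical, the numerator is read from an input array only so the example can be checked, blockDim.x is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread divides its numerator by the block-wide sum.
// Assumption: blockDim.x is a power of two, and the launch supplies
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void normalize_by_block_sum(const float* numer_in, float* result)
{
    extern __shared__ float s_shared[];

    const unsigned int tid = threadIdx.x;
    const unsigned int gid = blockIdx.x * blockDim.x + tid;

    // Keep the numerator in a register; the reduction overwrites s_shared[tid].
    const float numer = numer_in[gid];  // stand-in for "calculate numerator"
    s_shared[tid] = numer;
    __syncthreads();  // every numerator is visible before any thread reads it

    // Tree reduction: halve the number of active threads each step, so no two
    // threads ever update the same shared element in the same step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s_shared[tid] += s_shared[tid + stride];
        }
        __syncthreads();  // placed outside the if so all threads reach it
    }

    const float denom = s_shared[0];  // block-wide sum of the numerators
    result[gid] = numer / denom;
}

// Hypothetical host helper: runs a single block over `numer` and returns the
// ratios. Assumption: numer.size() is a power of two no larger than the
// device's maximum block size.
std::vector<float> normalize_one_block(const std::vector<float>& numer)
{
    const int n = static_cast<int>(numer.size());
    const size_t bytes = n * sizeof(float);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, numer.data(), bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: size of the dynamic shared memory array.
    normalize_by_block_sum<<<1, n, bytes>>>(d_in, d_out);

    // This copy waits for the kernel on the default stream to finish.
    std::vector<float> result(n);
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling `normalize_one_block` on a four-element input should give ratios that sum to 1 up to rounding. Another option on devices of compute capability 2.0 or later is `atomicAdd` on a shared float accumulator, provided the accumulator is set to zero and the block is synchronized before any thread adds to it.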
