Shared Mem (w/ & w/out extern)

Nikolai · October 2, 2009, 10:38pm

Hello everyone,

I have a little problem with the extern qualifier…

The kernel below sums the elements of a vector of 140 integers. Its values run from 1 all the way to 140 (so the sum is 9870, n(n+1)/2).

However, it only produces the correct result when I explicitly state the amount of static shared memory. If I want the smem pseudo-dynamically allocated, as in the code below, I get the wrong answer. In fact I get 36, which means it only sums the first 8 numbers of the array.

Now please note, I’m using ONE block w/ only ONE thread for conceptual simplicity. I’m well aware that this is ridiculously inefficient, BUT inefficiency shouldn’t change the answer. Since there is only 1 thread, there is no competition for the memory resources. It’s essentially 1 ALU doing all the computation, right? (I’m also aware of the SDK example, but the source code implementing the smem isn’t so straightforward.)

Thanks for your attention!

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void reduction_kernel(int *data, int *output)
{
    int tid = threadIdx.x;
    int result, j;

    extern __shared__ int sdata[];
    //__shared__ int sdata[140];

    // copy the input from global memory into shared memory
    for ( j = 0 ; j < 140 ; j ++ )
        sdata[tid+j] = data[tid+j];

    // sum the values held in shared memory
    result = 0;
    for ( j = 0 ; j < 140 ; j ++ )
        result += sdata[tid+j];

    output[tid] = result;
}

int main()
{
    int i, num_elements = 140;
    int *h_data, *h_output;
    int *d_data, *d_output;
    int data_size = num_elements * sizeof(int);
    int output_size = 1 * sizeof(int);

    h_data   = (int*)malloc(data_size);
    h_output = (int*)malloc(output_size);
    cudaMalloc((void **) &d_data,  data_size);
    cudaMalloc((void **) &d_output, output_size);

    // fill host vector
    for ( i = 0 ; i < num_elements ; i ++ )
        h_data[i] = i+1;

    // transfer host vector to device
    cudaMemcpy(d_data, h_data, data_size, cudaMemcpyHostToDevice);

    // execute kernel
    reduction_kernel <<< 1, 1 >>> ( d_data, d_output );

    // transfer device result to host
    cudaMemcpy(h_output, d_output, output_size, cudaMemcpyDeviceToHost);
    printf("Result = %d\n", h_output[0]);
}[/codebox]

seibert · October 2, 2009, 10:42pm

If you use the extern qualifier on shared memory, you are required to set the third argument in the kernel launch configuration, specifying how many bytes of shared memory you want per block. This can be a value computed at runtime, but it has to be there.
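
For readers landing here later, a minimal sketch of what that change might look like, applied to the code from the first post. The `n` kernel parameter, the `smem_bytes` variable, and the error check are additions made for this sketch and are not part of the original code. When the third launch argument is omitted it defaults to 0 bytes, so the `extern` array has no storage behind it.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Same single-thread kernel as in the question; the length n is passed in
// (an addition for this sketch) instead of being hard-coded as 140.
__global__ void reduction_kernel(const int *data, int *output, int n)
{
    // Size is unknown at compile time; it comes from the launch configuration.
    extern __shared__ int sdata[];

    int tid = threadIdx.x;

    // copy the input from global memory into shared memory
    for (int j = 0; j < n; j++)
        sdata[tid + j] = data[tid + j];

    // sum the values held in shared memory
    int result = 0;
    for (int j = 0; j < n; j++)
        result += sdata[tid + j];

    output[tid] = result;
}

int main()
{
    const int num_elements = 140;
    const size_t data_size  = num_elements * sizeof(int);
    // bytes of dynamic shared memory per block (hypothetical variable name)
    const size_t smem_bytes = num_elements * sizeof(int);

    int *h_data = (int *)malloc(data_size);
    for (int i = 0; i < num_elements; i++)
        h_data[i] = i + 1;

    int *d_data, *d_output;
    cudaMalloc((void **)&d_data, data_size);
    cudaMalloc((void **)&d_output, sizeof(int));
    cudaMemcpy(d_data, h_data, data_size, cudaMemcpyHostToDevice);

    // The third launch argument sizes the extern __shared__ array.
    reduction_kernel<<<1, 1, smem_bytes>>>(d_data, d_output, num_elements);

    // Report launch problems instead of silently printing a stale value.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int h_output = 0;
    cudaMemcpy(&h_output, d_output, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Result = %d\n", h_output);

    cudaFree(d_data);
    cudaFree(d_output);
    free(h_data);
    return 0;
}
```

With the shared memory size supplied, the kernel should produce the sum stated in the first post, 9870. Because the size is an ordinary runtime value, the same kernel can be launched for other vector lengths, as long as the request fits within the device's per-block shared memory limit.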

Nikolai · October 2, 2009, 11:03pm

thanks seibert, that solved it!
