Shared Mem (w/ & w/out extern)

Hello everyone,

I have a little problem with the extern qualifier…

The kernel below sums the elements of a vector of 140 integers. Its values run from 1 to 140, so the sum should be 9870, i.e. n(n+1)/2 with n = 140.

However, it only produces the correct result when I explicitly state the size of the static shared memory array. If I want the smem pseudo-dynamically allocated, as in the code below, I get the wrong answer. In fact I get 36, which means it only sums the first 8 numbers of the array.

Now please note, I’m using ONE block w/ only ONE thread for conceptual simplicity. I’m well aware that this is ridiculously inefficient, BUT inefficiency shouldn’t change the answer. Since there is only 1 thread, there is no competition for the memory resources. It’s essentially 1 ALU doing all the computation, right? (I’m also aware of the SDK example, but the source code implementing the smem isn’t so straightforward.)

Thanks for your attention!

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void reduction_kernel(int *data, int *output)
{
	int tid = threadIdx.x;
	int result, j;

	extern __shared__ int sdata[];
	//__shared__ int sdata[140];

	// copy the input vector into shared memory
	for ( j = 0 ; j < 140 ; j ++ )
		sdata[tid+j] = data[tid+j];

	// sum the shared-memory copy
	result = 0;
	for ( j = 0 ; j < 140 ; j ++ )
		result += sdata[tid+j];

	output[tid] = result;
}

int main()
{
	int i, num_elements = 140;
	int *h_data, *h_output;
	int *d_data, *d_output;

	int data_size = num_elements * sizeof(int);
	int output_size = 1 * sizeof(int);

	h_data   = (int*)malloc(data_size);
	h_output = (int*)malloc(output_size);

	cudaMalloc((void **) &d_data,  data_size);
	cudaMalloc((void **) &d_output, output_size);

	// fill host vector
	for ( i = 0 ; i < num_elements ; i ++ )
		h_data[i] = i+1;

	// transfer host vector to device
	cudaMemcpy(d_data, h_data, data_size, cudaMemcpyHostToDevice);

	// execute kernel
	reduction_kernel <<< 1, 1 >>> ( d_data, d_output );

	// transfer device result to host
	cudaMemcpy(h_output, d_output, output_size, cudaMemcpyDeviceToHost);

	printf("Result = %d\n", h_output[0]);

	return 0;
}[/codebox]

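If you use the extern qualifier on shared memory, you are required to set the third argument in the kernel launch configuration, specifying how many bytes of shared memory you want per block. This can be a value computed at runtime, but it has to be there. If it is omitted, the dynamic shared memory size defaults to 0 bytes, so every access to sdata in your kernel is out of bounds.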
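For illustration, here is a minimal sketch of what that launch could look like. The helper name sum_first_n and the extra kernel parameter n are my own assumptions for the example, not part of the original code, and the kernel still uses a single thread as in the question (so the index is just j instead of tid+j).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Single-thread sum, as in the question. The size of sdata is not known
// at compile time; it is taken from the third launch argument.
__global__ void reduction_kernel(const int *data, int *output, int n)
{
    extern __shared__ int sdata[];

    // stage the input in shared memory
    for (int j = 0; j < n; j++)
        sdata[j] = data[j];

    // sum the shared-memory copy
    int result = 0;
    for (int j = 0; j < n; j++)
        result += sdata[j];

    output[0] = result;
}

// Hypothetical helper (not in the original post): fills a vector with
// 1..n, sums it on the device, and returns the sum, or -1 on a CUDA error.
int sum_first_n(int n)
{
    size_t data_size = n * sizeof(int);   // computed at runtime
    int *h_data = (int *)malloc(data_size);
    for (int i = 0; i < n; i++)
        h_data[i] = i + 1;

    int *d_data = NULL, *d_output = NULL;
    int h_output = -1;

    cudaMalloc((void **)&d_data, data_size);
    cudaMalloc((void **)&d_output, sizeof(int));
    cudaMemcpy(d_data, h_data, data_size, cudaMemcpyHostToDevice);

    // third launch argument = bytes of dynamic shared memory per block
    reduction_kernel<<<1, 1, data_size>>>(d_data, d_output, n);

    // the blocking copy also reports errors from the kernel launch
    cudaError_t err = cudaMemcpy(&h_output, d_output, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        h_output = -1;

    cudaFree(d_data);
    cudaFree(d_output);
    free(h_data);
    return h_output;
}
```

Calling sum_first_n(140) from main should give the 9870 expected in the question, since the block now gets 140 * sizeof(int) = 560 bytes of shared memory instead of none.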

thanks seibert, that solved it!