Hello everyone,
I have a little problem with the extern qualifier…
The kernel below sums the elements of a vector of 140 integers. Its values run from 1 up to 140, so the sum should be n(n+1)/2 = 9870.
However, it only produces the correct result when I explicitly state the amount of static shared memory. If I instead have the smem pseudo-dynamically allocated via extern, as in the code below, I get the wrong answer. In fact I get 36, which means it only sums the first 8 numbers of the array (8·9/2 = 36).
Now please note, I’m using ONE block with only ONE thread for conceptual simplicity. I’m well aware this is ridiculously inefficient, BUT inefficiency shouldn’t change the answer. Since there is only 1 thread, there is no contention for memory resources; it’s essentially one ALU doing all the computation, right? (I’m also aware of the SDK reduction example, but its shared-memory code isn’t so straightforward.)
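For comparison, here is the pattern I understood from the docs for an extern __shared__ array: the byte count goes in the third <<< >>> execution-configuration argument. This is just a stripped-down sketch of what I think the standard usage looks like (kernel and variable names are mine), so maybe someone can tell me where my version below departs from it:

[codebox]__global__ void sum_kernel(int *data, int *output, int n)
{
    // extern __shared__ arrays take their size from the launch configuration
    extern __shared__ int sdata[];
    int result = 0;
    for (int j = 0; j < n; j++)
        sdata[j] = data[j];
    for (int j = 0; j < n; j++)
        result += sdata[j];
    output[0] = result;
}

// launch with the shared-memory byte count as the third argument:
// sum_kernel <<< 1, 1, n * sizeof(int) >>> (d_data, d_output, n);[/codebox]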
Thanks for your attention!
[codebox]#include <stdio.h>
#include <cuda.h>
__global__ void reduction_kernel(int *data, int *output)
{
int tid = threadIdx.x;
int result, j;
extern __shared__ int sdata[];
//__shared__ int sdata[140];
for ( j = 0 ; j < 140 ; j ++ )
sdata[tid+j] = data[tid+j];
result = 0;
for ( j = 0 ; j < 140 ; j ++ )
result += sdata[tid+j];
output[tid] = result;
}
int main()
{
int i, num_elements = 140;
int *h_data, *h_output;
int *d_data, *d_output;
int data_size = num_elements * sizeof(int);
int output_size = 1 * sizeof(int);
h_data = (int*)malloc(data_size);
h_output = (int*)malloc(output_size);
cudaMalloc((void **) &d_data, data_size);
cudaMalloc((void **) &d_output, output_size);
// fill host vector
for ( i = 0 ; i < num_elements ; i ++ )
h_data[i] = i+1;
// transfer host vector to device
cudaMemcpy(d_data, h_data, data_size, cudaMemcpyHostToDevice);
// execute kernel
reduction_kernel <<< 1, 1 >>> ( d_data, d_output );
// transfer device result to host
cudaMemcpy(h_output, d_output, output_size, cudaMemcpyDeviceToHost);
printf("Result = %d\n", h_output[0]);
// free device and host memory
cudaFree(d_data);
cudaFree(d_output);
free(h_data);
free(h_output);
return 0;
}[/codebox]