shared memory

coderunner · January 30, 2009, 4:07pm

Hi,

I am new to CUDA and do not understand how to use shared variables.

Could somebody help me understand shared variables?

I wrote a simple program that counts the odd numbers in a given vector of 10000 elements.

I tried to use a shared variable to count the odd numbers in each block, but I didn’t succeed. The code works fine in emulation mode but not in regular mode.

I want to return the overall total instead of the count in each block.

I greatly appreciate your help.

Thanks

SPWorley · January 30, 2009, 4:35pm

The process of returning the sum of something (like an odd number count) over all the threads is very common in CUDA programming. It’s called reduction.

There’s an excellent example of reduction in the CUDA SDK demonstration projects.
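As a rough illustration of the idea (this is not the SDK sample itself), here is a minimal sketch of a block-level reduction that counts odd numbers. Each block reduces its own slice in shared memory and writes one partial count, and the host adds up the partial counts. The names `count_odd_per_block`, `block_counts`, and `THREADS` are made up for this example, and the loop assumes the block size is a power of two.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 256  // assumed block size; must be a power of two for the loop below

// Each block counts the odd values in its slice and writes one partial count.
__global__ void count_odd_per_block(const int *in, int *block_counts, int n)
{
    __shared__ int partial[THREADS];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread contributes 1 if its element is odd, 0 otherwise.
    // Threads past the end of the array contribute 0.
    partial[threadIdx.x] = (idx < n && in[idx] % 2 != 0) ? 1 : 0;
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Thread 0 now holds the count for the whole block.
    if (threadIdx.x == 0)
        block_counts[blockIdx.x] = partial[0];
}

int main()
{
    const int n = 100000;
    const int blocks = (n + THREADS - 1) / THREADS;

    int *h_in = (int *)malloc(n * sizeof(int));
    int *h_counts = (int *)malloc(blocks * sizeof(int));
    for (int k = 0; k < n; k++) h_in[k] = k + 1;

    int *d_in, *d_counts;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_counts, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    count_odd_per_block<<<blocks, THREADS>>>(d_in, d_counts, n);
    cudaMemcpy(h_counts, d_counts, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    // Finish the reduction on the host by adding the per-block counts.
    int total = 0;
    for (int b = 0; b < blocks; b++) total += h_counts[b];
    printf("odd numbers: %d\n", total);

    cudaFree(d_in);
    cudaFree(d_counts);
    free(h_in);
    free(h_counts);
    return 0;
}
```

For the values 1 to 100000 used here, the total should come out as 50000. The SDK example is considerably more optimized than this; the sketch only shows the basic pattern.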

coderunner · January 30, 2009, 7:10pm

Thank you so much, Worley. I don’t understand how to create a counter that can be incremented across all blocks. Could you kindly help me? I appreciate your help.

Here is my code.

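#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cutil.h>  // CUT_CHECK_ERROR, from the CUDA SDK utility library

__global__ void check_gpu(int *ele, int *out, int no_ele)
{
    long int idx = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ int sele[250]; // One slot per thread; the block size must not exceed 250

    // Load from global memory into shared memory (0 for threads past the end)
    sele[threadIdx.x] = (idx < no_ele) ? ele[idx] : 0;
    __syncthreads();

    // Flag odd numbers with 1 and even numbers with 0
    if (idx < no_ele)
        out[idx] = (sele[threadIdx.x] % 2 != 0) ? 1 : 0;
}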

int main(int argc, char **argv)
{
    time_t time1 = time(NULL);

    // Allocate host memory and fill the input with 1..no_ele
    int no_ele = 100000;
    int *ele_host = (int *)malloc(sizeof(int) * no_ele);
    int *out_host = (int *)malloc(sizeof(int) * no_ele);
    for (int k = 0; k < no_ele; k++) {
        ele_host[k] = k + 1;
    }

    // Allocate device memory
    int *ele_dev; // Elements on the device
    cudaMalloc((void **)&ele_dev, sizeof(int) * no_ele);
    int *out_dev; // Per-element flags on the device
    cudaMalloc((void **)&out_dev, sizeof(int) * no_ele);

    // Copy data from host memory to device memory
    cudaMemcpy(ele_dev, ele_host, sizeof(int) * no_ele, cudaMemcpyHostToDevice);

    // Configure device threads and blocks: 400 * 250 = 100000, one thread per element
    int no_blocks = 400;
    int no_threads = 250;

    check_gpu<<<no_blocks, no_threads>>>(ele_dev, out_dev, no_ele);
    CUT_CHECK_ERROR("Kernel execution failed");

    // Copy the per-element flags back and print them
    cudaMemcpy(out_host, out_dev, sizeof(int) * no_ele, cudaMemcpyDeviceToHost);
    for (int k = 0; k < no_ele; k++) {
        printf("%d\n", out_host[k]);
    }

    // Print the elapsed wall-clock time in seconds
    time_t time3 = time(NULL);
    printf("%f\n", difftime(time3, time1));

    cudaFree(ele_dev);
    cudaFree(out_dev);
    free(ele_host);
    free(out_host);
    return 0;
}
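One possible way to get a single counter that every block adds to is to use atomic operations. The following is only a sketch under some assumptions: it needs a GPU that supports atomics on both shared and global memory (compute capability 1.2 or higher), and the names `count_odd_atomic`, `block_count`, and `total_dev` are invented for this example. Each block first counts into a shared variable, then one thread per block adds that count to a global total.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void count_odd_atomic(const int *ele, int *total, int no_ele)
{
    __shared__ int block_count;  // One counter shared by all threads in the block

    // A shared variable has no initial value, so one thread must zero it.
    if (threadIdx.x == 0) block_count = 0;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < no_ele && ele[idx] % 2 != 0)
        atomicAdd(&block_count, 1);  // Safe concurrent increment in shared memory
    __syncthreads();

    // One thread per block adds the block's count to the global total.
    if (threadIdx.x == 0)
        atomicAdd(total, block_count);
}

int main()
{
    const int no_ele = 100000;
    int *ele_host = (int *)malloc(sizeof(int) * no_ele);
    for (int k = 0; k < no_ele; k++) ele_host[k] = k + 1;

    int *ele_dev, *total_dev;
    cudaMalloc((void **)&ele_dev, sizeof(int) * no_ele);
    cudaMalloc((void **)&total_dev, sizeof(int));
    cudaMemcpy(ele_dev, ele_host, sizeof(int) * no_ele, cudaMemcpyHostToDevice);
    cudaMemset(total_dev, 0, sizeof(int));  // The global counter must start at zero

    count_odd_atomic<<<400, 250>>>(ele_dev, total_dev, no_ele);

    int h_total = 0;
    cudaMemcpy(&h_total, total_dev, sizeof(int), cudaMemcpyDeviceToHost);
    printf("odd numbers: %d\n", h_total);

    cudaFree(ele_dev);
    cudaFree(total_dev);
    free(ele_host);
    return 0;
}
```

A plain `block_count++` from many threads at once would be a race and would lose updates, which is why the sketch uses `atomicAdd` for every increment. With the input 1 to 100000 the printed total should be 50000.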
