Is it mandatory to use shared memory in the kernel

sharath · October 3, 2010, 9:29am

I have written a sum-reduction program that works across multiple blocks. The input size should be a power of 2. It works correctly only when I use shared memory in the kernel. Try removing __shared__ from the declaration and you won't get the correct answer. The answer is zero, because I initialized the device array to 0 in the first place.

That is, use int perblk[512]; rather than __shared__ int perblk[512];

I don't think declaring shared memory is mandatory; I have written other programs without it. Please help me with this…

I am posting the code here and have also attached the file.

[codebox]/* Program on sum reduction, for multiple blocks.
   Input array size should be a power of 2. */

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <time.h>

#define CLK_PER_SEC 1000.0

/* Program works correctly only if shared memory is used in the kernel,
   else the output array will be zero */
__global__ void redsum(int *perblocks, int *cd)
{
    __shared__ int perblk[512];
    //int perblk[512];
    int t = threadIdx.x;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    perblk[t] = perblocks[col];
    __syncthreads();

    for (int stride = blockDim.x / 2; stride >= 1; stride /= 2)
    {
        __syncthreads();
        if (t < stride)
            perblk[t] += perblk[t + stride];
    }

    // thread 0 performed the final addition, so only it stores the result
    if (t == 0)
        cd[blockIdx.x] = perblk[0]; // sum of every block is stored in auxiliary array
}

int main()
{
    int *partialsumh, *cd, *partialsumd, n, i;
    float time2;
    clock_t starth, endh;
    cudaEvent_t start, stop;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    printf("Enter array size: ");
    scanf("%d", &n); // reading array size, needs to be a power of 2

    size_t size = sizeof(int) * n;
    partialsumh = (int*)malloc(size);

    starth = clock();
    for (i = 0; i < n; i++)
        partialsumh[i] = i + 1; // initializing array 1,2,3...

    printf("\nThe array is: ");
    for (i = 0; i < n; i++)
        printf("%d ", partialsumh[i]);
    printf("\n");

    int result = 0;
    for (i = 0; i < n; i++)
        result = result + partialsumh[i]; // finding sum by CPU
    endh = clock();

    // convert clock ticks to milliseconds
    float timing = (endh - starth) * CLK_PER_SEC / CLOCKS_PER_SEC;
    printf("\nTime taken by CPU is %.2fms", timing);

    cudaMalloc((void**)&partialsumd, size);
    cudaMalloc((void**)&cd, size);
    cudaMemset(partialsumd, 0, size); // initialize device array to 0
    cudaMemset(cd, 0, size);          // initialize device array to 0
    cudaMemcpy(partialsumd, partialsumh, size, cudaMemcpyHostToDevice);

    int nb = n / 64 + ((n % 64 == 0) ? 0 : 1); // each thread block has 64 threads

    cudaEventRecord(start, 0);
    redsum<<<nb, 64>>>(partialsumd, cd);
    cudaThreadSynchronize(); // deprecated in later CUDA releases in favor of cudaDeviceSynchronize()

    cudaMemset(partialsumd, 0, size);
    cudaMemcpy(partialsumd, cd, size, cudaMemcpyDeviceToDevice);
    cudaMemset(cd, 0, size);

    /* The program works for any number of inputs; the auxiliary array is reduced
       repeatedly in a do-while loop until it fits in a single thread block. */
    do {
        nb = (nb / 64 == 0) ? 1 : nb / 64;
        redsum<<<nb, 64>>>(partialsumd, cd);
        cudaMemset(partialsumd, 0, size);
        // After each further pass, copy the auxiliary array back to the main array
        cudaMemcpy(partialsumd, cd, size, cudaMemcpyDeviceToDevice);
        cudaMemset(cd, 0, size);
    } while (nb >= 2);

    // copy final answer to the host
    cudaMemcpy(partialsumh, partialsumd, size, cudaMemcpyDeviceToHost);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time2, start, stop);

    printf("\nThe sum computed by CPU is %d ", result);
    printf("\nTime taken by GPU is %.2fms", time2);
    printf("\nthe sum computed by GPU is %d \n\n", partialsumh[0]);

    return 0;
}[/codebox]

tera · October 3, 2010, 10:25am

No, it isn’t.
int perblk[512]; declares a separate array of 512 ints for each thread. The reduction then fails, because each thread only writes element t of its own copy, so the perblk[t+stride] it reads is uninitialized.

Note that in case of a reduction, you don’t need to go through any local variables at all. Just sum them straight from global memory. For even more optimizations, look at the SDK example.
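
A minimal sketch of what summing straight from global memory could look like, with no shared or local array at all. The kernel name redsum_global and the small host driver are made up for illustration, and the sketch assumes that blockDim.x is a power of 2 and that the input length is a multiple of the block size. Note that it reduces in place, so it overwrites the input array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each block reduces its own slice of `data` in place,
// directly in global memory. Assumes blockDim.x is a power of 2 and that
// the array length is a multiple of blockDim.x.
__global__ void redsum_global(int *data, int *blockSums)
{
    int t = threadIdx.x;
    int *seg = data + blockIdx.x * blockDim.x;  // this block's slice

    for (int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        if (t < stride)
            seg[t] += seg[t + stride];
        // Reached by every thread in the block; makes the global-memory
        // writes of this step visible to the block before the next step.
        __syncthreads();
    }
    if (t == 0)
        blockSums[blockIdx.x] = seg[0];
}

int main()
{
    const int n = 256, threads = 64, blocks = n / threads;
    int h[n], hSums[blocks];
    for (int i = 0; i < n; ++i) h[i] = i + 1;   // 1, 2, 3, ...

    int *d, *dSums;
    cudaMalloc(&d, n * sizeof(int));
    cudaMalloc(&dSums, blocks * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    redsum_global<<<blocks, threads>>>(d, dSums);
    cudaMemcpy(hSums, dSums, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    int gpu = 0;
    for (int b = 0; b < blocks; ++b) gpu += hSums[b];  // finish on the host
    printf("GPU sum = %d, expected = %d\n", gpu, n * (n + 1) / 2);

    cudaFree(d);
    cudaFree(dSums);
    return gpu == n * (n + 1) / 2 ? 0 : 1;
}
```

This works because global memory, like shared memory, is visible to every thread in the block; the per-block partial sums are simply added on the host here rather than in a second kernel pass.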

mayouuu · October 4, 2010, 6:48am

Exactly, you have one int perblk[512] per thread.

sharath · October 4, 2010, 1:51pm

Well, I changed the size to

int perblk[64];

and it still won't work without declaring the array as shared memory. Give it a try with and without shared memory.

I am not worried about this particular program; I have the same issue with another program and am posting this one as an example.

If the program is executed without shared memory, the initialized device array does not pick up the changes made in the kernel; it stays the same.

tera · October 11, 2010, 3:15am

My point was not the number of elements in the array.

Shared memory and local memory aren’t just two different areas of memory which can be used interchangeably. A variable in shared memory is the same for every thread in the block, while a variable in local memory is different for every thread in the block.
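
The difference can be made visible with a small experiment, sketched here under a few assumptions: the kernel names neighbour_shared and neighbour_local and the THREADS constant are invented for this example, and the private array is filled with -1 first so that the result is well defined rather than garbage. Each thread writes its own index into slot t and then reads its neighbour's slot.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 64

// One copy of buf per block: every thread sees what the others wrote.
__global__ void neighbour_shared(int *out)
{
    __shared__ int buf[THREADS];
    int t = threadIdx.x;
    buf[t] = t;
    __syncthreads();
    out[t] = buf[(t + 1) % THREADS];   // value written by the neighbouring thread
}

// One private copy of buf per thread: writes by other threads are never seen.
__global__ void neighbour_local(int *out)
{
    int buf[THREADS];
    int t = threadIdx.x;
    for (int i = 0; i < THREADS; ++i)
        buf[i] = -1;                   // initialise so the read below is defined
    buf[t] = t;
    __syncthreads();
    out[t] = buf[(t + 1) % THREADS];   // reads this thread's own copy: still -1
}

int main()
{
    int h[THREADS];
    int *d;
    cudaMalloc(&d, THREADS * sizeof(int));

    neighbour_shared<<<1, THREADS>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("shared: out[0] = %d\n", h[0]);   // neighbour's index, 1

    neighbour_local<<<1, THREADS>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("local:  out[0] = %d\n", h[0]);   // own untouched slot, -1

    cudaFree(d);
    return 0;
}
```

In the reduction kernel from the first post the same thing happens: with a per-thread perblk, the element perblk[t+stride] was never written by the thread that reads it, and without the explicit initialisation used in this sketch its value is undefined.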
