problem with __shared__ on device emulator

Here is the sample code I was trying to run:
#include <cuda.h>
#include <stdio.h>

__global__ void myKernel()
{
    __device__ __shared__ float a[1]; // Is an array, must be stored in global memory
    __device__ __shared__ float b;    // Stored in shared memory
    a[0]++;
    b++;
    printf("a[0]=%f b=%f\n", a[0], b);
}
int main()
{
    dim3 myGrid(2, 1, 1);
    dim3 myBlock(256, 1, 1);
    myKernel<<<myGrid, myBlock>>>();
    return 0;
}

When I run this code under the device emulator using the command:
$ nvcc -deviceemu -run copy.cu

The output I get is of the form:
a[0]=1.000000 b=1.000000
a[0]=2.000000 b=2.000000
a[0]=3.000000 b=3.000000
a[0]=4.000000 b=4.000000
.
.
.
a[0]=508.000000 b=508.000000
a[0]=509.000000 b=509.000000
a[0]=510.000000 b=510.000000
a[0]=511.000000 b=511.000000
a[0]=512.000000 b=512.000000

Now, for array a, I know the memory is allocated in global memory, so the same variable is updated by every thread, and the result is as expected. However, the variable b is allocated in shared memory, so its value is shared only by the threads of a single block. I therefore expected the threads of each block to increment their own copy, giving values of b ranging from 1 to 256, with each value occurring twice. I don't have an NVIDIA card yet and am thus bound to use the emulator.

Is there an error in my understanding, or is it a bug in the emulator?
Could someone please run the same code on a machine with an NVIDIA device and reply with the result? I would be grateful.

First, I'm not sure that a will reside in global memory. According to the programming guide (page 21), the __shared__ qualifier, optionally used together with __device__, declares a variable that resides in the shared memory space of a thread block. Where are you reading that it should be in global memory?

Second, the variable b is not initialized and might not start out as zero. It may also not be reset to zero between blocks, so if both blocks execute sequentially on the same multiprocessor, b might still hold 256 when the second block begins. I am also unsure of the behavior on devices without atomic increment instructions: if all threads within a warp load the value (each getting the same value), increment it, and then store it back, the value may be incremented far fewer times than expected. Even with atomic instructions, I am not sure whether the ++ operator will invoke them; you may have to call atomicAdd() explicitly to get that behavior.
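For illustration, here is a minimal sketch of how a per-block counter could be made deterministic: one thread zeroes the shared variable, the block synchronizes, and every increment goes through atomicAdd(). This assumes a device with shared-memory atomics (compute capability 1.2 or later), and it uses an int counter because floating-point atomicAdd() is not available on early hardware; the kernel name countThreads is just for illustration, and the device-side printf() only works in emulation mode on devices of that era.

#include <stdio.h>

__global__ void countThreads()
{
    __shared__ int b;                 // one counter per block, in shared memory

    if (threadIdx.x == 0)             // shared memory starts uninitialized,
        b = 0;                        // so have a single thread zero it
    __syncthreads();                  // make the zero visible to the whole block

    atomicAdd(&b, 1);                 // race-free increment of the shared counter
    __syncthreads();                  // wait until every thread has added its 1

    if (threadIdx.x == 0)             // one thread reports the final count:
        printf("block %d: b = %d\n", blockIdx.x, b);  // should be 256 per block
}

Launched with the same <<<myGrid, myBlock>>> configuration as in your code, each block should then report b = 256, independent of scheduling.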