Error on device emulator problem with __shared__

Here is the sample code I was trying to run:

__global__ void myKernel()
__device__ __shared__ float a[1]; // Is an array, must be stored in global memory
__device__ __shared__ float b; // Stored in shared memory
printf("a[0]=%f b=%f\n", a[0], b);
dim3 myGrid(2,1,1);
dim3 myBlock(256,1,1);

When I run this code on the device emulator using the command:
$ nvcc -deviceemu -run

The output I get is of the form:
a[0]=1.000000 b=1.000000
a[0]=2.000000 b=2.000000
a[0]=3.000000 b=3.000000
a[0]=4.000000 b=4.000000
a[0]=508.000000 b=508.000000
a[0]=509.000000 b=509.000000
a[0]=510.000000 b=510.000000
a[0]=511.000000 b=511.000000
a[0]=512.000000 b=512.000000

Now, for the array a, I understand that the memory is allocated in global memory, so the same variable is updated by every thread and the result is as expected. However, the shared variable b is allocated in shared memory, so its value is shared only by the threads of one block. I therefore expected the threads of each block to increment their own copy, giving values of b ranging from 1 to 256, with each value occurring twice. I don't have an NVIDIA card yet, so I am bound to use the emulator.

Is there an error in my understanding, or is this a bug in the emulator?

Could someone please run the same code on a machine with an NVIDIA device and reply with the result? I would be grateful.


Are you sure that the array must be stored in global memory and b in shared memory? Why do you say that?

In my opinion both must be stored in shared memory. If you want, give me the original code; I have an NVIDIA 9800 GT.
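
To illustrate the point that a __shared__ array and a __shared__ scalar behave the same way, here is a minimal sketch. It is not the original poster's code: the kernel name showSharedScope and the output buffers outA and outB are made up for illustration. Thread 0 of each block writes a block-specific value into both variables, and the last thread of the block reads them back after a barrier. The results are copied to the host instead of calling printf in the kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration; not the code from the question.
__global__ void showSharedScope(float *outA, float *outB)
{
    __shared__ float a[1];  // array declared __shared__: one copy per block
    __shared__ float b;     // scalar declared __shared__: one copy per block

    // One thread per block writes a value that depends on the block index.
    if (threadIdx.x == 0) {
        a[0] = 100.0f + blockIdx.x;
        b    = 200.0f + blockIdx.x;
    }
    __syncthreads();  // make the writes visible to the whole block

    // A different thread of the same block reads the values back.
    if (threadIdx.x == blockDim.x - 1) {
        outA[blockIdx.x] = a[0];
        outB[blockIdx.x] = b;
    }
}

int main()
{
    const int blocks = 2, threads = 256;  // same launch shape as the question
    float hA[blocks], hB[blocks];
    float *dA = NULL, *dB = NULL;

    cudaMalloc(&dA, blocks * sizeof(float));
    cudaMalloc(&dB, blocks * sizeof(float));

    showSharedScope<<<blocks, threads>>>(dA, dB);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(hA, dA, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hB, dB, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < blocks; ++i)
        printf("block %d: a[0]=%f b=%f\n", i, hA[i], hB[i]);

    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```

If both variables are per-block, each block should report its own pair of values for a[0] and b, with neither block seeing the other's writes.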

One cannot printf from within a kernel on the device.

But I still won’t run it. It is pointless too. 1) You aren’t initializing your shared memory. Thus any data you will get out is bound to be completely random (whether on the host or device). 2) You have race conditions with all threads trying to do variable++ at once. The results are also undefined because of this.
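
For what it is worth, here is a rough sketch of how the counting experiment could be written so that the result is well defined. It is only a sketch under some assumptions: the names countPerBlock, tickets and totals are made up, the counter is an int rather than a float, and atomicAdd on shared memory needs hardware that supports it (compute capability 1.2 or later, as far as I know). The shared counter is initialized by one thread, a barrier follows, and the increment is done atomically.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration; not the code from the question.
__global__ void countPerBlock(int *tickets, int *totals)
{
    __shared__ int b;  // one counter per block

    // Point 1: initialize shared memory before anyone reads it.
    if (threadIdx.x == 0)
        b = 0;
    __syncthreads();

    // Point 2: atomicAdd avoids the race of a plain b++.
    // It returns the old value, so each thread gets a unique 1..blockDim.x.
    int ticket = atomicAdd(&b, 1) + 1;
    tickets[blockIdx.x * blockDim.x + threadIdx.x] = ticket;

    __syncthreads();  // wait until every thread has incremented
    if (threadIdx.x == 0)
        totals[blockIdx.x] = b;
}

int main()
{
    const int blocks = 2, threads = 256;
    int hTotals[blocks];
    int *dTickets = NULL, *dTotals = NULL;

    cudaMalloc(&dTickets, blocks * threads * sizeof(int));
    cudaMalloc(&dTotals, blocks * sizeof(int));

    countPerBlock<<<blocks, threads>>>(dTickets, dTotals);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(hTotals, dTotals, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < blocks; ++i)
        printf("block %d: final b=%d\n", i, hTotals[i]);

    cudaFree(dTickets);
    cudaFree(dTotals);
    return 0;
}
```

With the counter initialized and incremented atomically, each block should end with b equal to its own thread count, and the tickets array holds the values 1 to 256 once per block, though in no particular order across threads.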