problem with __shared__ on device emulator

Here is the sample code I was trying to run:
#include <cuda.h>
#include <stdio.h>

__global__ void myKernel()
{
    __device__ __shared__ float a[1]; // Is an array, must be stored in global memory
    __device__ __shared__ float b;    // Stored in shared memory
    a[0]++;
    b++;
    printf("a[0]=%f b=%f\n", a[0], b);
}
int main()
{
    dim3 myGrid(2, 1, 1);
    dim3 myBlock(256, 1, 1);
    myKernel<<<myGrid, myBlock>>>();
    return 0;
}

When I run this code under the device emulator using the command:
$ nvcc -deviceemu -run copy.cu

The output I get is of the form:
a[0]=1.000000 b=1.000000
a[0]=2.000000 b=2.000000
a[0]=3.000000 b=3.000000
a[0]=4.000000 b=4.000000
.
.
.
a[0]=508.000000 b=508.000000
a[0]=509.000000 b=509.000000
a[0]=510.000000 b=510.000000
a[0]=511.000000 b=511.000000
a[0]=512.000000 b=512.000000

Now, for array a, I know the memory is allocated in global memory, so the same variable is updated by every thread, and the result is as expected. However, the variable b is allocated in shared memory, so its value is shared only by the threads of a single block. I therefore expected the threads of each block to increment their own copy, giving values of b ranging from 1 to 256, with each value occurring twice. I don't have an NVIDIA card yet and am thus bound to use the emulator.

Is there an error in my understanding, or is it a bug in the emulator?
Could someone please run the same code on a machine with an NVIDIA device and reply with the result? I would be grateful.

First, I'm not sure that a will reside in global memory. According to the programming guide (page 21), the __shared__ qualifier, optionally used together with __device__, declares a variable that resides in the shared memory space of a thread block. Where are you reading that it should be in global memory?

Second, the variable b is not initialized and might not start out as zero. It may also not be reset to zero between blocks, so if both blocks execute sequentially on the same multiprocessor, b might still hold 256 when the second block begins. I am also unsure of the behavior on devices without atomic increment instructions: if all threads within a warp load the value (each getting the same value), increment it, and then store it back, the value may be incremented far fewer times than expected. Even with atomic instructions, I am not sure whether the ++ operator will invoke them; you may have to call atomicAdd() explicitly to get that behavior.
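For illustration, here is a minimal sketch of how a per-block counter could be made deterministic: one thread zeroes the shared variable, the block synchronizes, and every increment goes through atomicAdd(). This assumes a device with shared-memory atomics (compute capability 1.2 or later), and it uses an int counter because floating-point atomicAdd() is not available on early hardware; the kernel name countThreads is just for illustration, and the device-side printf() only works in emulation mode on devices of that era.

#include <stdio.h>

__global__ void countThreads()
{
    __shared__ int b;                 // one counter per block, in shared memory

    if (threadIdx.x == 0)             // shared memory starts uninitialized,
        b = 0;                        // so have a single thread zero it
    __syncthreads();                  // make the zero visible to the whole block

    atomicAdd(&b, 1);                 // race-free increment of the shared counter
    __syncthreads();                  // wait until every thread has added its 1

    if (threadIdx.x == 0)             // one thread reports the final count:
        printf("block %d: b = %d\n", blockIdx.x, b);  // should be 256 per block
}

Launched with the same <<<myGrid, myBlock>>> configuration as in your code, each block should then report b = 256, independent of scheduling.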