Here is the sample code I was trying to run:
#include <cuda.h>
#include <stdio.h>

__global__ void myKernel()
{
    __device__ __shared__ float a[1]; // Is an array, must be stored in global memory
    __device__ __shared__ float b;    // Stored in shared memory
    a[0]++;
    b++;
    printf("a[0]=%f b=%f\n", a[0], b);
}

int main()
{
    dim3 myGrid(2, 1, 1);
    dim3 myBlock(256, 1, 1);
    myKernel<<<myGrid, myBlock>>>();
    return 0;
}
When I run this code in device emulation mode using the command:
$ nvcc -deviceemu -run copy.cu
The output I get is of the form:
a[0]=1.000000 b=1.000000
a[0]=2.000000 b=2.000000
a[0]=3.000000 b=3.000000
a[0]=4.000000 b=4.000000
.
.
.
a[0]=508.000000 b=508.000000
a[0]=509.000000 b=509.000000
a[0]=510.000000 b=510.000000
a[0]=511.000000 b=511.000000
a[0]=512.000000 b=512.000000
Now, for the array a, my understanding is that it is allocated in global memory, so the same variable is updated by every thread and the result is as expected. However, the variable b is allocated in shared memory, so its value should be shared only by the threads of one block. I therefore expected the threads of each block to increment their own copy, giving values of 'b' ranging from 1 to 256, with each value occurring twice. I don't have an NVIDIA card yet, so I am bound to use the emulator.
Is there an error in my understanding, or is this a bug in the emulator?
If someone could run the same code on a machine with an NVIDIA device and post the result, I would be grateful.
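To make the per-block expectation concrete, here is a rough sketch of a variant whose result should be well defined on real hardware. It is not the code that produced the output above. The names countPerBlock, runCount and blockCounts are made up for illustration. It assumes a device that supports integer atomicAdd on shared memory. The shared counter is set to zero by one thread, incremented atomically instead of with ++, and read only after a barrier, so each block should report its own thread count.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each block counts its own threads in shared memory.
__global__ void countPerBlock(int *blockCounts)
{
    __shared__ int count;              // one copy per block

    if (threadIdx.x == 0)
        count = 0;                     // shared memory is not initialized automatically
    __syncthreads();                   // every thread now sees count == 0

    atomicAdd(&count, 1);              // avoids the read-modify-write race of count++
    __syncthreads();                   // wait until all threads of the block have added

    if (threadIdx.x == 0)
        blockCounts[blockIdx.x] = count;   // publish this block's result
}

// Hypothetical host helper: launches the kernel and copies the counts back.
void runCount(int *hostCounts, int numBlocks, int threadsPerBlock)
{
    int *devCounts = NULL;
    cudaMalloc((void **)&devCounts, numBlocks * sizeof(int));
    countPerBlock<<<numBlocks, threadsPerBlock>>>(devCounts);
    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(hostCounts, devCounts, numBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(devCounts);
}

int main()
{
    int counts[2] = {0, 0};
    runCount(counts, 2, 256);          // same launch shape as above: 2 blocks of 256 threads
    for (int i = 0; i < 2; ++i)
        printf("block %d: count=%d\n", i, counts[i]);
    return 0;
}
```

If shared memory really is private to each block, both blocks should report 256 here rather than one running total of 512.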