I wrote a small program like this in device-emulation mode.
"
__device__ __shared__ int num;
__global__
void binomial_kernel(void)
{
    num++;
    printf("GPU Hello World! Num = %d\n", num);
}
"
And invoked it as
"
grid.x = 2;
grid.y = 2;
block.x = 16;
// launch order is <<< grid, block, dynamic shared memory bytes >>>
binomial_kernel<<< grid, block, 16 >>>();
"
When I run the app, I see that the numbers from 1 to 64 are printed.
Now, shared memory is shared only among the threads of a single block, am I right?
How is it that all the blocks are able to see the same variable in device emulation mode?
I would expect 1 to 16 to be printed 4 times.
Is my understanding of shared memory wrong?
OK, if multiple blocks are scheduled on the same multiprocessor, do they share the shared-memory variables? I would NOT think so.
In practice, people declare the shared variables inside the kernel function body, and then each block definitely should see different memory locations. I actually don’t know what the meaning of a shared variable defined in the global scope would be, since there would be no way for all blocks on the real hardware to see the same shared memory. This may be a bug in the emulation, or in the compiler for not raising an error.
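To make the function-scope suggestion concrete, here is a minimal sketch of a kernel where every block counts in its own copy of `num`. The names `count_in_block`, `run_count_in_block` and `seen` are made up for illustration and are not from the original program. It also assumes hardware that supports `atomicAdd` on shared memory (compute capability 1.2 or later, so not the first CUDA 1.0 GPUs), and it skips error checking to stay short.

```cuda
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block counts its own threads in its own `num`.
__global__ void count_in_block(int *seen)
{
    __shared__ int num;               // function scope: one copy per block

    if (threadIdx.x == 0) num = 0;    // shared variables cannot have an initializer
    __syncthreads();                  // make the zero visible before anyone counts

    // atomicAdd returns the old value, so +1 is this thread's position in the count.
    int mine = atomicAdd(&num, 1) + 1;

    // Store the value in a global-memory slot reserved for this thread.
    int block_id = blockIdx.y * gridDim.x + blockIdx.x;
    seen[block_id * blockDim.x + threadIdx.x] = mine;
}

// Hypothetical host helper: launches the kernel and returns every thread's value.
std::vector<int> run_count_in_block(dim3 grid, dim3 block)
{
    int n = grid.x * grid.y * block.x;
    std::vector<int> host(n, -1);
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    count_in_block<<<grid, block>>>(dev);   // note the order: grid first, then block
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}

int main()
{
    dim3 grid(2, 2), block(16);
    std::vector<int> seen = run_count_in_block(grid, block);
    assert(seen.size() == 64);

    // Threads of a block may count in any order, so sort each block's slice.
    for (int b = 0; b < 4; ++b) {
        std::sort(seen.begin() + b * 16, seen.begin() + (b + 1) * 16);
        printf("block %d: %d .. %d\n", b, seen[b * 16], seen[b * 16 + 15]);
    }
    return 0;
}
```

If each block really has its own `num`, every block should report the range 1 to 16, which is the "1 to 16, 4 times" behaviour the question expected. The atomic is there because a plain `num++` from 16 threads at once is a race and need not count to 16.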
Shared memory variables canNOT be initialized as part of their declaration (p. 32 of 125, NVIDIA CUDA Programming Guide 1.0).
The problem with the emulation is that all the BLOCKS are seeing the same variable. This contradicts the condition that “Shared” variables are shared only among the threads of a block.
You may see this code as “odd”. But executing this code on a GPU (with the printf replaced by a store of an integer into global memory reserved for the block) helped me understand many things about how “warps” execute within an MP, how blocks are scheduled on an MP, and so on.
Yes, I understood your point after posting my comments. Agreed.
My code did NOT work as expected. That is why I am posting this question.
The value of “num” is NOT preserved between different blocks. Code executing in different blocks is updating the same variable in device emulation mode. This is the “emulation bug” that we are talking about.
Right. It was different. But I was able to account for the GPU behaviour. In the GPU case too, the memory was 0 at run-time – fortunately.
You were right about the “initialization” part. If I include this, the device emulation works correctly.
So this also suggests that device emulation internally uses the same area of memory to implement the shared memory of all the blocks in a program. It also suggests that device emulation executes the blocks of a kernel one after another, sequentially.