Shared Memory modelling in Device Emulation Mode! Understanding DeviceEmu realities

I wrote a small program like this in device-emulation mode.

"
__device__ __shared__ int num;

__global__ void binomial_kernel(void)
{
    num++;
    printf("GPU Hello World! Num = %d\n", num);
}
"

And invoked it as

"
grid.x = 2;
grid.y = 2;
block.x = 16;

binomial_kernel <<< grid, block, 16 >>>();
"

When I run the app, I see that the numbers from 1 to 64 are printed.

Now, shared memory is common/shared only within a BLOCK, am I right?
How is it that all the blocks are able to see the same variable in device emulation mode?

I would expect prints of 1 to 16, 4 times.

Is my understanding of shared memory wrong?

OK, if multiple blocks are scheduled on the same multiprocessor, do they share the shared-memory variables? I would NOT think so.

Thanks for any inputs.

No one has replied even after 30 views.

Can someone from NVIDIA please comment on this?

This just looks like shared memory is NOT emulated well. Maybe this should find a mention in the manual.

In practice, people put the shared variables inside the function declaration, and then each block definitely should see different memory locations. I actually don’t know what the meaning of a shared variable defined in the global scope would be, since there would be no way for all blocks on the real hardware to see the same shared memory. This may be a bug in the emulation, or the compiler for not throwing an error.
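To illustrate the point above, here is a minimal sketch (not the original poster's code) of the usual form, with the shared variable declared inside the kernel body, where its per-block scope is explicit:

```cuda
// Sketch: shared memory declared inside the kernel body, the usual form.
// On real hardware each block gets its own copy; a file-scope __shared__
// declaration has no meaningful cross-block semantics.
__global__ void binomial_kernel(void)
{
    __shared__ int num;   // one copy per block, visible to all its threads
    num++;                // same racy increment as the code under discussion
}
```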

The same code works well on the GPU.

Now that you mention it, I too wonder what the scope of a __shared__ variable declared at global scope would be… Probably it can be accessed by multiple kernels!

What’s the problem with emulation? Shared memory isn’t initialized; it’s the programmer’s responsibility to initialize it.

This is odd code. All threads in a block increment the same variable, and I don’t see where this could be useful.

Shared memory variables canNOT have initializers. See page 32 of 125, NVIDIA CUDA Programming Guide 1.0.

The problem with the emulation is that all the BLOCKS are seeing the same variable. This contradicts the condition that “Shared” variables are shared only among the threads of a block.

You may see this code as “odd”. But executing this code on a GPU (with the printf replaced by a store of an integer in global memory reserved for the block) helped me understand many things about how “warps” execute within an MP, how blocks are scheduled within an MP, and so on.
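The experiment described above might be sketched like this (the parameter name `out` and the indexing scheme are illustrative assumptions, not the poster's actual code):

```cuda
// Sketch: instead of printf, each thread stores the counter value it
// observed into a global-memory slot reserved for its block and thread,
// so the per-block sequences can be inspected on the host afterwards.
__global__ void binomial_kernel(int *out)
{
    __shared__ int num;

    if (threadIdx.x == 0)
        num = 0;
    __syncthreads();

    num++;  // racy between threads, as in the original; the interleaving
            // that results is exactly what the experiment observes

    // One slot per thread, grouped by block.
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    out[block * blockDim.x + threadIdx.x] = num;
}
```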

By initializing I mean assignment of initial values at runtime.

Obviously, your code works in emulation because:

  1. num is initialized to 0.

  2. The value of num is preserved between different blocks.

This may not be true on the GPU, and this is why you get different results.

You can write something like

if( threadIdx.x == 0 ) num = 0;

at the beginning of your kernel.

In general, you should not make any assumptions about the value of a shared variable if you haven’t explicitly assigned a value to it.
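A sketch of that initialization pattern, with one detail worth adding: a __syncthreads() barrier after the assignment, so no thread reads the variable before thread 0 has written it (assuming the shared variable is declared inside the kernel):

```cuda
__global__ void binomial_kernel(void)
{
    __shared__ int num;

    // Thread 0 initializes the shared variable...
    if (threadIdx.x == 0)
        num = 0;

    // ...and the barrier ensures every thread in the block sees the
    // initialized value before using it; without it, threads may read
    // whatever garbage was left in shared memory.
    __syncthreads();

    num++;
}
```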

Yes, I understood your point after posting my comments. Agreed.

My code did NOT work as expected. That is why I am posting this question.

The value of “num” is NOT merely preserved between different blocks; code executing in different blocks is updating the same variable in device emulation mode. This is the “emulation bug” we are talking about.

Right. It was different. But I was able to account for the GPU behaviour. In the GPU case too, the memory was 0 at run-time – fortunately.

This was a new learning for me today. Thank you.

Andrei,

You were right about the “initialization” part. If I include this, the device emulation works correctly.

So this also shows that device emulation internally uses the same area of memory to implement the “shared memory” of all the blocks in a program. It also means that device emulation executes the various blocks of a kernel sequentially.
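Putting the pieces of this thread together, a sketch of the corrected program (keeping the file-scope __shared__ declaration from the original post; note that printf from device code only works in device-emulation mode in CUDA 1.0):

```cuda
#include <stdio.h>

__shared__ int num;   // file-scope, as in the original post

__global__ void binomial_kernel(void)
{
    // The fix discussed above: explicitly initialize the shared variable,
    // then barrier so every thread sees the initialized value.
    if (threadIdx.x == 0)
        num = 0;
    __syncthreads();

    num++;   // still racy between threads of a block, as in the original

    // Works only under -deviceemu in the CUDA 1.0 toolchain.
    printf("GPU Hello World! Num = %d\n", num);
}

int main(void)
{
    dim3 grid, block;
    grid.x = 2;
    grid.y = 2;
    block.x = 16;

    // Execution configuration order is <<< grid, block, sharedMemBytes >>>.
    binomial_kernel<<< grid, block, 16 >>>();
    cudaThreadSynchronize();   // CUDA 1.x-era synchronization call
    return 0;
}
```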

Thanks for your inputs.

Best Regards,
Sarnath