I’m writing a piece of code where it was convenient to use shared memory in a += manner - that is, accumulating into smem without assigning any values to it before reading from it.
What I discovered was that the smem was not cleared between blocks. Since I was doing it in a += manner, the values computed in past blocks were carried over and kept piling up.
Maybe it’s not extremely strange that the smem wasn’t cleared between blocks (even though they should have been run on different SMs!!!). Another shock came when I did several consecutive runs: the values left from test run 1 were still on the GPU, piling up in test run number 2!! The smem was never cleared between separate runs!
This led me to restart the computer, hoping that at least then the GPU would clear the smem. But no, the values were still there in smem (probably need to leave the PC off for a while).
Is it supposed to work like this? I find it disturbing! :wacko:
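To make the pattern concrete, here is a minimal sketch of what my code roughly does (names are illustrative, not my actual code):

```cuda
// Sketch of the += pattern: shared memory is read (by +=) before it is
// ever written in this kernel, i.e. uninitialised use.
__global__ void accumulate(float *out)
{
    __shared__ float sum;          // never zeroed before use
    if (threadIdx.x == 0) {
        sum += 1.0f;               // reads whatever was left in smem
        out[blockIdx.x] = sum;     // may carry values from earlier blocks or runs
    }
}
```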
What’s disturbing about that? AFAICT, you didn’t explicitly initialise to zero, so this is expected behaviour. Why should CUDA behave any differently from C/C++, Fortran, or a host of other languages? Indeed since “C for CUDA” is explicitly an extension to C, behaving in any other way would be horribly inconsistent.
Reading uninitialized shared memory is undefined behaviour, which means it’s perfectly acceptable that you’d find any kind of garbage there - including remains of previous kernels. Zeroing smem would have a cost, and would likely increase kernel launch overhead.
Well, if it was a ‘warm’ restart on a GPU which wasn’t used for the display, it’s not completely surprising - the GPU probably never powered down, or performed any other tasks…
Thinking about it, that’s probably correct - after all, the Fortran standard states that new variables are created with undefined values, but most compilers I’ve seen will quietly do a zero initialisation unless specifically instructed (this can be a delightful source of bugs, particularly when people don’t believe in IMPLICIT NONE). However, such behaviour for CUDA would be a bit odd - within the same source file, you’d have host functions compiled with gcc without zero initialisation and device functions which would do zero initialisation.
That’s not the case at all. When I run that program (with default compiler options) I seem to get a value of 1 each time (on my computer). Throw in ‘-O -Wall’ and gcc complains. This is nothing to do with ‘variables not being freed outside of their context;’ it’s to do with a fresh set of variables happening to reuse the same batch of transistors.
OK, well if that was my shared memory example, you would not get 1 every time you ran the program. For example:
Run #1, you would get 1.
Run #2, you would get 2.
Run #3, you would get 3.
I’m not entirely sure you guys have understood my question here.
I think that you think ( :D ) that I’m surprised I’m not getting the same random garbage every time; that is not the case. The issue is that I’m getting the same value that my variable was set to at the end of the kernel, even if I rerun the application, or restart my computer. Clear? :)
Perfectly clear - what appears to be unclear here is the meaning of ‘undefined’ and its implications for computer programs.
As is very common in most languages, CUDA does not define what values variables hold at their declaration. It is undefined, which basically means that it can be anything the compiler feels like storing, or (in this case) not storing anything, and simply leaving the memory set to whatever its last value happened to be. None of us are remotely surprised that the ‘random garbage’ you get back happens to be the last value stored in the variable from a previous call. Indeed, you should be grateful that the compiler did not decide to do something less pleasant. That the value remains the same across a restart is mildly surprising, but I even gave some suggestions as to why that might be the case. In the example I gave, gcc on my machine happened to decide that ‘undefined’ variables would be zero. That is, the particular implementation decided to make this definition - hence I was called out for suggesting that zero initialisation would be ‘horribly inconsistent’, since when behaviour is undefined by the standard, implementations can do whatever they like, not tell you what they do, and change what they do as often as they please (as a side note, if behaviour is ‘implementation dependent’, then implementations are at least required to document what they do).
What puzzles me is that you are surprised by the results you are getting. If you expect that some benevolent higher being (OS, compiler) is clearing all your memory prior to running your program/kernel, you are in the wrong boat. It is DIY: if you want your memory cleared (i.e. set to zero), do it first thing, before doing anything else. You might not want += semantics (“please, compiler, set val=0 before I start”) but *= semantics, in which case you would rather have your memory “cleared” to 1 - calculating n! by recursion with a start value of 0 isn’t very exciting. How is the compiler supposed to know which one you want?
If the CUDA startup code had to clear all memory prior to kernel launch, that would increase startup time. Why clear shared memory you might not use at all, and why clear hundreds of MB of global memory if you might only use a few of them?
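If you do want smem zeroed, the usual idiom is to clear it cooperatively at the top of the kernel yourself - a sketch (buffer size and names are illustrative):

```cuda
__global__ void kernel(float *out)
{
    __shared__ float buf[256];     // illustrative size

    // Each thread zeroes a strided share of the buffer ...
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        buf[i] = 0.0f;
    __syncthreads();               // ... and waits until all of it is zeroed

    // From here on, += accumulation into buf is well defined.
}
```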
set to at the end of the kernel, even if rerun the application,
That is not surprising at all, since you did “initialize” it beforehand - with your previous kernel.
or restart my computer. Clear ? :)
Hard restart (power off) or soft reboot? After a hard restart I would expect DRAM to possibly change value (or maybe not, I am not an EE), but SRAM would just keep its old value between reboots. On the other hand, the bits might just remain in their last position between reboots, which would explain the behaviour you are observing.
You should drop your expectation that your memory is in a defined state prior to running code - any code. On the other hand, if smem=0 before the first kernel and smem=1 after the first kernel, then smem is in a defined state before the second kernel - namely smem=1. Why should it suddenly be =0 again? Usually you run many kernels after initializing CUDA (which will probably not zero smem and global mem), and this is, as a matter of fact, a feature:
send data to GPU, run kernel 1
run kernel 2 with the previously sent data or data produced by kernel 1
If memory were reset between kernels, there would be an awful lot of unnecessary traffic on the PCI bus, because you might have to resend data or save data that you want to use in another kernel. And you would never be able to split an algorithm into several kernels, where kernel 2 relies on results from kernel 1, if memory were “initialized” between kernels.
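As a sketch of that feature (illustrative names, error checking omitted): global memory written by one kernel is simply read by the next, with no host round trip in between.

```cuda
__global__ void stage1(float *data) { data[threadIdx.x] = (float)threadIdx.x; }
__global__ void stage2(float *data) { data[threadIdx.x] *= 2.0f; } // uses stage1's results

// Host side:
float *d_data;
cudaMalloc(&d_data, 256 * sizeof(float));
stage1<<<1, 256>>>(d_data);
stage2<<<1, 256>>>(d_data);    // no copy back to the host between the launches
```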
Yes it does - the behaviour is undefined in C, so the behaviour you’ve described could happen in C. I doubt it would on any modern OS with virtual memory - you’d probably have to go to an embedded system, and even then you’d need a specific compiler and hardware combination to see it.
As I stated, this is not possible - the behaviour is undefined, so you don’t know what’s going to happen. You’ve found behaviour which happens to occur for the current generation of GPUs with the current version of nvcc. This could change with the next minor, utterly insignificant bug fix to nvcc from NVIDIA. Or it might not.
Put simply, the same thing is happening on both the CPU and GPU, but ‘same’ in this context does not mean ‘initialise to zero’ or ‘reuse last value.’ In this context, the CPU and GPU doing the same thing means do whatever you want.
Yes, I can definitely live with that explanation; it’s the only one that makes sense.
Well, since I’ve never seen this behaviour in CPU code, it surprises me.
Of course not, this is memory that you have allocated.
There is no previous kernel.
Yes, I do this with things that I have placed in global memory. I did not think that it was possible with on-chip memory (except for constant memory, which has a different scope), and it is not an obvious feature. If it is mentioned in the PG, please refer me to it.
This only makes sense to me when dealing with global memory. You are implying that I can write code and run, for example, these two kernels:
__global__ static void kernel_1()
{
    __shared__ float a;
    a = 1;
}

__global__ static void kernel_2()
{
    __shared__ float a;
    a += 1; // producing the result of 2
}
And get the result of 2. This is, to me, very surprising behaviour.
What also happens is that you may then access the same shared memory in different blocks. I can live with that if the blocks are both executed on the same SM. But when I was running a simple example with only 2 blocks, block #2 had access to the shared memory which had been worked on in block #1. Shouldn’t these blocks have been placed on different SMs?
My initial surprise came when I ran my program once and got the correct result. Ten minutes later I came back and ran it again, yielding a different result. Obviously the memory stored in smem was never cleared between the 2 runs; I can’t agree that this is extremely obvious.
Perhaps you should consider this to have expanded your experience.
Of course it isn’t in the programming guide, beyond the statement that reading from uninitialised variables results in undefined behaviour (if they ever bother to make such an obvious statement).
Getting the result of ‘2’ is mildly surprising, and has taught you that shared memory is not zeroed when a block starts. You should be relieved that writing a program like that didn’t cause nvcc to melt your GPU to silicon slag, on the grounds that anyone writing such dangerous code ought not to be trusted with such a powerful device. That would fall within the acceptable range of ‘undefined’ behaviour too.
Why should the two blocks have run on different SMs? The programming guide explicitly states that you should not rely on any scheduling order or configuration.
Why is it ‘obvious’ that it should have been cleared?