You cannot do that. By the time the kernel fun_tmpp has ended, the device variable tmpp is not there anymore. If you need permanent device memory, allocate it with cudaMalloc, which writes the device memory pointer into a host variable whose address you pass in. Then cudaMemcpy to it.
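For reference, a minimal sketch of the cudaMalloc-then-cudaMemcpy pattern described above. The kernel name scale and the buffer names are made up for illustration and are not from the original code; error handling is kept to the bare minimum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element in place.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 4;
    float h[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    float *d = NULL;  // host variable that receives the device pointer

    // cudaMalloc writes the device address into d through the host pointer &d.
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // The buffer lives until cudaFree, so the second launch
    // operates on the results of the first.
    scale<<<1, n>>>(d, n);
    scale<<<1, n>>>(d, n);

    // This blocking copy also waits for the kernels above to finish.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%g\n", h[i]);

    cudaFree(d);
    return 0;
}
```

The device pointer d is passed to each kernel as an ordinary argument, so nothing has to be stashed in a device-side global to carry data from one launch to the next.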
setfun_tmpp overwrites the device memory pointer you got from cudaMalloc with the (uninitialized) pointer value of the device variable tmpp. So the following kernel can only restore a garbage pointer value, and you leak the cudaMalloc’ed memory.
@wumpus, yeah, the cleanup removes all used variables. You cannot use a global var declaration to transport data between kernels. You need to use cudaMalloc to get persistent storage.
The trick is: You can’t use & to get the address of a device variable.
nvcc compiles a device variable declaration into a device variable and a dummy host variable of the same type. Whenever you refer to it in host code, you refer to the dummy host variable (whose sole purpose may be to give the correct sizeof). That’s why we have to use cudaMemcpyToSymbol: the true device address has to be looked up from the dummy host variable’s address; a plain & does not give it to you.
You can get the correct pointer via cudaGetSymbolAddress(), or just convert to driver API like I did.
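A small runtime-API sketch of the cudaGetSymbolAddress() route, assuming a reasonably recent toolkit where the symbol is passed directly rather than as a string. The names counter and bump are hypothetical, chosen only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int counter;  // hypothetical global device variable

// Each thread adds 1 to the global counter.
__global__ void bump()
{
    atomicAdd(&counter, 1);
}

int main()
{
    // Initialize through the symbol; &counter in host code would only
    // give the address of the host-side shadow variable.
    int zero = 0;
    cudaMemcpyToSymbol(counter, &zero, sizeof(int));

    // Ask the runtime for the real device address of the symbol.
    int *d_counter = NULL;
    if (cudaGetSymbolAddress((void **)&d_counter, counter) != cudaSuccess)
        return 1;

    bump<<<1, 32>>>();
    bump<<<1, 32>>>();

    // d_counter is an ordinary device pointer, so plain cudaMemcpy works.
    int h = -1;
    cudaMemcpy(&h, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("counter = %d\n", h);
    return 0;
}
```

On current toolkits the two launches accumulate into the same counter, which is one quick way to check for yourself whether a read/write device variable keeps its value between kernels.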
So you are saying that global device variables stay around after the kernel invocation? Last time I checked this, it didn’t work (that was CUDA 0.8 though). I only use it for constant stuff. Maybe it works now with r/w variables. asadafag, can you confirm that?
I tried your example code, but strangely the value printed by printf is sometimes 0. Also, I think we should not print the address directly: the pointer p is a device pointer, so shouldn’t we copy it to the host first and then printf it? Am I right?
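Without seeing the exact code, here is a sketch of what "copy it to the host first" could look like, assuming p is a __device__ pointer variable. The names p, d_buf and write_through_p are illustrative, not taken from the example being discussed. It also checks the launch result, since a kernel that failed silently is one possible reason for reading back a 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int *p;  // pointer variable that itself lives in device memory

// Writes a value through the device-resident pointer.
__global__ void write_through_p()
{
    *p = 42;
}

int main()
{
    int *d_buf = NULL;
    if (cudaMalloc((void **)&d_buf, sizeof(int)) != cudaSuccess) return 1;

    // Store the allocated address into the device variable p.
    cudaMemcpyToSymbol(p, &d_buf, sizeof(int *));

    write_through_p<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();  // surfaces launch/runtime errors
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy the pointer value held in p back to the host ...
    int *p_on_host = NULL;
    cudaMemcpyFromSymbol(&p_on_host, p, sizeof(int *));

    // ... then copy the int it points to. The host must not dereference it.
    int value = 0;
    cudaMemcpy(&value, p_on_host, sizeof(int), cudaMemcpyDeviceToHost);

    printf("p = %p, *p = %d\n", (void *)p_on_host, value);
    cudaFree(d_buf);
    return 0;
}
```

Printing the address itself is harmless once it has been copied into a host variable; what the host cannot do is dereference it or read p directly by name.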