How can Iget the pointer to the device memory var

How can I get the pointer to the device memory variable from the host function?

If I trace the following code, it works correctly, but if I start exe file (EmulDebug), I’ve got an error. Why?

__device__ unsigned int tmpp;

__global__ void fun_tmpp(unsigned int** tmp)

{

    int bx = blockIdx.x;

   int tx = threadIdx.x;

   if(bx==0 && tx==0)

    {

        *tmp = &tmpp;

    }

     __syncthreads();

};

extern "C" void runTest()

{

    unsigned int* tm;

   unsigned int tm1 = 90;

   fun_tmpp<<<1,1>>>(&tm);

//!!!BUG

    CUDA_SAFE_CALL(cudaMemcpy(tm, &tm1, sizeof(unsigned int),

                              cudaMemcpyHostToDevice) );

}

Thanks in advance.

You cannot do that. By the time the kernel fun_tmpp has ended, the device variable tmpp is not there anymore. If you need permanent device memory, allocate it with cudaMalloc which writes the device mem pointer to a host variable you provide via a host mem pointer. Then memcpy to it.

Peter

That’s quite a weird way to get the address of a device variable, but I don’t see how it neccesarily wouldn’t work.

AFAIK, you generally use ‘cudaMemCpyToSymbol’ for this.

Prkipfer: He declares the device globally, not locally, so it should remain existant between kernel calls, right?

Quite interesting, if you do this, you get an relocation entry in the cubin file

reloc  {

        name = xx

        segname = reloc

        segnum = 14

        offset = 0

        bytes = 4

}

This is how it gives the symbol a (fixed) place in global memory and communicates the location to the kernel.

Ok. So in my example address of “device unsigned int tmpp” is valid only for device, global functions, isn’t it?

But here I allocate memory with cudaMalloc():

__device__ unsigned int* tmpp;

__global__ void fun_tmpp(unsigned int** tmp)

{

    int bx = blockIdx.x;

   int tx = threadIdx.x;

if(bx==0 && tx==0)

{

        *tmp = tmpp;

}

     __syncthreads();

};

__global__ void setfun_tmpp(unsigned int** tmp)

{

    int bx = blockIdx.x;

   int tx = threadIdx.x;

if(bx==0 && tx==0)

{

        tmpp = *tmp;

}

     __syncthreads();

};

extern "C" void runTest()

{

    unsigned int* tm;

    unsigned int tm1 = 90;

   CUDA_SAFE_CALL(cudaMalloc((void**)&tm, sizeof(unsigned int)));

   setfun_tmpp<<<1,1>>>(&tm);

   fun_tmpp<<<1,1>>>(&tm);

   CUDA_SAFE_CALL(cudaMemcpy(tm, &tm1, sizeof(unsigned int),

                              cudaMemcpyHostToDevice) );

};

And it slill doesn’t work. :(

The setfun_tmpp overwrites the device mem pointer you got from cudaMalloc with the (uninitialized) pointer value of device tmpp. So the following kernel can only restore a crap pointer value and you have a memory leak of the cudaMalloc’ed mem.

@wumpus, yeah, the cleanup removes all used variables. You cannot use a global var declaration to transport data between kernels. You need to use cudaMalloc to get persistent storage.

Peter

The trick is: You can’t use & to get the address of a device variable.
nvcc compiles a device memory into a device variable and a dummy host variable of the same type. Whenever you refer to it in host code, you refer to the dummy host variable (whose sole purpose may be getting correct sizeof). That’s why we have to use cudaMemCpyToSymbol: the true address has to be deduced from the dummy host variable’s address, not just a &.
You can get the correct pointer via cudaGetSymbolAddress(), or just convert to driver API like I did.

Thanks, it works!

So you are saying that global device variables stay around after the kernel invocation? Last time I checked this, it didn’t work (that was CUDA 0.8 though). I only use it for constant stuff. Maybe it works now with r/w variables. asadafag, can you confirm that?

Peter

I used device var once in my kernel and it worked (in 1.0).

Just verified it.

#include <stdio.h>

__device__ int aaaa;

int main(){

	void *p=NULL;

	cudaGetSymbolAddress(&p,aaaa);

	printf("%p\n",p);

	return 0;

}

I tried your example code, but it’s strange that, sometimes the value of printf is 0. and I think we should not print the address directly, because the pointer p is a device pointer, and we should first copy it to host and then printf it? am I right?