Global memory double pointer problem


I’m developing certain very big program using CUDA.

My program requires a lot of memories (for matrix computations). So I’ve allocated global memories for them and passed their addresses into kernels.

However, the number of parameters are many, so I grouped them into a struct.

Part of my code is below:

struct GroupParameter {

   float *a;

   float *b;




main() {


   GroupParameter *param;

   cudaMalloc((void**)&param, sizeof(GroupParameter));

   cudaMalloc((void**)&param->a, sizeof(float) * 100);			 // (1)

   cudaMalloc((void**)&param->b, sizeof(float) * 100);

   mykernel<<<dimGrid, dimBlock>>> (param);



__global__ mykernel (GroupParameter *param)


	//(param->a) == 0		 [my problem here]				  (2)


When I printed “param->a” right after the allocation in host, I might get proper address of .

However, I always get 0 when I somehow print the value of “param->a” in myKernel (I stored the value in another parameter to access in host, then I printed it in host).

What is the problem?

There are two suspicious things:

  1. “param->a” in host function (1) is possible? Because “param” points to certain global memory. As far as I know, the host function cannot access global memory.

  2. “param->a” in kernel (2) is possible? Because, I heard that although I allocated global memories to the “param”, kernels’ parameters are stored in shared memory.

So, does anybody know how I can access the global memories ( *a and *b ) properly through *(param->a), *(param->b) in kernel?

Well, the first argument to cudaMalloc is supposed to be a pointer stored on the host, but you’re treating it as if it’s stored on the device - where’s the cudaMemcpy to copy the location returned by the cudaMalloc call for param->a into the param structure on the device? Alternatively, this is a good example of why you don’t want to go pointer-chasing on the GPU…

Yes, the problem is that your reference param->a points to some random memory on the host and not on device as you would expect. Your program should crash at that point ;)

I see three options for you.

    use GroupParameter, not a pointer to it. It will be allocated on host and passed to mykernel through shared memory.

__global__ mykernel (GroupParameter param)

However, that will reduce the size of your shared memory, and it may be even impossible if GroupParameter is too big!

Make a host and device copy of GroupParameter, as follows:

GroupParameter hostParam;

GroupParameter *gpuParam;

cudaMalloc((void**)&hostParam->a, sizeof(float) * 100);			 

cudaMalloc((void**)&hostParam->b, sizeof(float) * 100);

cudaMalloc((void**)&gpuParam, sizeof(GroupParameter));


mykernel<<<dimGrid, dimBlock>>> (gpuParam);

However, each time you access the arrays, you will be loading from device memory. That will slow your program.

Use constant memory for your GroupParameter:

__constant__ GroupParameter gpuParam;


GroupParameter hostParam;

cudaMalloc((void**)&hostParam->a, sizeof(float) * 100);			 

cudaMalloc((void**)&hostParam->b, sizeof(float) * 100);

cudaMalloc((void**)&gpuParam, sizeof(GroupParameter));


mykernel<<<dimGrid, dimBlock>>> ();

This way, after first reference to your arrays, gpuParam will be cached. I believe this is fastest way to do it without using shared memory. Note, however, if you use this metod you cannot swap pointers later in the kernel.

With this option you will also get lots of ‘Advisory: cannot tell where the pointer is pointing to, assuming global’. Just ignore those, because you are pointing to global (and not shared), but this can be annoying…


I will use the second option, because unfortunately I have to change pointers inside myKernel.

p.s. In your second suggestion, I think “hostParam->a” should be “hostParam.a”.

I’m ready to suffer the performance penalty.

Be warned: swapping the pointers, if you plan to launch several blocks in a grid may be a dangereous operation, because you cannot synchronize between the blocks! Imagine the following situation:

__global__ void myKernel(GroupParameter *params) {

__shared__ float sh[256];





it may happen that block 1 swaps params->a and params->b, while block 2 is reading data to sh array. It may even happen that, for example, 64 floats are read from params->a, while remaining 192 are read from params->b! So be very careful about that!

If your big algorithm works in a pattern:

  • read from a

  • do computation

  • store to b

  • swap a and b

I would suggest calling the kernel for each iteration instead.