cudaMalloc causes segmentation fault: 2 MB is far from my 1.2 GB card memory limit

Everything is in the title:

I am doing some allocations using CUDA 4.0, 64-bit Ubuntu and a GTX 470, and for an unexplained reason a cudaMalloc makes my program crash.

This is a part of my code:

//nbprot = 3 and the lengths are about 10000
HANDLE_ERROR( cudaMallocHost((void**)&results,sizeof(int3 *)*nbprot));
HANDLE_ERROR( cudaMalloc((void**)&res_dev,sizeof(int3 *)*nbprot));
HANDLE_ERROR( cudaMalloc((void**)&tab_prot_dev,sizeof(char *)*nbprot));

for(i=0;i<nbprot;i++){
	HANDLE_ERROR( cudaMalloc((void**)&res_dev[i],sizeof(int3)*leng[i]));
	HANDLE_ERROR( cudaMalloc((void**)&tab_prot_dev[i],sizeof(char)*(leng[i]+1)));
	HANDLE_ERROR( cudaMallocHost((void**)&results[i],sizeof(int3)*leng[i],cudaHostAllocDefault)); //page-locked memory
}

The error occurs on the first iteration of the loop, at the first instruction.

It started when I added the cudaMallocHost allocations, so I thought that was related, but even when I remove that part my cudaMalloc still crashes. I changed the size of the memory allocated and it still crashes.

I know that I am not using too much memory on my card, since I do not make any other allocation before (or after) and my total amount of device memory allocated is about 2 MB.

Maybe I have to do it another way.

The key here is to think about where your pointers are pointing. You appear to be trying to make an array of pointers to other arrays on the device. The problem is that you first allocate the top-level array on the device, then try to pass pointers to those top-level memory locations to a host function. As soon as cudaMalloc() tries to write the pointer to the newly allocated memory segment in the loop, it dereferences a device pointer on the host and segfaults.

To allocate arrays like this, you have to do it backwards. Collect the device addresses of the subarrays in a host array, then allocate the top-level device array and cudaMemcpy() the pointers from host to device.

Better still, don't allocate storage this way on the device if you have any choice. Pointers to pointers with CUDA are painful to set up (as you can see) and not very good for performance. Multidimensional arrays are best represented as a 1D array with precomputed strides for each dimension.
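
For illustration only (an untested sketch, and offsets and total are just names I made up for this post, not your code): precompute a prefix sum of the lengths on the host, make a single device allocation, and index subarray i at the base pointer plus offsets[i].

// flattened alternative: one allocation, precomputed offsets (sketch only, not compiled)
int offsets[3];               // one entry per subarray (nbprot = 3 here)
int total = 0;
for(i=0;i<nbprot;i++){
	offsets[i] = total;   // starting index of subarray i in the flat buffer
	total += leng[i];     // prefix sum of the subarray lengths
}

int3 *res_flat_dev;
cudaMalloc((void**)&res_flat_dev, sizeof(int3)*total);   // single device allocation

// element j of subarray i is res_flat_dev[offsets[i] + j]

A single cudaMemcpy can then move the whole structure between host and device in one transfer, instead of one transfer per subarray.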

No, I am not. Even if I remove the cudaMallocHost part, the program crashes (it seems weird because I have already done this a thousand times).

Can you give me a simple example?

I will do that later, maybe, but I want this way to work.

Thank you

But you are:

cudaMalloc((void**)&res_dev[i],sizeof(int3)*leng[i]);

cudaMalloc((void**)&tab_prot_dev[i],sizeof(char)*(leng[i]+1));

Both res_dev and tab_prot_dev are device pointers. They cannot be indirected in host code because they do not contain valid host addresses. Further, I suspect that you are not declaring them correctly either, although we don't get to see that code. To repeat what Seibert advised, do each of the nested device allocations into a host array, then copy that array from host memory to an array of pointers allocated on the device.

Obviously there is something that I misunderstand.

That’s why I do a cudaMalloc and not a simple malloc, right?

Here is my minimal test program:

int main(){
	int3 **res_dev;
	int i;
	int nbprot=3;
	int leng[3];

	leng[0]=123;
	leng[1]=15;
	leng[2]=10;

	cudaMalloc((void**)&res_dev,sizeof(int3 *)*nbprot);

	for(i=0;i<nbprot;i++)
		cudaMalloc((void**)&res_dev[i],sizeof(int3)*leng[i]);

	for(i=0;i<nbprot;i++)
		cudaFree(res_dev[i]);
	cudaFree(res_dev);

	return 0;
}

This is what I think I do:

-allocate a device array of pointers

-in each cell of the array, allocate an array

It seems like a simple issue, but there is something about the semantics that I don't understand here.

No. cudaMalloc stores a device address in a host pointer. That pointer cannot be indirected in host code - it does not contain a valid host address, it contains an address in GPU memory.

Your repro case should be written something like this (warning: this has not been near a compiler):

#include <stdlib.h>

int main(){
  int3 **res_dev, **res_host;
  int i;
  int nbprot=3;
  int leng[3];

  leng[0]=123;
  leng[1]=15;
  leng[2]=10;

  // host-side array that will hold the device pointers to each subarray
  res_host = (int3 **)malloc(sizeof(int3 *)*nbprot);
  // top-level array of pointers, allocated in device memory
  cudaMalloc((void**)&res_dev,sizeof(int3 *)*nbprot);

  // allocate each subarray on the device, storing its address in the host array
  for(i=0;i<nbprot;i++)
    cudaMalloc((void**)&res_host[i],sizeof(int3)*leng[i]);

  // copy the array of device pointers from the host to the device
  cudaMemcpy(res_dev, res_host, sizeof(int3 *)*nbprot, cudaMemcpyHostToDevice);

  for(i=0;i<nbprot;i++)
    cudaFree(res_host[i]);
  cudaFree(res_dev);
  free(res_host);

  return 0;
}

Device and host pointers are not interchangeable. Hopefully this example also shows why you should take Seibert’s very sound advice and consider abandoning jagged arrays in favour of linear memory.
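
To make the linear-memory alternative concrete, here is roughly what the device side looks like once everything lives in a single buffer (sketch only, not compiled; the names are illustrative, and the offsets and lengths arrays would also need to be copied into device memory):

// one block per subarray of the flat buffer; offsets/leng are small device arrays (illustrative)
__global__ void process(int3 *flat, const int *offsets, const int *leng)
{
	int prot = blockIdx.x;              // which subarray this block works on
	int3 *sub = flat + offsets[prot];   // start of that subarray in the single allocation
	for(int j = threadIdx.x; j < leng[prot]; j += blockDim.x){
		sub[j].x = j;               // placeholder work on element j
		sub[j].y = 0;
		sub[j].z = 0;
	}
}

With the jagged layout, every res_dev[prot][j] access costs an extra pointer load from global memory before the data itself can be fetched; with the flat layout the indexing is plain arithmetic in registers.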

At some point in the past three years (now I feel old), I posted how to create an array of arrays on the device. I can't find it now, but needless to say it was extremely painful. Don't do it, because it also performs terribly.

Okay! That is a big revelation for me :D!

I couldn't find it either, but I will not do that.

Thank you all for your time and advice.