I am doing some allocations with CUDA 4.0 on 64-bit Ubuntu with a GTX 470, and for unexplained reasons a cudaMalloc makes my program crash.
This is a part of my code:
// nbprot = 3 and the lengths are about 10000
HANDLE_ERROR( cudaMallocHost((void**)&results, sizeof(int3 *)*nbprot) );
HANDLE_ERROR( cudaMalloc((void**)&res_dev, sizeof(int3 *)*nbprot) );
HANDLE_ERROR( cudaMalloc((void**)&tab_prot_dev, sizeof(char *)*nbprot) );
for(i = 0; i < nbprot; i++){
    HANDLE_ERROR( cudaMalloc((void**)&res_dev[i], sizeof(int3)*leng[i]) );
    HANDLE_ERROR( cudaMalloc((void**)&tab_prot_dev[i], sizeof(char)*(leng[i]+1)) );
    HANDLE_ERROR( cudaHostAlloc((void**)&results[i], sizeof(int3)*leng[i], cudaHostAllocDefault) ); // page-locked memory
}
The error occurs on the first iteration of the loop, at the first instruction.
It started after I added the cudaMallocHost allocations, so I thought it was related, but even when I remove that part the cudaMalloc still crashes. I changed the amount of memory allocated and it still crashes.
I know that I am not using too much memory on my card, since I make no other allocations before (or after) this and my total amount of device memory allocated is about 2 MB.
The key here is to think about where your pointers are pointing. You appear to be trying to build an array of device pointers to other arrays on the device. The problem is that you allocate the top-level array on the device first, then pass addresses of elements inside that device array to a host function. As soon as cudaMalloc() tries to write the address of the newly allocated memory into &res_dev[i] in the loop, it dereferences a device pointer on the host and segfaults.
To allocate arrays like this, you have to do it backwards: collect the device addresses of the subarrays in a host array, then allocate the top-level device array and cudaMemcpy() the pointers from host to device.
Better still, don’t allocate storage this way on the device if you have any choice. Pointers to pointers are painful to set up in CUDA (as you can see) and not very good for performance. Multidimensional arrays are best represented as a single 1D allocation with precomputed strides (or offsets) for each dimension.
Both res_dev and tab_prot_dev are device pointers. They cannot be dereferenced in host code because they do not contain valid host addresses. Further, I suspect that you are not declaring them correctly either, although we don’t get to see that code. To repeat what Seibert advised: do each of the nested device allocations into a host array, then copy that array from host memory to an array of pointers allocated on the device.
No. cudaMalloc stores a device address in a host pointer. That pointer cannot be dereferenced in host code: it does not contain a valid host address, it contains an address in GPU memory.
Your repro case should be written something like this (warning not been near a compiler):
#include <cuda_runtime.h>
#include <stdlib.h>

int main(){
    int3 **res_dev, **res_host;
    int i;
    int nbprot = 3;
    int leng[3] = {123, 15, 10};

    // Gather the device addresses of the subarrays in a host array
    res_host = (int3 **)malloc(sizeof(int3 *)*nbprot);
    cudaMalloc((void**)&res_dev, sizeof(int3 *)*nbprot);
    for(i = 0; i < nbprot; i++)
        cudaMalloc((void**)&res_host[i], sizeof(int3)*leng[i]);

    // Copy the pointer array itself to the device
    cudaMemcpy(res_dev, res_host, sizeof(int3 *)*nbprot, cudaMemcpyHostToDevice);

    for(i = 0; i < nbprot; i++)
        cudaFree(res_host[i]);
    cudaFree(res_dev);
    free(res_host);
    return 0;
}
Device and host pointers are not interchangeable. Hopefully this example also shows why you should take Seibert’s very sound advice and consider abandoning jagged arrays in favour of linear memory.
At some point in the past three years (now I feel old) I posted how to create an array of arrays on the device. I can’t find it now, but needless to say it was extremely painful. Don’t do it; it also performs terribly.