First cudaMalloc in program failing


I’m currently attempting to make a ray tracer parallel using CUDA, and I can’t figure out why the first cudaMalloc I use in the program is failing.

In my program, main() accepts all of the input appropriate for loading the scene data, and then a load_scene() function is called. (load_scene() is a host function, by the way.) In this function, the first cudaMalloc of the program is used to allocate memory for an array of sphere/object structs which will be present in the scene. The array of objects, called obj_listdev, is initialized globally, outside of any function. I initialized it in the following manner:

device struct sphere *obj_listdev = 0;

Main() then calls load_scene(), which begins with the following lines of code:

void load_scene(FILE *fp) {
char line[256], *ptr, type;

if (cudaSuccess != cudaMalloc((void**)&obj_listdev, (sizeof (struct sphere)))) {
    printf("cudaMalloc failed: obj_listdev");
obj_listdev->next = 0;

Could anyone give me a hint as to why this cudaMalloc is failing?


The first argument to cudaMalloc() needs to be a pointer (to a device pointer) on the host, not on the device. You can then use this pointer as an argument to a kernel, or copy it to the device.

Further, you cannot dereference the device pointer on the host, so [font="'Courier New"]obj_listdev->next = 0[/font][font=“Arial”] won’t work either. You’d either need to use cudaMemset() on an address that you calculate from the pointer you obtained previously from [/font]cudaMalloc() and the offset within the struct, or run a kernel on the device to do this.

Most of the time it is not a good idea to use complex dynamic data structures like linked lists on the device. Usually its both simpler and faster to flatten them on the host and then just operate on flat arrays on the device. This avoids much of the host pointer/device pointer hassles, and gives much better coalesced memory access patterns.

Thanks for the response! I understand you are telling me it is not a good idea to use linked lists on the device, but is it also not a good idea to have arrays of structures? Should I flatten out the data contained in my structures, in addition to the contents of the linked list?

Usually it is better to have structures of arrays (or just multiple arrays) than arrays of structures, because it is then possible to coalesce array accesses.
The exception is if you can load the whole structure into registers or shared memory with a single instruction.

On compute capability 2.x devices this is less important due to global memory being cached. The cache isn’t large however, so (depending on the size of the structure and the occupancy of the SM) a structure of arrays might still be preferable.