Advantage of using cudaMalloc3D

How is it advantageous to use cudaMalloc3D for voxel operations?

I have written a simple (amateur) code for averaging the surrounding voxel values. In one version I used ‘cudaMalloc’ and in the other ‘cudaMalloc3D’. I did not find a significant advantage in using the latter.

Can somebody explain to me how using cudaMalloc3D could be advantageous for voxel operations? Also, please give me examples of cases where it should be used to avoid memory transfer latency.

Awaiting a reply…!!

This function pads the allocation so that each row (and each slice) starts at an aligned address; the returned pitch is the padded row width in bytes.

Since devices of compute capability 1.3 or higher have fewer problems accessing memory segments with unaligned starting addresses than older devices do (the coalescing rules were relaxed), you will mostly feel the speedup on old devices.

Well, I believe that, but who knows…

Thanks for the quick reply.

Alignment is fine. Does it give me any increase in performance? Or is it best to use shared memory, i.e. copy the voxel sub-volume to shared memory and then compute?
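For the shared-memory route, the usual pattern is tiling: each block loads its sub-volume plus a one-voxel halo into shared memory, then averages from there, so each global-memory voxel is read roughly once instead of up to seven times. A hedged CUDA sketch (the kernel name, tile size, and the 6-point stencil are my assumptions, not your code):

```
#define TILE 8  // block edge length; a guess, tune for your device

// Hypothetical 6-neighbour averaging kernel with a shared-memory tile.
__global__ void avg6(const float *in, float *out, int nx, int ny, int nz)
{
    __shared__ float s[TILE + 2][TILE + 2][TILE + 2];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    int z = blockIdx.z * TILE + threadIdx.z;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1, lz = threadIdx.z + 1;
    bool inside = x < nx && y < ny && z < nz;
    size_t idx = ((size_t)z * ny + y) * nx + x;

    if (inside) {
        s[lz][ly][lx] = in[idx];
        // Halo: threads on a tile face also fetch the neighbouring voxel.
        if (threadIdx.x == 0        && x > 0)      s[lz][ly][lx-1] = in[idx - 1];
        if (threadIdx.x == TILE - 1 && x < nx - 1) s[lz][ly][lx+1] = in[idx + 1];
        if (threadIdx.y == 0        && y > 0)      s[lz][ly-1][lx] = in[idx - nx];
        if (threadIdx.y == TILE - 1 && y < ny - 1) s[lz][ly+1][lx] = in[idx + nx];
        if (threadIdx.z == 0        && z > 0)      s[lz-1][ly][lx] = in[idx - (size_t)nx * ny];
        if (threadIdx.z == TILE - 1 && z < nz - 1) s[lz+1][ly][lx] = in[idx + (size_t)nx * ny];
    }
    __syncthreads();  // all threads reach this, even out-of-range ones

    // Interior voxels only; boundary handling is omitted in this sketch.
    if (inside && x > 0 && x < nx - 1 && y > 0 && y < ny - 1
               && z > 0 && z < nz - 1)
        out[idx] = (s[lz][ly][lx-1] + s[lz][ly][lx+1] +
                    s[lz][ly-1][lx] + s[lz][ly+1][lx] +
                    s[lz-1][ly][lx] + s[lz+1][ly][lx]) / 6.0f;
}
```

Launched with `dim3 block(TILE, TILE, TILE)`, this wins over the naive version mainly through data reuse; whether the underlying allocation is linear or pitched matters less once the reads come from shared memory.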

Hi Vijaeendra,

I looked at this exact issue a few months ago, also working on a voxel domain.
I saw a DECREASE in performance when moving from flattened (linear) storage to the pitched-pointer method using cudaMalloc3D.
The results were identical, so there was no problem with the implementation; it just ran logarithmically slower as the simulation time went on (see link below). This was very strange behaviour, but I thought it might have something to do with having to copy more padding cells? I was never sure.

See this post: The Official NVIDIA Forums | NVIDIA (no one seemed to be able to help)

I have not been able to explain why, and have since pretty much moved on, still using linear storage.
Perhaps you can try both methods yourself and do some benchmarking to see if you can find some answers.

Regards,
Mike