Returned pitch in cudaMallocPitch

I allocated a 2D array using cudaMallocPitch as follows:

size_t pitch  = 0;
size_t width  = sizeof(float) * 420;   /* row width in bytes */
size_t height = 42840;                 /* number of rows     */
float* d_ex   = 0;
CUDA_SAFE_CALL( cudaMallocPitch((void**)&d_ex, &pitch, width, height) );

When I used cuda-2.0, the returned value of pitch was sizeof(float)*432, where 432 is divisible by 16 (half the warp size in compute capability 1.x).

When I used cuda-6.5, the returned value of pitch was sizeof(float)*512, where 512 is divisible by 128.

I thought the returned pitch in cuda-6.5 would only need to be sizeof(float)*448, where 448 is divisible by both 32 and 64 (but not by 128).

Why does cuda-6.5 prefer to waste device memory by choosing 512 instead of 448?

(This larger pitch in cuda-6.5 not only makes my program take up more device memory, but also, I suspect, wastes memory bandwidth because so many padding zeroes are added to each row.)
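For reference, here is the padding arithmetic I had in mind, as a small host-side sketch (which alignment granularity the driver actually picks is of course up to CUDA; the byte alignments below are just the candidates that reproduce the numbers above):

#include <stdio.h>

/* Round a row of widthBytes up to the next multiple of alignBytes. */
static size_t padTo(size_t widthBytes, size_t alignBytes)
{
    return ((widthBytes + alignBytes - 1) / alignBytes) * alignBytes;
}

int main(void)
{
    size_t row = sizeof(float) * 420;  /* 1680 bytes of useful data per row */
    printf("align  64 B -> %zu floats\n", padTo(row,  64) / sizeof(float)); /* 432 */
    printf("align 128 B -> %zu floats\n", padTo(row, 128) / sizeof(float)); /* 448 */
    printf("align 512 B -> %zu floats\n", padTo(row, 512) / sizeof(float)); /* 512 */
    return 0;
}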

Do your observations pertain to a controlled experiment, that is, other than the change in CUDA version, are you using the same hardware and software platform?

Generally speaking, the pitch used by cudaMallocPitch() is a function of GPU hardware alignment requirements (which can vary with compute capability) and performance considerations (which could vary with CUDA version). The memory used for padding should not be accessed, so should not impact memory bandwidth requirements. Are you seeing evidence to the contrary? Obviously the total amount of device memory required increases as pitch granularity increases.

txbob may have some additional insights into this; for example, are there any design changes or known issues with cudaMallocPitch() in CUDA 6.5? I am not aware of any.

Basically, I observed that the cuda program I referred to, compiled with cuda-2.0, sm_13 and maxrregcount=16, gave pitch = sizeof(float)*432 on a GTX280, but pitch = sizeof(float)*512 on a GTX780.

The same program compiled with cuda-6.5, sm_13 and maxrregcount=16 gave pitch = 0 (!) on the GTX280, but pitch = sizeof(float)*512 on the GTX780. (Of course, the kernel would not launch on the GTX280 when pitch = 0.)

So, you are right that the returned pitch is dependent on the hardware and not on the cuda version.

I also suspected that there is a memory bandwidth issue with the larger pitch because my program is only 2.5 times faster on the GTX780 (cuda-6.5, sm_35, maxrregcount=0) than on the GTX280 (cuda-2.0, sm_13, maxrregcount=16), whereas I had expected it to be about 3.5 times faster on the GTX780. The same software platform (linux) was used for both cards.

I will experiment with using my own pitch (using cudaMalloc instead of cudaMallocPitch) and see what happens.
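Something along these lines is what I have in mind (untested sketch; a pitch of 448 floats is the value I want to try, reusing the width and height from my first post):

/* Untested sketch: allocate the 2D array with my own pitch of 448 floats
   instead of the 512 floats chosen by cudaMallocPitch(). */
size_t myPitchElems = 448;                          /* elements per row */
size_t myPitch      = myPitchElems * sizeof(float); /* pitch in bytes   */
float* d_ex = 0;
CUDA_SAFE_CALL( cudaMalloc((void**)&d_ex, myPitch * height) );
/* Kernels then index rows as (float*)((char*)d_ex + row * myPitch). */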

sm_35 and sm_13 are very different architectures, so you may need to make some adjustments to your code (e.g. thread block configuration) to get optimal performance on the newer GPU. As there are many more compute cores on the GTX 780, you may need to increase the amount of exposed parallelism. I would suggest letting the profiler guide you to the bottlenecks in the code.
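For example, a quick first pass with nvprof along these lines should show whether the kernel is limited by memory throughput or by occupancy (the binary name is just a placeholder):

nvprof --metrics achieved_occupancy,gld_efficiency,gst_efficiency ./myprogram

Low load/store efficiency there would point at the access pattern rather than at the pitch itself.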

After more testing, I have found that the GTX780 can be anywhere from 2 to 4 times faster than the GTX280 depending on the problem size.

So, I am now happy with the performance of the GTX780. It is unfortunate that Nvidia has removed this model from their product line.