Advantages of using cudaMemcpy3D / cudaPitchedPtr over regular cudaMemcpy? I am seeing a penalty using them

Hi everyone,

[EDIT]

Seeing as there is no interest in this question whatsoever, I’ll ask right up front: has anyone benchmarked the performance of their kernels with [font=“Courier New”]cudaMemcpy()[/font] vs [font=“Courier New”]cudaMemcpy3D()[/font]? I am seeing a DECREASE in performance using the latter and would like to know why. Below are the details of my usage.

===============

Background:

Up until now I have been using regular linear [font=“Courier New”]cudaMemcpy[/font]'s and manually indexing my array accesses. I have 3 largish arrays (hundreds^3 elements). Their logical dimensions are identical; the data types are: 1 array of [font=“Courier New”]unsigned char[/font] and 2 arrays of [font=“Courier New”]short[/font].

I have just moved all global memory transfers over to [font=“Courier New”]cudaMemcpy3D[/font] because, as stated in the Programming Guide, memory alignment is then taken care of. I access elements by computing addresses from the [font=“Courier New”]pitch[/font] and [font=“Courier New”]slicePitch[/font] parameters as explained on p. 22 of the Programming Guide. Everything runs correctly. Great.

However, it runs exactly 22% slower than before [see update below], irrespective of the number of iterations. I’m not sure why this should be the case – I would have expected nicely aligned memory to be equally fast, if not faster.

The access patterns are as follows:

    Copy the 1 [font=“Courier New”]unsigned char[/font] array (lookup array) and 2 [font=“Courier New”]short[/font] arrays (A1, A2) using [font=“Courier New”]cudaMemcpy3D(…HostToDevice)[/font];

    Collaboratively read 8^3 regions per thread block from lookup array to shared mem (coalesced load);

    Do iterative calcs: read once from A1 to determine N, iterate N times, and write to an intermediate UPDATE array in shared mem;

    Once calcs are done for the thread block, collaboratively write from the UPDATE array to A2 (coalesced store);

    Once all thread blocks are done, kernel execution returns to the host: from the host, overwrite the old A1 in global mem with the updated A2 array using [font=“Courier New”]cudaMemcpy3D(…DeviceToDevice)[/font];

    Increment simulation timestep, repeat from step 2.
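The copies in steps 1 and 5 can be sketched roughly as below. This is not compiled here (it needs a CUDA toolchain and device); the array names follow the post, but the extents, element type, and variable names are illustrative assumptions:

```cpp
// Hypothetical sketch of the step-1 upload and step-5 device-side refresh.
// nx, ny, nz and the short element type are illustrative; devA1/devA2 are
// cudaPitchedPtr's previously filled in by cudaMalloc3D().
cudaMemcpy3DParms up = {0};
up.srcPtr = make_cudaPitchedPtr(hostA1, nx * sizeof(short), nx, ny);
up.dstPtr = devA1;
up.extent = make_cudaExtent(nx * sizeof(short), ny, nz); // width in bytes
up.kind   = cudaMemcpyHostToDevice;
cudaMemcpy3D(&up);                                       // step 1 (repeat for lookup, A2)

cudaMemcpy3DParms refresh = {0};
refresh.srcPtr = devA2;
refresh.dstPtr = devA1;
refresh.extent = make_cudaExtent(nx * sizeof(short), ny, nz);
refresh.kind   = cudaMemcpyDeviceToDevice;
cudaMemcpy3D(&refresh);                                  // step 5: A2 -> A1 on the device
```

Note that the extent only covers the logical region, so the per-timestep A2 → A1 update never needs to round-trip through the host.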

So the slowdown is 22% regardless of how many times this is repeated. [see update below]

Does anyone with experience using [font=“Courier New”]cudaMemcpy3D[/font] vs regular [font=“Courier New”]cudaMemcpy[/font] have any insight into the reasons for the performance penalty? Is it the host-initiated A1->A2 update operation that is slowed down because there are more elements to copy due to padding?

Cheers,

Mike

On closer examination, it appears the slowdown increases logarithmically with the number of iterations. It was within 1% for the first couple of trial runs, so I didn’t notice it at first. See attached graph.

This is quite interesting behaviour, which I cannot explain.

Profiling the app with Visual Profiler before and after shows that both the host-side [font=“Courier New”]cudaMemcpy3D[/font] calls and the global mem accesses from within the kernel (using the [font=“Courier New”]pitch/slicePitch[/font] constructs I mentioned earlier) are taking longer (each [font=“Courier New”]cudaMemcpy3D[/font] call time is roughly constant, since the same size mem is being transferred).
[attachment: slowdown.png]

Have you called cudaMemcpy3D earlier in your code? I’ve read that the first call takes a little longer as it needs to do all the CUDA backend initialization.

I did some benchmarking of cudaMemcpy2D and found that the times were more or less comparable with cudaMemcpy. It was interesting to find that, for a matrix addition kernel I wrote, using cudaMalloc and cudaMemcpy was faster than cudaMallocPitch and cudaMemcpy2D. I didn’t notice any advantage in using one or the other, although the Best Practices Guide recommends using pitched memory access.

(Using Compute 2.1, CUDA 4.1, GTX 460)
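The warm-up-before-timing pattern mentioned above can be sketched as follows. This is not compiled here (it needs a CUDA toolchain and device), and the buffer sizes are arbitrary; the point is simply to pay the one-off context-initialization cost before timing anything:

```cpp
// Sketch of warm-up + event-based timing for cudaMemcpy vs cudaMemcpy2D.
// Requires a CUDA-capable device; sizes are arbitrary assumptions.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaFree(0);                               // force context creation up front

    const size_t w = 1024, h = 1024;
    float *dLin, *dPitched, *hBuf;
    size_t pitch;
    cudaMallocHost(&hBuf, w * h * sizeof(float));   // pinned host buffer
    cudaMalloc(&dLin, w * h * sizeof(float));
    cudaMallocPitch((void **)&dPitched, &pitch, w * sizeof(float), h);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    float ms;

    cudaEventRecord(t0);
    cudaMemcpy(dLin, hBuf, w * h * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("cudaMemcpy:   %.3f ms\n", ms);

    cudaEventRecord(t0);
    cudaMemcpy2D(dPitched, pitch, hBuf, w * sizeof(float),
                 w * sizeof(float), h, cudaMemcpyHostToDevice);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("cudaMemcpy2D: %.3f ms\n", ms);
    return 0;
}
```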