Hi everyone,
[EDIT]
Seeing as there is no interest in this question whatsoever, I’ll ask right up front: has anyone benchmarked the performance of their kernels with [font=“Courier New”]cudaMemcpy()[/font] vs [font=“Courier New”]cudaMemcpy3D()[/font]? I am seeing a DECREASE in performance using the latter and would like to know why. Below are the details of my usage.
===============
Background:
Up until now I have been using regular linear [font=“Courier New”]cudaMemcpy[/font] calls and manually indexing my array accesses. I have 3 large-ish arrays (hundreds^3 elements). The logical dimensions of each are identical; the data types are as follows: 1 array of [font=“Courier New”]unsigned char[/font], 2 arrays of [font=“Courier New”]short[/font].
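For reference, the old path looked roughly like this (a minimal sketch; the names and dimensions are placeholders, and the two [font=“Courier New”]short[/font] arrays are handled the same way):
[code]
// Old approach: one flat allocation per array, manual 3D -> 1D indexing.
// nx, ny, nz and h_lookup are placeholders for my actual sizes / host data.
size_t nx = 256, ny = 256, nz = 256;
unsigned char *d_lookup;
cudaMalloc(&d_lookup, nx * ny * nz * sizeof(unsigned char));
cudaMemcpy(d_lookup, h_lookup, nx * ny * nz * sizeof(unsigned char),
           cudaMemcpyHostToDevice);

// In the kernel, element (x, y, z) is addressed as:
//   d_lookup[(z * ny + y) * nx + x]
[/code]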
Just now I decided to move all device allocations and copies over to [font=“Courier New”]cudaMalloc3D[/font]/[font=“Courier New”]cudaMemcpy3D[/font] because, as stated in the Programming Guide, the memory alignment is then taken care of. I access the elements by calculating the [font=“Courier New”]pitch[/font] and [font=“Courier New”]slicePitch[/font] parameters as explained on p. 22 of the Programming Guide. Everything runs correctly. Great.
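This is roughly what the new pitched setup looks like for the lookup array (again a sketch; the host pointer and sizes are placeholders, and A1/A2 are set up the same way with [font=“Courier New”]short[/font]):
[code]
// New approach: pitched 3D allocation plus a 3D copy.
cudaExtent extC = make_cudaExtent(nx * sizeof(unsigned char), ny, nz);
cudaPitchedPtr d_lookupP;
cudaMalloc3D(&d_lookupP, extC);

cudaMemcpy3DParms cpy = {0};
cpy.srcPtr = make_cudaPitchedPtr(h_lookup, nx * sizeof(unsigned char), nx, ny);
cpy.dstPtr = d_lookupP;
cpy.extent = extC;
cpy.kind   = cudaMemcpyHostToDevice;
cudaMemcpy3D(&cpy);

// In-kernel addressing, following p. 22 of the Programming Guide:
//   char  *base        = (char *)d_lookupP.ptr;
//   size_t pitch       = d_lookupP.pitch;    // padded row width in bytes
//   size_t slicePitch  = pitch * ny;         // padded slice size in bytes
//   unsigned char *row = (unsigned char *)(base + z * slicePitch + y * pitch);
//   unsigned char  val = row[x];
[/code]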
However, it runs exactly 22% slower than before [see update below], irrespective of the number of iterations. I’m not sure why this should be the case – I would have expected nicely aligned memory to be equally fast, if not faster.
The access patterns are as follows:
[*]Copy 1 [font=“Courier New”]unsigned char[/font] array (lookup array) and 2 [font=“Courier New”]short[/font] arrays (A1, A2) using [font=“Courier New”]cudaMemcpy3D(…HostToDevice)[/font];
[*]Collaboratively read 8^3 regions per thread block from lookup array to shared mem (coalesced load);
[*]Do iterative calcs: read once from A1 to determine N, iterate N times, and write to an intermediate UPDATE array in shared mem;
[*]Once the calcs are done for the thread block, collaboratively write from the UPDATE array to A2 (coalesced store);
[*]Once all thread blocks are done, kernel execution returns to the host, which overwrites the old A1 in global mem with the updated A2 array using [font=“Courier New”]cudaMemcpy3D(…DeviceToDevice)[/font] (host loop sketched after this list);
[*]Increment simulation timestep, repeat from step 2.
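The host loop driving those steps looks roughly like this (a sketch with placeholder names; kernel launch configuration and error checking omitted):
[code]
// Placeholder names: simKernel, grid, block, and the cudaPitchedPtr /
// cudaExtent variables stand in for my actual ones.
for (int step = 0; step < numSteps; ++step)
{
    // Steps 2-4: coalesced loads, iterative calcs, coalesced store to A2.
    simKernel<<<grid, block>>>(d_lookupP, d_A1P, d_A2P, nx, ny, nz);

    // Step 5: overwrite old A1 with the freshly written A2, device to device.
    cudaMemcpy3DParms upd = {0};
    upd.srcPtr = d_A2P;          // cudaPitchedPtr from cudaMalloc3D
    upd.dstPtr = d_A1P;
    upd.extent = extS;           // make_cudaExtent(nx * sizeof(short), ny, nz)
    upd.kind   = cudaMemcpyDeviceToDevice;
    cudaMemcpy3D(&upd);
}
[/code]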
So the slowdown is 22% regardless of how many times this is repeated. [see update below]
Does anyone with experience using [font=“Courier New”]cudaMemcpy3D[/font] vs regular [font=“Courier New”]cudaMemcpy[/font] have any insight into the reasons for the performance penalty? Is it the A1/A2 update copy issued from the host that is slowed down because there are more elements to copy due to padding?
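In case it helps, this is how I would quantify how much row padding the pitched allocations add (placeholder names from the sketches above):
[code]
// Compare the logical row width against the padded pitch for the short arrays.
// d_A1P is the cudaPitchedPtr returned by cudaMalloc3D (placeholder name).
size_t logicalRow = nx * sizeof(short);
size_t paddedRow  = d_A1P.pitch;
printf("row: %zu -> %zu bytes (%.1f%% padding per row)\n",
       logicalRow, paddedRow,
       100.0 * (double)(paddedRow - logicalRow) / (double)logicalRow);
[/code]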
Cheers,
Mike