In my algorithm I used both linear memory (allocated with cudaMalloc) and pitched memory (allocated with cudaMalloc3D), but to my surprise the linear memory showed better performance than the pitched (aligned) memory. Does anyone have an idea why this is happening?
Note: I tested this algorithm on 3 different architectures (Fermi, Kepler and Maxwell).
Could you show a simple buildable, runnable program that would allow others to reproduce your findings? With no notion of what your code is doing and what is being timed, it is impossible to diagnose what may be happening.
Unfortunately I can’t show buildable code because it’s part of a large software framework. But I used the same kernel for linear and pitched memory. The only difference between the two versions is how they iterate over the data.
Linear Memory code:
unsigned int xOffset, yOffset, zOffset, fOffset;
xOffset = uint_t(1);                            // consecutive x elements are adjacent
yOffset = uint32_c( f.xAllocSize() );           // xAllocSize() = allocated size of the x dimension
zOffset = uint32_c( yOffset * f.yAllocSize() ); // stride between consecutive z slices
fOffset = uint32_c( zOffset * f.zAllocSize() ); // stride between consecutive fields
Build a new working test case out of your code that shows just the kernel and the minimum necessary to launch it and compare timings. Make sure it is a complete code that someone else can compile and run.