cudaPitchedPtr performance issues (especially 3D memory)


I hope I get an answer to my question as this is getting quite weird for the last few days.

While doing some research I’m required to implement some memory constrained kernels. The memory is as follows: I have a 512*384 image and I need 56 (or 64) values for each pixel. Each of these values is an integer. For storage I choose the cudaPitchedPtr data type and I’m defining a 3d memory with x=512, y=384 and z=56. I’m using a GTX280 device for tests. The issues I’m having are as follows:

  1. When applying a slightly modified version of the diagonal transpose I’m getting only around 40GB/s. Although this in itself is not a problem, but it may indicate some performance issues. The code was modified to handle the z-axis by introducing an additional for loo around the main transpose code. I don’t think that this modification should reduce the performance this badly (~50% running time increase compared to performance indicated in documentation).

  2. In different kernels I’m required to use blocks with the configuration 16x8->16x32. When the first dimension (threadIdx.x) is applied on the x-axis and the second dimension (threadIdx.y) is applied on the y-axis, I’m getting somewhere around 50-60GBs for a copy operation, but when applying the second dimension on the z-axis, running time increases almost 3 times. An example of access for the second method is as follows

[codebox]for(int h=0;h<height;h++)

((unsigned int*)(values.ptr))[(blockIdx.y*blockDim.y+threadIdx.y)*matri


I have checked my code for several days already and I don’t seem to find any errors and at also returns the correct result every time. The problem is that I’m in the optimizing phase and this would help a lot if it could be resolved.

I will try to perform more controlled tests in the following week, but maybe someone could give me some hints as to what may cause this problem.

Not familiar enough with what you are doing be of any great help, but have you tried using the profiler and/or occupancy calculator and see if they reveal any clues?

Sorry for responding so late, but I ran a battery of tests, to see the problem. What I found was that when the number of blocks is not a very large number than the second type of access can slow down considerably. This slow-down is not directly correlated with the number of blocks as some configurations work fast, some slow. I think this has to do with the DDR access pattern and how the GPU combines the access requests from warps and blocks. The profiler doesn’t show any differenc, except that execution time increases, sometimes almoast doubles.
A solution I found was to have 32 coalesced accesses. Interestingly this made my kernels run as fast as the first access method, even though the number of blocks was reduced to half.
I think that the problem is already discussed in the other posts regarding trasnpose and matrixMultiply so I will not continue it, but if anyone has questions I will still look into it and I will provide more detailed results.