I hope someone can answer this question, because the behaviour I'm seeing has been puzzling me for the last few days.
While doing some research I need to implement some memory-constrained kernels. The memory layout is as follows: I have a 512x384 image and need 56 (or 64) integer values for each pixel. For storage I chose the cudaPitchedPtr type and allocated a 3D region with x = 512, y = 384 and z = 56. I'm testing on a GTX 280. The issues I'm having are as follows:
When applying a slightly modified version of the diagonal transpose I only get around 40 GB/s. That is not a problem in itself, but it may indicate a deeper performance issue. The modification handles the z-axis by adding a for loop around the main transpose code, and I don't see why that alone should hurt performance this badly (~50% longer running time than the figures in the documentation).
In different kernels I'm required to use block configurations ranging from 16x8 to 16x32. When the first dimension (threadIdx.x) is mapped onto the x-axis and the second dimension (threadIdx.y) onto the y-axis, I get somewhere around 50-60 GB/s for a copy operation; but when the second dimension is mapped onto the z-axis instead, the running time almost triples. An example of the access pattern for the second method is as follows.
I have been checking my code for several days now and can't find any errors; it also returns the correct result every time. The problem is that I'm in the optimization phase, so it would help a lot if this could be resolved.
I will try to run more controlled tests in the coming week, but in the meantime maybe someone can give me a hint as to what might cause this.