Different run times depending on axis?

I have an app that works with 3-d float data. It applies a 1-d operator in x, then in y, then in z. The X dimension is the fastest access, followed by Y and then Z. There are actually 4 volumes in memory, 3 input and 1 output. For testing I’m using a volume that has dimensions of 150 x 534 x 534. This is about 1/4 the size I actually need, but I can’t allocate the arrays on the 32-bit system (hint: I need a 64-bit OS).

Because of the 512-thread limit, and to make the data fit in shared memory, I need to process each dimension in blocks. Due to the algorithm, the first 5 and the last 6 elements of each block are not valid (because of overlap with data that is not in shared memory yet), so I toss out the first and last 16 values to make sure I get reasonable memory access and that access stays aligned to 16 words. The operator is 12 points long.
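The blocking arithmetic can be sketched roughly as follows. This is only my reading of the scheme, not the actual code: it assumes each block of `block` threads keeps `block - 32` valid outputs (16 tossed at each end) and that an axis is padded up to a whole number of blocks plus the two 16-value ends. The names `HALO` and `padded_length` are made up for illustration. Under those assumptions it reproduces the padded dimensions quoted below for block sizes 112 and 128.

```python
import math

HALO = 16  # values discarded at each end of a block (from the post)


def padded_length(n, block):
    """Padded axis length, assuming each block yields block - 2*HALO valid outputs."""
    valid = block - 2 * HALO           # outputs kept per block
    n_blocks = math.ceil(n / valid)    # whole blocks needed to cover n samples
    return n_blocks * valid + 2 * HALO


for block in (112, 128):
    dims = [padded_length(n, block) for n in (150, 534, 534)]
    print(block, dims)
```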

I tried two cases. In the first, I used a block size of 112. With the required overlap and expansion of the volume to have complete blocks, the dimensions increase to 192x592x592. The wall clock time is 12.1 seconds in the optimized CPU version and 3.74 seconds in the GPU version. The CUDA profiler shows times as follows:

method=[ kernel2x ] gputime=[ 183964.219 ] cputime=[ 183977.000 ] occupancy=[ 0.833 ]
method=[ kernel2y ] gputime=[ 319831.594 ] cputime=[ 319838.000 ] occupancy=[ 0.667 ]
method=[ kernel2z ] gputime=[ 2076635.875 ] cputime=[ 2076601.125 ] occupancy=[ 0.667 ]

I expect the z kernel to run longer since it has the worst memory access pattern. The size of the code is 1020 bytes so I can fit 3 or 4 threads in shared memory.

If I change the block size to 128, the dimensions increase to 224x608x608. The wall clock times are 11.5 seconds for the CPU version and 2.59 seconds for the GPU version. The code size is 1148 bytes so I can fit 2 or 3 threads in memory. The CUDA profiler shows:

method=[ kernel2x ] gputime=[ 113474.336 ] cputime=[ 113489.000 ] occupancy=[ 0.833 ]
method=[ kernel2y ] gputime=[ 211501.344 ] cputime=[ 211510.000 ] occupancy=[ 0.667 ]
method=[ kernel2z ] gputime=[ 831834.875 ] cputime=[ 831830.000 ] occupancy=[ 0.667 ]

Note the big difference in the gputime for kernel2z between the runs. The same amount of data is processed in both runs, but the z kernel takes over twice as long with the block size of 112 (about 2.08 seconds versus 0.83 seconds).

Can someone illuminate why there is this difference in runtimes?

Also, if I add up all the times from the profiler, the total doesn’t come close to the measured wall clock time. Why the difference?
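To put numbers on that gap, here is a small sketch that sums the gputime values from the two listings above and compares them with the wall clock times. It assumes gputime is reported in microseconds; the `runs` table and `kernel_seconds` helper are just illustrative names.

```python
# gputime values copied from the two profiler listings in this post,
# assumed to be in microseconds; wall clock times are the GPU figures quoted.
runs = {
    112: {"wall_s": 3.74, "gputime_us": [183964.219, 319831.594, 2076635.875]},
    128: {"wall_s": 2.59, "gputime_us": [113474.336, 211501.344, 831834.875]},
}


def kernel_seconds(gputime_us):
    """Total time spent in the listed kernels, in seconds."""
    return sum(gputime_us) / 1e6


for block, run in runs.items():
    k = kernel_seconds(run["gputime_us"])
    print(f"block {block}: kernels {k:.2f} s, wall {run['wall_s']:.2f} s, "
          f"unaccounted {run['wall_s'] - k:.2f} s")
```

Under that assumption the three kernels account for roughly 2.58 s of the 3.74 s run and 1.16 s of the 2.59 s run. The remainder is whatever the three kernel lines don't cover, possibly host-device copies, allocation, or host-side work, none of which appear in the listings I posted.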

Finally, I would think that getting more threads in shared memory would help, but the case that fits more (block size 112) is actually the slower one. It might be that the larger block size means less data is discarded and recomputed.
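A rough way to size that effect, assuming 16 values are tossed at each end of every block regardless of block size (the `kept_fraction` helper is hypothetical, just for this estimate):

```python
HALO = 16  # values discarded at each end of a block (from the post)


def kept_fraction(block):
    """Fraction of a block's elements that survive as valid outputs."""
    return (block - 2 * HALO) / block


for block in (112, 128):
    print(block, round(kept_fraction(block), 3))
```

That gives about 71% kept for 112 versus 75% for 128 along each axis, which is a modest difference and by itself seems unlikely to explain the z kernel running over twice as long.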

Thanks guys, looking good so far… :)