2D grid and block performance


I’m trying to compare the performance of several CUDA implementations of the same problem, a weighted Jacobi iterative solver.
The problem is that when I use 2D grids and blocks, performance is very poor: about 3.5 times slower than the implementation with a 1D grid and 1D blocks.
I’ve already checked for coalesced memory accesses with the cudaProfiler, and it reports no uncoalesced accesses. I’m using textures for the input vector, but I don’t think the memcpy needed for the 2D version is the bottleneck, since I removed that part of the code for testing.

I am basically calling the kernel 1000 times for each version, and each kernel call performs one weighted Jacobi step. It is a very simple implementation. For the 1D version I am using 1D textures (reading the values with tex1Dfetch), and for the 2D version I am using 2D textures (reading them with tex2D).
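To make the setup concrete, here is a minimal sketch of what the two kernel variants might look like. It is not my actual code, only an illustration under some assumptions: a 5-point Laplacian stencil (diagonal 4, off-diagonals -1) on an nx-by-ny grid stored row-major, boundary values left unchanged, and the texture object API (`cudaTextureObject_t`) with unnormalized coordinates and point filtering. All names (`wjacobi_step_1d`, `wjacobi_step_2d`, `x_tex`, `b`, `x_new`, `omega`) are hypothetical.

```cuda
#include <cuda_runtime.h>

// 1D variant (hypothetical): 1D grid, 1D blocks, one thread per unknown.
// x_tex is assumed to be a texture object over linear device memory
// (cudaResourceTypeLinear), so it is read with tex1Dfetch.
__global__ void wjacobi_step_1d(cudaTextureObject_t x_tex,
                                const float* __restrict__ b,
                                float* __restrict__ x_new,
                                int nx, int ny, float omega)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nx * ny) return;

    int col = i % nx;
    int row = i / nx;
    float xc = tex1Dfetch<float>(x_tex, i);

    // Boundary points are simply copied (assumed fixed boundary values).
    if (col == 0 || col == nx - 1 || row == 0 || row == ny - 1) {
        x_new[i] = xc;
        return;
    }

    float nb = tex1Dfetch<float>(x_tex, i - 1)  + tex1Dfetch<float>(x_tex, i + 1)
             + tex1Dfetch<float>(x_tex, i - nx) + tex1Dfetch<float>(x_tex, i + nx);

    // Weighted Jacobi: x_new = (1 - omega) * x + omega * (b + neighbours) / 4
    x_new[i] = (1.0f - omega) * xc + omega * 0.25f * (b[i] + nb);
}

// 2D variant (hypothetical): 2D grid, 2D blocks.
// x_tex is assumed to be a texture object over a 2D cudaArray with
// unnormalized coordinates and cudaFilterModePoint, read with tex2D.
__global__ void wjacobi_step_2d(cudaTextureObject_t x_tex,
                                const float* __restrict__ b,
                                float* __restrict__ x_new,
                                int nx, int ny, float omega)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= nx || row >= ny) return;

    int i = row * nx + col;
    // Sample at texel centres (+0.5) with unnormalized coordinates.
    float u = col + 0.5f;
    float v = row + 0.5f;
    float xc = tex2D<float>(x_tex, u, v);

    if (col == 0 || col == nx - 1 || row == 0 || row == ny - 1) {
        x_new[i] = xc;
        return;
    }

    float nb = tex2D<float>(x_tex, u - 1.0f, v) + tex2D<float>(x_tex, u + 1.0f, v)
             + tex2D<float>(x_tex, u, v - 1.0f) + tex2D<float>(x_tex, u, v + 1.0f);

    x_new[i] = (1.0f - omega) * xc + omega * 0.25f * (b[i] + nb);
}
```

In a sketch like this, the 1D kernel would be launched with something like 256 threads per block and `(nx * ny + 255) / 256` blocks, and the 2D kernel with a `dim3` block such as 16x16 and a grid that covers nx by ny. Here `threadIdx.x` maps to the column, so consecutive threads read `b` and write `x_new` at consecutive addresses.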

One thing I noticed in the cudaProfiler is that the instruction count is about 4 times higher for the 2D version. But when I compare the code of the two kernels, the 2D one seems to have only about 30% more instructions. Does a 2D grid add hidden instructions?

Thank you in advance,

Sorry for my English.