I’m trying to use the matrix transpose code sample from CUDA SDK.
It works well for square matrices and for matrices whose dimensions are powers of 2.
I now have 2 problems :
For matrices smaller than 16x16, it doesn’t work since no threads are created. I worked around that by using the naive version for matrices smaller than 16x16 and by calling it with:
dim3 grid((size_x+15) / 16, (size_y+15) / 16, 1);
For fully arbitrary matrices (i.e. non power of 2, non square), I get strange results (I still call it with the same dim3 grid as above): I get several “0” entries in my matrix that should not be there.
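To make the setup concrete, here is a minimal sketch of a naive transpose launched with the rounded-up grid above. It is not the SDK kernel: the names `transpose_naive_guarded`, `transpose_matches` and `TILE` are hypothetical, and it assumes a row-major input with size_y rows and size_x columns. The point it illustrates is that, once the grid is rounded up, the threads in the last partial tiles fall outside the matrix, so the kernel needs a bounds check on both indices. A missing or incomplete check is one possible source of wrong entries for arbitrary sizes.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // block edge, matching the 16 used in the grid computation

// Hypothetical naive transpose (not the SDK sample).
// in : size_y rows x size_x columns, row-major
// out: size_x rows x size_y columns, row-major
__global__ void transpose_naive_guarded(float *out, const float *in,
                                        int size_x, int size_y)
{
    int x = blockIdx.x * TILE + threadIdx.x;  // column index in the input
    int y = blockIdx.y * TILE + threadIdx.y;  // row index in the input

    // The grid is rounded up, so threads in the last partial tiles lie
    // outside the matrix; they must neither read nor write.
    if (x < size_x && y < size_y)
        out[x * size_y + y] = in[y * size_x + x];
}

// Hypothetical host helper: transposes on the GPU and compares every
// element against the expected value computed on the CPU.
bool transpose_matches(int size_x, int size_y)
{
    size_t n = (size_t)size_x * size_y;
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Same launch configuration as in the post: one 16x16 block per tile,
    // with the tile count rounded up in each dimension.
    dim3 block(TILE, TILE, 1);
    dim3 grid((size_x + TILE - 1) / TILE, (size_y + TILE - 1) / TILE, 1);
    transpose_naive_guarded<<<grid, block>>>(d_out, d_in, size_x, size_y);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int y = 0; y < size_y; ++y)
        for (int x = 0; x < size_x; ++x)
            if (h_out[(size_t)x * size_y + y] != h_in[(size_t)y * size_x + x])
                return false;
    return true;
}
```

Calling `transpose_matches(100, 37)` or `transpose_matches(5, 3)` from `main` exercises the non-square, non power of 2 and smaller-than-16x16 cases. The shared-memory version of the sample would need the same kind of guard on both the load into the tile and the store from it, since the two use swapped indices.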
Is there any solution ?
Thank you very much in advance,