Problems in a volume rotation

Hi to everyone,
I’m trying to implement a 3D rotation of a volume considering the rotation as a sum of 2d rotation along each direction. I think it’s simpler than considering a custom 3 rotation, also because I have 2d bilinear interpolation for free :yes:.
The idea is to load the volume entirely into GPU memory , and use an algorithm that is very similar to the one found into SDK (it’s called simple texture), loading each slice one by one in a texture and obtain the new image interpolated directly by texture fetching from the old one. I’ve already overcome to some initial problems, like the conversion from unsigned short int format of the volume to float (necessary for bilinear interpolation), but I can’t implement the rotation. What I obtain is an image that seems to be rotated, but It’s full of NaN.I think that it could be derived from the choice of block and grid dimensions, I also tried to pad the volume to a dimension that is power of 2, but nothing better happens…
Anyone could have some kind of an Idea?
I’ll really appreciate it.
Thanks for you help,


Just as an aside, in the CUDA 2.0 beta, you now get 3D interpolation for free as well. :)

Good to know it, it could be a future, simpler implementation of my algorithm ;) . But I’m wondering why I’m getting so many NaN into my image…could it be derived to a bad choice of block dimensions?Could a bad coalescent reading give read-error?

It’s hard to say without seeing your code. Does it run correctly in emulation mode?

Memory coalescing doesn’t affect correctness, just performance.

BTW, it’s actually not necessary to use float format to get interpolation - you just need to use “cudaReadModeNormalizedFloat” as the read mode.

I’ve found where my problem was. I was specifying a bad amount of bytes into the function cudaMemcpyToArray, so it was crashing.

For what concern the interpolation, tell me If I’ve understand right:

-I have to load my volume as unsigned short int

-I have to declare my texture as texture<unsigned short int, 2, cudaReadModeNormalizedFloat> tex;

-I perform the rotation , obtaining int again.

using just 2 bytes for each pixel, I should be able to play with volume quite large (even 700^3) B)

I’m also curious about the new texture3d type introduced with cuda 2.0…performing a full trilinear interpolation should be really faster than a composition of 3 bilinear itnerpolation… :D

By all means, give 3D textures a try, they work on the entire installed base of CUDA-capable hardware.

To enable interpolation, you must change the filtering mode associated with the texture. Note, the interpolation is lower precision than what your kernel could compute manually.

Also, 3D textures have different max dimensions (2048^3 as opposed to 64K x 32K).

Could it be possible in the future to have a 3D texture that is 16x16x(more than 2048) ?
I have 2 dimensions that stay quite low, but a third dimension that could become more than 2048 (for now 2048 will do I think (will have to check), but that might soon not be enough anymore)
Or are there hardware limitations that prevent this? (I can imagine, given that this will probably not occor so quickly with graphics)

Unfortunately It’s impossibile to install CUDA 2.0 beta together with 1.1. This is bad, because I cannot simply test the new features of CUDA 2 just simply linking to a libreary instead of another…Why have you chosen to not permit coexistence?

Why the precision of trilinear interpolation is worse than the bilinear one?I would have aspect that and ipothetical tex3D would be quite precise as tex2D…

My idea is that if you consider the volume as a stack of texture, you could perform the rotation with only 1 volume in memory and some image buffer for the rotation. On the other hand, if you want to implement a full 3D rotation with trilinear interpolation, you will need 2 different volumes (because the rotation cannot be performed in-place). So considering a “human” equipment (8800 GTX with 768 MB of RAM) you can store 2 volumes up to 700^3 elements with int precision.

Any way, I have a quick question to ask: I’m performing the rotation on the other axe (the y) and for all the slices I need to build up a texture taking from several parts of the volume, so I decided to write a kernel for it that looks more or less like this:

global void reshuffle(float *VolIn, float *SliceOUT, int width, int height)


    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;

const int numThreads = blockDim.x * gridDim.x;

float Src;

for (int index = threadID; index < width*height; index +=numThreads)







//	Src=vol[threadID*width*height];

//	SliceToBeRotated[threadID]=Src;}


The problem is that during the execution of my program I receive “invalid device pointer”

What I did wrong?And moreover which one of the two version is quicker? I suppose the second one (the commented one)…

The coalescent reading is still a bad beast for me…

Again, thanks for you help guys…

Finally I’ve got the algorithm working. It performs the rotation of a volume 250^3 along the three axis in ~300 ms…that’s not bad :thumbup:

The real bottleneck is reading the volume (that is stored in unsigned short) into float and back. So, I’m trying to work directly with unsigned short. If I’ve understood well what I have to specify are these lines:

  • the texture must be declared like this:

        <b> texture<sInt, 2, CudaReadModeNormalizedFloat> tex;</b> 
  • the array must be declared like this:

        [b]cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(16, 0, 0, 0, cudaChannelFormatKindUnsigned);
            cudaArray* cu_array;[/b] 
        <b>GPU(cudaMallocArray(&cu_array, &channelDesc, XDIM, YDIM));</b> 
  • and the kernel that obtain interpolated data has to look like this:[b]

global void transformKernel(sInt *g_odata, int width, int height, float theta)


// calculate normalized texture coordinates

unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;

unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

float u = x / (float) width;

float v = y / (float) height;

// transform coordinates

u -= 0.5f;

v -= 0.5f;

float tu = ucosf(theta) - vsinf(theta) + 0.5f;

float tv = v*cosf(theta) + u*sinf(theta) + 0.5f;

// read from texture and write to global memory

g_odata[y*width + x] =tex2D(tex, tu, tv);


With this setup I read back a volume that is completely empty. On the other hand, If I specify a channeldescriptor like this:

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc( 32 , 0, 0, 0, cudaChannelFormatKindUnsigned);

I get back the initial volume without any changes…it seems that I don’t perform any interpolation…

Any ideas?

Thanks for you help,