Seaching a fast 3d Rotation

Hi @all,
i am currently searching an implementation of a 3d rotation because mine is not working <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=‘:’(’ /> . The size of the data will be around 512512512. Is there any implementation for that?
regards
bin

You mean the kind that powers 3D graphics? Where you multiply your coordinates by a 4x4 rotation matrix and use the new coordinates to sample a texture?

It should be extremely easy, although optimizing for good memory access would be much trickier.

Interesting problem! I would have looked at the possibility of factorizing the rotation into a series of shear and scaling operations. Making sure that memory operations are coalesced ought to be easier with shears than with rotations. The only problematic dimension is the first (x). One could imagine that a quick transpose (a mirroring along the x-y diagonal for example) could be used to solve that. By the way, I found this paper that might help you: http://citeseer.ist.psu.edu/301520.html.

One thing: do not store the result in the same variable
There is a ptxas bug preventing from you performing matrix-vector multiplication and store it in the source vector.
See this post:
[url=“The Official NVIDIA Forums | NVIDIA”]The Official NVIDIA Forums | NVIDIA

However, seeing “512512512”, seems you’re doing a rotation on volume data. In that case, I don’t think that is a CUDA issue, you should look for an working algorithm first. Could try marching cubes + render to volume.

the problem is that i have to interpolate after the rotation . this interpolation takes the points from all 6 direct connected points ( left,right,up,down,front,back) . this is something “strange” memory shifting.

is there a possibility to start 512512512 threats (e.g. for each pixel) would be nice for the start to look if it is working with my idea right now.

regards

Trilinear filtering too? This problem will be dead-easy (and very fast) once 3D textures are introduced to CUDA (which I think won’t be at least till next year). In fact, you might consider ditching CUDA and just use directx. There’ll be zero memory issues (yay texture cache), very high performance, and complete hardware independence.

But for now, how about you post your current implementation so we can see why it doesn’t work?

ok the idea is great BUT i need to do it on Fedora. and i have to do at least 360360180 rotations per calculation so … it should be really really fast.

opengl should expose 3d textures too.

Depending on the implementation of the texture cache (a good opportunity to compare nvidia vs ati here), this operation will be very, very fast. Ideally, it will be as fast as a memcpy.

Hi,

some (for me) strang failure occurs if i want to set a float .

   memSize = size_x * size_y*size_z*sizeof(float);

    CUDA_SAFE_CALL(cudaMalloc((void**)&devMemImg, memSize));

    CUDA_SAFE_CALL(cudaMalloc((void**)&devMemRotImg, memSize));

......

CUDA_SAFE_CALL( cudaMemset(devMemRotImg, 0, memSize) );

Hi,

some (for me) strang failure occurs if i want to set a float .

i’ll do something like

   memSize = size_x * size_y*size_z*sizeof(float);

    CUDA_SAFE_CALL(cudaMalloc((void**)&devMemImg, memSize));

    CUDA_SAFE_CALL(cudaMalloc((void**)&devMemRotImg, memSize));

......

    CUDA_SAFE_CALL( cudaMemset(devMemRotImg, 0, memSize) );

......

and then in the kernel i want to set (x is some index)

   devMemRotImg[X]=0;

and then the error occurs

line 193 is

CUT_CHECK_ERROR("Kernel execution failed");

what could this be because the whole procedure is working fine with emu=1. without this the procedure is running and doesn’t do any usefull.

regards

Well, if you’re still doing 512512512*sizeof(float) datasets, then obviously you can’t fit two of those into 768MB. The kernel will also fail if other resource limits are exceeded, like with too many threads in a block (not enough registers).

mmm ok i have now reduced my image to 323232 to avoid that. but the same error occurs. any other ideas. (code is available at http://www.codeproject.com/script/comments…353&msg=2256353 , it is only the try out code so its a liitle bit ugly but should work i think but it doesn’t)

What can I say. Reduce your program to as little as possible until it does work, and then build back up. Also add the option nvcc -keep so that you get the cubin which will tell you how many registers, shared mem, and constant mem your kernel is requiring. There’s also always the possibility of a compiler bug, but don’t bet your barn on it.

Now i have found that i could not write to the memory where my rotated image should be. is there any possibility to make sure that the allocated memory is not in a read only memory.?

Uh, sure. Just allocate it with cudaMalloc(). Any thread in any block should be able to write to any memory location in the region allocated.

i do allocat the mem with cudaMalloc()

   CUDA_SAFE_CALL( cudaMemset(devMemRotImg, 0, memSize) );

    CUDA_SAFE_CALL( cudaMemset(devMemImg, 0, memSize) );

   CUDA_SAFE_CALL(cudaMemcpy(devMemImg, hostImg, memSize, cudaMemcpyHostToDevice));

but if i run the kernel there is the problem

devMemRotImg[(r_z_l*size_y+r_y_l)*size_x+r_x_l] = (num_t) InterpolatedValueS (devMemImg, size_x, size_y, size_z, i, j, k);

i have tried to do several things. i can read and write to devMemImg and read from devMemRotImg but i cann’t write to devMemRotImg

any othere idea.

regards

ps: link to a version is included in one former post.

I haven’t tried to decipher your code, but are you absolutely sure that (r_z_l*size_y+r_y_l)*size_x+r_x_l never indexes outside of the array memory bounds? You could check this by running the kernel in emulation mode through valgrind on linux.

i am sure because in emulator mode it is running as it should. means no error or problems and the result is also as it should look like. the problem starts if i try to run in debug emulator mode or debug or release mode. there i get the error that something with the kernel is wrong. but i don’t know why because if it is working fine in emulator mode shouldn’t it work in all oder modes?

“Working fine” doesn’t mean that you never write past the end of the array. You have to write outside your programs addressable space to get a seg fault.

Try running the emulation mode executable through valgrind as I suggested. It is a wonderful tool that can tell you whenever you write past the end of an array (even if it won’t produce a seg fault) and even what line of code is causing the problem.

Note: I’m only pushing this memory access check because I usually get “unspecified launch failure” when I accidentally write past the end of an array. I did just remember that I’ve gotten it before when my kernel requested a block size too big to run on the device too, so I looked at your code. You are calling the kernel like this:
rot3d<<>>(devMemImg, devMemRotImg, size_x, size_y, size_z, rotMatrix);
Why aren’t you passing the block and grid specifications to the kernel? I would guess that it is using uninitialized memory there and probably trying to run a thread block with 32897 threads per block and the device isn’t happy about that. How it works in emulation mode, I have no idea.