Some newbie questions for raytracing with CUDA


I am trying to implement a raytracer with CUDA, but I have some questions:

  1. I'm reading some OBJ files, where the vertices for the raytracing are stored. When I upload this data to the GPU with cudaMalloc, it takes too much time and the kernel launch fails. Is it better to upload this data into a texture or into constant memory? There should be thousands of triangles… or is there another solution that I have forgotten?

  2. When I render an image with only two or three triangles, my frame rate is just 10 fps. What could be the problem? I would like to use ray packets, but I don't know how to do this.

  3. During rendering I can see some noise, as if it were the individual rays… I think if I could raise the frame rate, this problem would be solved.

Thank you,

Greetings thopil :'(

Texture. But cudaMalloc does not copy data to the GPU, so please post some code.
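To illustrate the distinction (variable names here are only illustrative, not taken from your code): cudaMalloc only reserves device memory, and a separate cudaMemcpy does the actual transfer.

```cuda
// cudaMalloc reserves space on the device; it copies nothing.
Triangle *triangles_device = NULL;
cudaMalloc((void**)&triangles_device, numTriangles * sizeof(Triangle));

// cudaMemcpy performs the host-to-device transfer, ideally once,
// outside the render loop, since the geometry is static.
cudaMemcpy(triangles_device, triangles_host,
           numTriangles * sizeof(Triangle), cudaMemcpyHostToDevice);
```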

Post some code, because we are not psychic ;)

Get it working with the mesh data in global memory first, then move on to binding it to a texture.

You'll need to profile your application. There's not much point in ray packets with CUDA, as the SIMD architecture essentially gives you packets for free.


First, thanks for your replies!!! :D

int main(int argc, char **argv)
{
    ...
    CUDA_SAFE_CALL(cudaMallocHost((void**) &triangles_host, numTriangles * sizeof(Triangle)));
    CUDA_SAFE_CALL(cudaMalloc((void**) &scene_device, sizeof(Scene)));
    CUDA_SAFE_CALL(cudaMalloc((void**) &triangles_device, numTriangles * sizeof(Triangle)));
    ...
}

void displayFunc(void)
{
    ...
    CUDA_SAFE_CALL(cudaMemcpy(scene_device, &scene, sizeof(Scene), cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL(cudaMemcpy(triangles_device, triangles_host, numTriangles * sizeof(Triangle), cudaMemcpyHostToDevice));
    ...
    dim3 threads(BLOCKSIZE, BLOCKSIZE);
    ...
    raytrace<<<grid, threads>>>(d_dst, scene_device, numTriangles, triangles_device);
    ...
}

__global__ void raytrace(Color *dst, Scene* global_scene, int numTriangles, Triangle* global_triangles)
{
    ...
    if ((tx == 0) && (ty == 0))
    {
        shared_scene = *global_scene;
        shared_triangles = global_triangles;
        shared_numTris = numTriangles;
    }

    const int screenX = tx + bx * BLOCKSIZE;
    const int screenY = ty + by * BLOCKSIZE;
    const int idxC = screenY * PIC_WIDTH + screenX;

    ... object-ray intersection and shading ...
}



Oooh… I thought I had to allocate the memory for the GPU with cudaMalloc and then copy it up with cudaMemcpy…

What do you mean by this?

Thank you for this advice, but I thought I could generate the rays for a block and that such a block would be a ray packet. So I was thinking about it the wrong way… :whistling:

Thank you and bye,

Greetings thopil

From looking at your code, it seems that you are doing a few cudaMemcpys per frame. How are you displaying the final pixels? Do you do a cudaMemcpy back to the host? If so, this is likely your bottleneck.
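If the final image is in fact being copied back to the host each frame and drawn with something like glDrawPixels, OpenGL interop can keep the pixels on the GPU instead. A rough sketch using the CUDA GL interop calls, assuming a pixel buffer object `pbo` has already been created and sized with glBufferData (names and setup are assumptions, not taken from the posted code):

```cuda
// Once at startup: register the PBO with CUDA.
cudaGLRegisterBufferObject(pbo);

// Per frame: map the PBO, let the kernel write pixels directly into it,
// then unmap and draw. No device-to-host copy is needed.
Color *d_dst = NULL;
cudaGLMapBufferObject((void**)&d_dst, pbo);
raytrace<<<grid, threads>>>(d_dst, scene_device, numTriangles, triangles_device);
cudaGLUnmapBufferObject(pbo);

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glDrawPixels(PIC_WIDTH, PIC_HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, 0);
```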

Kernel arguments are already in shared memory. And you are not copying the triangle data into shared memory there, just the pointer to it.
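If you actually want the triangle data itself in shared memory (only worthwhile for small scenes, since shared memory is tiny per block), the threads of the block should cooperate in the copy and then synchronize before anyone reads it. A sketch under those assumptions, with `MAX_SHARED_TRIS` as an illustrative limit:

```cuda
__shared__ Triangle shared_triangles[MAX_SHARED_TRIS];

// All threads of the block take part in the copy, not just thread (0,0).
int tid = ty * BLOCKSIZE + tx;
for (int i = tid; i < numTriangles && i < MAX_SHARED_TRIS; i += BLOCKSIZE * BLOCKSIZE)
    shared_triangles[i] = global_triangles[i];

// Every thread must wait until the copy is complete before intersecting.
__syncthreads();
```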

I think it is best to put your geometry into 1D arrays (float4), bind a texture to them, and access the geometry through the texture. The way you are doing it right now, reading it straight from global memory, means all your accesses are non-coalesced.
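With the texture reference API, that could look roughly like the sketch below; packing three float4 vertices per triangle and the helper name are my assumptions:

```cuda
// File-scope texture reference bound to the float4 geometry array.
texture<float4, 1, cudaReadModeElementType> triTex;

// Host side: upload the packed vertices, then bind the texture once.
cudaMalloc((void**)&d_verts, numTriangles * 3 * sizeof(float4));
cudaMemcpy(d_verts, h_verts, numTriangles * 3 * sizeof(float4),
           cudaMemcpyHostToDevice);
cudaBindTexture(0, triTex, d_verts, numTriangles * 3 * sizeof(float4));

// Device side: fetch through the texture cache instead of raw global loads.
__device__ void loadTriangle(int i, float4 &v0, float4 &v1, float4 &v2)
{
    v0 = tex1Dfetch(triTex, 3 * i + 0);
    v1 = tex1Dfetch(triTex, 3 * i + 1);
    v2 = tex1Dfetch(triTex, 3 * i + 2);
}
```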

Run your program through the Visual Profiler; there you can collect all the statistics on non-coalesced accesses you may ever need.

Hey man, you don't have to call that a noob question… I don't even know this programming language that you guys are speaking. I can talk hardware, but not software programming.