Raycasting performance on GPU

OK, I’m finishing my raycaster, and I’m curious what performance others are getting in this field.
My results so far (for the ‘sponza’ model, http://hdri.cgtechniques.com/~sponza/files/):

GFX card: NVIDIA GeForce GTX 280 @ 1.3 GHz
KDTree construction with SAH (depth = 20) -> 1.3 sec (which is quite OK compared to an optimized CPU solution)

Average traversal performance -> 2.7 MRays / second

IMHO the performance of traversal is poor ;(
Currently I’m using the push-down with short-stack traversal algorithm.
The stack is 8 items per thread, which gives 192 threads per block (shared memory is so small). Each thread calculates one ray, and the whole thing suffers a lot from warp divergence :/
Packetized traversal eats too many registers, and in the end the result is worse
(I’ve experimented with 2x2 packets).
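
For reference, the shared-memory arithmetic behind the 192-thread figure can be sketched roughly as below. The 16 KB of shared memory per multiprocessor is the limit for GTX 280 class hardware (compute capability 1.3). The 8-byte stack entry (a node index plus a float distance) and all names in the snippet are assumptions, since the post does not say what a stack item holds.

```cpp
#include <cstddef>

// Assumption: one short-stack entry is a 32-bit node index plus a float.
struct StackEntry {
    unsigned int node;  // index of the deferred kd-tree node
    float        tmax;  // far end of the ray segment for that node
};

// Shared memory per multiprocessor on compute capability 1.3 hardware.
constexpr std::size_t kSharedBytesPerSM = 16 * 1024;

// Bytes of shared memory taken by the per-thread stacks of one block.
constexpr std::size_t stackBytesPerBlock(std::size_t threads, std::size_t depth) {
    return threads * depth * sizeof(StackEntry);
}

// True if the stacks of one block fit into a multiprocessor's shared memory.
constexpr bool fitsInSharedMemory(std::size_t threads, std::size_t depth) {
    return stackBytesPerBlock(threads, depth) <= kSharedBytesPerSM;
}
```

Under these assumptions, 192 threads with 8 entries each take 12288 bytes, while 256 threads would take the full 16384 bytes. That would leave no room for kernel arguments, which on this hardware generation are also passed through shared memory.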

You might want to check out this guy’s implementation, he has some good tips for CUDA optimization:

NVIDIA demoed a raytracer at NVISION (and other events) that uses BVH (possibly with KD-trees in the BVH btw). As far as I remember, they also do packet traversal.

No, I believe it’s a simple thread-per-ray.

The presentation is here:

I think you are right. I had it in my mind that they tried to group threads (read: rays) that follow comparable paths together in blocks, but that does indeed seem to be incorrect.

After some tweaking and slashing I’ve gone from 2.7 MRays / second to ~12 MRays / second
(I’ve reduced the number of if blocks to the minimum; some work is now done redundantly, but the divergence is much lower).
Another thing I’ve discovered is an almost 100% texture cache miss rate when sampling KDTree branches. Now I’m trying to rearrange the node table to be more cache friendly :)
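
One common way to make a node table more cache friendly is to store the nodes in depth-first order, so that a left child always sits right after its parent. The sketch below is host-side C++ and only illustrates that idea. The `KdNode` layout and the `reorderDepthFirst` name are hypothetical, because the actual node format is not shown in this thread, and whether this layout helps the texture cache here is untested.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical node layout; the real node format is not shown in the thread.
struct KdNode {
    float        split;  // split position (unused for leaves)
    std::int32_t left;   // index of the left child, -1 if absent
    std::int32_t right;  // index of the right child, -1 if absent
};

// Returns a copy of `nodes` in depth-first (pre-order) order, so that the left
// child is stored directly after its parent. Child indices are rewritten to
// refer to the new positions. `root` is the index of the root in `nodes`.
std::vector<KdNode> reorderDepthFirst(const std::vector<KdNode>& nodes,
                                      std::int32_t root) {
    std::vector<KdNode> out;
    out.reserve(nodes.size());
    if (nodes.empty()) return out;

    struct Pending {
        std::int32_t oldIndex;   // position in the input table
        std::int32_t parentNew;  // new index of the parent, -1 for the root
        bool         isRight;    // which link of the parent to patch
    };
    std::vector<Pending> stack;
    stack.push_back({root, -1, false});

    while (!stack.empty()) {
        const Pending p = stack.back();
        stack.pop_back();

        const std::int32_t newIndex = static_cast<std::int32_t>(out.size());
        out.push_back(nodes[p.oldIndex]);

        // Patch the parent's link now that this node's new index is known.
        if (p.parentNew >= 0) {
            if (p.isRight) out[p.parentNew].right = newIndex;
            else           out[p.parentNew].left  = newIndex;
        }

        const KdNode& n = nodes[p.oldIndex];
        // Push right first so the left child is popped next and lands adjacent.
        if (n.right >= 0) stack.push_back({n.right, newIndex, true});
        if (n.left  >= 0) stack.push_back({n.left,  newIndex, false});
    }
    return out;
}
```

With this layout a traversal step that descends to the left child reads the very next table entry, and the left index could even be dropped from the node to shrink it.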

For the sake of experiment I’ve set the camera FOV to 1.0 (all rays should take almost the same path, so no divergence), and in this case performance was ~70 MRays / second.

All tests are done looking from the corner of the sponza scene, so the whole atrium is visible to the camera.
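
As an illustration of replacing if blocks with a bit of arithmetic, here is a minimal sketch of choosing the near and far child of a kd-tree node without branching on the ray data. It is not the kernel discussed in this thread: the `InnerNode` layout and the `orderChildren` helper are hypothetical, and the code is plain C++ that could be moved into a device function. Whether it reduces divergence in a given kernel depends on what the compiler already predicates.

```cpp
#include <cstdint>

// Hypothetical inner-node layout, for illustration only.
struct InnerNode {
    float        split;     // split plane position along `axis`
    std::int32_t axis;      // 0 = x, 1 = y, 2 = z
    std::int32_t child[2];  // child[0] = below the plane, child[1] = above
};

struct NearFar {
    std::int32_t nearChild;  // child the ray enters first
    std::int32_t farChild;   // child on the other side of the plane
};

// Orders the two children for a ray without an if/else: the comparison
// results (0 or 1) are combined and used directly as an array index.
inline NearFar orderChildren(const InnerNode& n,
                             const float origin[3],
                             const float dir[3]) {
    const float o = origin[n.axis];
    const float d = dir[n.axis];

    const int above   = o > n.split;   // origin lies above the plane
    const int onPlane = o == n.split;  // origin lies exactly on the plane
    const int upward  = d > 0.0f;      // ray heads toward the upper side

    // Start in the upper child if the origin is above the plane, or if it
    // is on the plane and the ray heads upward.
    const int startsAbove = above | (onPlane & upward);

    return { n.child[startsAbove], n.child[startsAbove ^ 1] };
}
```

Every thread executes the same instructions here regardless of its ray, which is the trade described above: a little redundant work in exchange for fewer divergent paths.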


I am also starting a ray tracing project and would like to know how you debug your CUDA code. I’ve tried setting up emuDebug, but in the C++ code, when setting up the D3D texture I get an error message saying “this feature is not yet implemented”. I also noticed in the CUDA SDK that the other D3D texture examples didn’t have any debug builds.

Please let me know how you handled this.



“Not yet implemented” usually means you have a bad toolkit/driver combo: some of the stuff in the toolkit isn’t in the driver.
Go to the CUDA download page and download the most recent versions there if you can.

With the latest driver & toolkit (I’m using the x64 Windows Vista version) there is no problem with binding memory to textures in emulation.

The flip side is that some CUDA/DX9 interop calls do not work in emulation
(cudaD3D9(Register/Map)Resource; you need to emulate them by manually locking the texture to get a pointer, then binding it as a texture or passing it to the kernel),
but that’s only a few additional lines of code, so you can live with it :)

OK, thanks guys. DarkAr, any chance you could give an example of how to lock the texture manually? I’m pretty much a noob when it comes to CUDA programming at the mo!

here you go:

void c_KDTREE_GPU::RayCastGPU(c_D3DTEXTURE *rt)
{
#ifndef __DEVICE_EMULATION__
 // GPU build: map the D3D9 resource and ask CUDA for its pointer and pitch
 IDirect3DResource9 *rttex = (IDirect3DResource9*)rt->GetTexture();
 if (cudaD3D9MapResources(1, &rttex) != cudaSuccess)
  freaked_error("Cannot map Cuda resource !");

 void  *TexData;
 size_t TexPitch;
 cudaD3D9ResourceGetMappedPointer(&TexData, rttex, 0, 0);
 cudaD3D9ResourceGetMappedPitch  (&TexPitch, NULL, rttex, 0, 0);
#else
 // Emulation build: lock the texture through the wrapper class instead
 void *TexData;
 int   TexPitch;
 if (rt->Lock(0, D3DLOCK_DISCARD, &TexData, &TexPitch) == FALSE)
  return;
#endif

 // .... call kernel

#ifndef __DEVICE_EMULATION__
 if (cudaD3D9UnmapResources(1, &rttex) != cudaSuccess)
  freaked_error("Cannot unmap Cuda resource !");
#endif
}






In other words, in emulation mode you need to obtain the pointers to the DX9 objects yourself ;)

Hello, thanks for your reply, but I’m still having trouble. Do you think you could tell me what rt is? I don’t know which of my members has Lock(). Here’s my code:

bool g_bDone = false;

IDirect3D9  * g_pD3D; // Used to create the D3DDevice

IDirect3DDevice9* g_pD3DDevice;

const unsigned int g_WindowWidth = 512;

const unsigned int g_WindowHeight = 512;

// Data structure for 2D texture shared between DX9 and CUDA
struct
{
	IDirect3DTexture9* pTexture;
	int width;
	int height;
} g_texture_2d;

// The CUDA kernel launchers that get called
extern "C"
{
	void runTest(void* surface, size_t width, size_t height, size_t pitch, float t);
}


// Forward declarations



HRESULT InitTextures();

void RunKernels();

void DrawScene();

void Cleanup();

void Render();


int main(int argc, char* argv[])
{
	// create window
	// Register the window class
	WNDCLASSEX wc = { sizeof(WNDCLASSEX), CS_CLASSDC, MsgProc, 0L, 0L,
		GetModuleHandle(NULL), NULL, NULL, NULL, NULL,
		"CUDA Raytracing Test", NULL };
	RegisterClassEx( &wc );

	// Create the application's window
	HWND hWnd = CreateWindow( wc.lpszClassName, "CUDA Raytracing Test",
		WS_OVERLAPPEDWINDOW, 0, 0, g_WindowWidth, g_WindowHeight,
		NULL, NULL, wc.hInstance, NULL );
	ShowWindow(hWnd, SW_SHOWDEFAULT);

	// Initialize Direct3D
	if( SUCCEEDED( InitD3D(hWnd) ) && SUCCEEDED( InitTextures() ) )
	{
		// register the Direct3D resources that we'll use
		// we'll read to and write from g_texture_2d, so don't set any special map flags for it
#ifndef __DEVICE_EMULATION__
		cudaD3D9RegisterResource(g_texture_2d.pTexture, cudaD3D9RegisterFlagsNone);
		CUT_CHECK_ERROR("cudaD3D9RegisterResource (g_texture_2d) failed");

		// Initialize this texture to be black
		cudaD3D9MapResources (1, (IDirect3DResource9 **)&g_texture_2d.pTexture);
		void* data;
		size_t size;
		cudaD3D9ResourceGetMappedPointer(&data, g_texture_2d.pTexture, 0, 0);
		cudaD3D9ResourceGetMappedSize(&size, g_texture_2d.pTexture, 0, 0);
		cudaMemset(data, 0, size);
		cudaD3D9UnmapResources (1, (IDirect3DResource9 **)&g_texture_2d.pTexture);
#else
		void* data;
		int pitch;
		if(Lock(0, D3DLOCK_DISCARD, &data, &pitch) == FALSE)
			return 0;
#endif



And then later on I do this: (I don’t know if this needs to be locked as well)


//! Run the Cuda part of the computation
void RunKernels()
{
	static float t = 0.0f;

	// populate the 2d texture
	{
		void* pData;
		size_t pitch;
		cudaD3D9ResourceGetMappedPointer(&pData, g_texture_2d.pTexture, 0, 0);
		cudaD3D9ResourceGetMappedPitch(&pitch, NULL, g_texture_2d.pTexture, 0, 0);
		runTest(pData, g_texture_2d.width, g_texture_2d.height, pitch, t);
	}

	t += 0.01;
}

Oops, just saw the declaration of rt. But I would still be grateful if you could point out what I need to do to my code lol, cos it’s a bit different to yours. Do I need to change it a bit so I can still lock the texture? It’s still not working even though I have upgraded the SDK/Toolkit and drivers.

Can anyone help me?