Raycasting performance on GPU

OK, I'm finishing my raycaster, but I'm curious what performance others of you are getting in this field.
My results so far (for the 'sponza' model, http://hdri.cgtechniques.com/~sponza/files/):

GFX card: NVIDIA GeForce GTX 280 @ 1.3 GHz
KD-tree construction with SAH (depth = 20) → 1.3 sec (which is quite OK compared to an optimized CPU solution; a rough sketch of the SAH cost being minimized follows below)

Average traversal performance → 2.7 MRays/second
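(For context, the surface area heuristic cost that such a builder minimizes at each candidate split looks roughly like the snippet below; the traversal/intersection constants here are illustrative assumptions, not values from this build.)

// Rough sketch of the SAH cost evaluated for each candidate split plane.
float SahCost(float areaLeft, float areaRight, float areaParent,
              int numLeft, int numRight)
{
    const float costTraverse  = 1.0f;   // assumed relative cost of one traversal step
    const float costIntersect = 1.5f;   // assumed relative cost of one triangle test
    float pLeft  = areaLeft  / areaParent;   // probability a ray entering the parent hits the left child
    float pRight = areaRight / areaParent;   // probability it hits the right child
    return costTraverse + costIntersect * (pLeft * numLeft + pRight * numRight);
}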

IMHO the performance of traversal is poor ;(
Currently I'm using a push-down with short-stack traversal algorithm.
The stack is 8 items per thread, which gives 192 threads per block (shared memory is so small); each thread calculates one ray… and the whole thing suffers a lot from warp divergence.
Packetized traversal eats too many registers, and in the end the result is worse
(I've experimented with 2x2 packets).
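(For illustration, here is a rough sketch of a per-ray push-down + short-stack kd-tree traversal kernel in the spirit described above. The node layout, stack-entry format, leaf test and restart handling are my assumptions for the sketch, not the poster's actual kernel, and degenerate cases such as rays parallel to a split plane are glossed over.)

#include <cuda_runtime.h>

#define SHORT_STACK_SIZE 8           // 8 entries per thread, as in the post

struct KdNode {                      // assumed node layout
    float split;                     // split position (internal nodes)
    int   axis;                      // 0/1/2 = x/y/z split axis, 3 = leaf
    int   left;                      // left/near child index (or first triangle for a leaf)
    int   right;                     // right/far child index (or triangle count for a leaf)
};

struct StackEntry { int node; float tMin, tMax; };

// Placeholder leaf test: a real version intersects the leaf's triangles and
// accepts the closest hit with t inside [tMin, tMax].
__device__ bool IntersectLeaf(const KdNode &n, float3 o, float3 d,
                              float tMin, float tMax, float *tHit)
{
    return false;                    // triangle tests omitted in this sketch
}

// One ray per thread. The shared-memory stacks are sized at launch as
// blockDim.x * SHORT_STACK_SIZE * sizeof(StackEntry).
__global__ void TraceKernel(const KdNode *nodes, const float3 *orig,
                            const float3 *dir, float *hitT,
                            float sceneMin, float sceneMax, int numRays)
{
    extern __shared__ StackEntry smem[];
    StackEntry *stk = smem + threadIdx.x * SHORT_STACK_SIZE;

    int ray = blockIdx.x * blockDim.x + threadIdx.x;
    if (ray >= numRays) return;

    float3 o = orig[ray], d = dir[ray];
    float tMin = sceneMin, tMax = sceneMax;  // ray assumed clipped to the scene box
    float tHit = 1e30f;
    int node = 0;                            // current node (root = 0)
    int count = 0, top = 0;                  // circular short stack: entry count and write pos
    int pushdownNode = 0;                    // restart point for the push-down rule
    bool pushdown = true;

    for (;;) {
        KdNode n = nodes[node];

        if (n.axis == 3) {                               // leaf
            if (IntersectLeaf(n, o, d, tMin, tMax, &tHit)) break;   // closest hit found
            if (count > 0) {                             // pop the nearest pending far child
                top = (top - 1 + SHORT_STACK_SIZE) % SHORT_STACK_SIZE;
                node = stk[top].node;
                tMin = stk[top].tMin;
                tMax = stk[top].tMax;
                --count;
            } else if (tMax < sceneMax) {                // stack empty but ray not finished:
                node = pushdownNode;                     // restart from the push-down node
                tMin = tMax;
                tMax = sceneMax;
                pushdown = true;
            } else break;                                // ray left the scene, no hit
            continue;
        }

        // Internal node: where does the ray cross the split plane?
        float oa = (n.axis == 0) ? o.x : (n.axis == 1) ? o.y : o.z;
        float da = (n.axis == 0) ? d.x : (n.axis == 1) ? d.y : d.z;
        float tSplit = (n.split - oa) / da;
        int nearChild = (oa < n.split) ? n.left  : n.right;
        int farChild  = (oa < n.split) ? n.right : n.left;

        if (tSplit >= tMax || tSplit < 0.0f) {
            node = nearChild;                            // only the near child is touched
        } else if (tSplit <= tMin) {
            node = farChild;                             // only the far child is touched
        } else {
            stk[top].node = farChild;                    // push the far child; when the
            stk[top].tMin = tSplit;                      // stack is full this overwrites
            stk[top].tMax = tMax;                        // the oldest (farthest) entry,
            top = (top + 1) % SHORT_STACK_SIZE;          // which a later restart recovers
            if (count < SHORT_STACK_SIZE) ++count;
            node = nearChild;
            tMax = tSplit;
            pushdown = false;                            // both children needed from here on
        }
        if (pushdown) pushdownNode = node;               // node still spans [tMin, sceneMax]
    }
    hitT[ray] = tHit;
}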

You might want to check out this guy's implementation; he has some good tips for CUDA optimization:
bouliiii's blog, "Cuda real time ray tracing - 100 millions ray/s?": http://bouliiii.blogspot.com/2008/08/real-...h-cuda-100.html

NVIDIA demoed a raytracer at NVISION (and other events) that uses a BVH (possibly with kd-trees inside the BVH, btw). As far as I remember, they also do packet traversal.

No, I believe it’s a simple thread-per-ray.

The presentation is here:
http://developer.nvidia.com/object/nvision08-IRT.html

I think you are right. I had it in my mind that they tried to group threads (read: rays) that follow comparable paths together in blocks, but that indeed seems not to be correct.

After some tweaking & slashing I've gone from 2.7 MRays/second to ~12 MRays/second
(I've reduced the number of if-blocks to the minimum; some work is done redundantly now, but the divergence is much lower).
Another thing I've discovered is an almost 100% texture cache miss rate when sampling kd-tree branches, so now I'm trying to rearrange the node table to be more cache friendly :)
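(For illustration, one simple way to make the node table more cache friendly is a depth-first re-layout, so that each node's near child immediately follows it in memory and a texture fetch of the parent tends to pull its first child into the same cache line. The node format below is an assumption for the sketch, not the poster's actual scheme.)

#include <vector>

struct KdNode {
    float split;
    int   axis;        // 0/1/2 = split axis, 3 = leaf
    int   left, right; // child indices (internal) or triangle range (leaf)
};

// Depth-first re-layout: the left/near child always lands at parent index + 1.
static int Relayout(const std::vector<KdNode> &in, int node,
                    std::vector<KdNode> &out)
{
    int slot = (int)out.size();
    out.push_back(in[node]);
    if (in[node].axis != 3) {                   // internal node: re-emit children
        int l = Relayout(in, in[node].left,  out);
        int r = Relayout(in, in[node].right, out);
        out[slot].left  = l;
        out[slot].right = r;
    }
    return slot;
}

std::vector<KdNode> MakeCacheFriendly(const std::vector<KdNode> &nodes)
{
    std::vector<KdNode> out;
    out.reserve(nodes.size());
    Relayout(nodes, 0, out);                    // root assumed at index 0
    return out;
}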

For the sake of experiment I've set the camera FOV to 1.0 (all rays should take almost the same path, so no divergence), and in this case performance was ~70 MRays/second.

All tests are done looking from the corner of the Sponza scene, so the whole atrium is visible to the camera.

Hi,

I am also starting a ray tracing project and would like to know how you debug your CUDA code. I’ve tried setting up emuDebug, but in the C++ code, when setting up the D3D texture I get an error message saying “this feature is not yet implemented”. I also noticed in the CUDA SDK that the other D3D texture examples didn’t have any debug builds.

Please let me know how you handled this,

Thanks,

Seb

"Not yet implemented" usually means you have a bad toolkit/driver combo. Some of the stuff in the toolkit isn't in the driver.
Go to the CUDA download page and download the most recent stuff there if you can.

With the latest driver & toolkit (I'm using the x64 Windows Vista version) there is no problem with binding memory to textures in emulation.

The other side of the coin is that some CUDA/DX9 interop calls do not work
(cudaD3D9(Register/Map)Resource), and you need to emulate this by manually locking the texture to get a pointer, then binding it as a texture or passing it to the kernel.
But that's only a few additional lines of code, so you can live with it :)

OK, thanks guys. DarkAr, any chance you could give an example of how to lock the texture manually? I'm pretty much a noob when it comes to CUDA programming at the mo!

here you go:

void c_KDTREE_GPU::RayCastGPU(c_D3DTEXTURE *rt)
{
#ifndef __DEVICE_EMULATION__
    // device build: map the D3D9 texture through the CUDA interop API
    IDirect3DResource9 *rttex = (IDirect3DResource9*)rt->GetTexture();
    if (cudaD3D9MapResources(1, &rttex) != cudaSuccess)
        freaked_error("Cannot map Cuda resource !");

    void   *TexData;
    size_t  TexPitch;
    cudaD3D9ResourceGetMappedPointer(&TexData, rttex, 0, 0);
    cudaD3D9ResourceGetMappedPitch  (&TexPitch, NULL, rttex, 0, 0);
#else
    // emulation build: lock the texture manually to get a plain pointer
    void *TexData;
    int   TexPitch;
    if (rt->Lock(0, D3DLOCK_DISCARD, &TexData, &TexPitch) == FALSE)
        return;
#endif

    // .... call kernel

#ifndef __DEVICE_EMULATION__
    if (cudaD3D9UnmapResources(1, &rttex) != cudaSuccess)
        freaked_error("Cannot unmap Cuda resource !");
#else
    rt->Unlock(0);
#endif

    return;
}

In other words, in emulation mode you need to obtain the pointers to the DX9 objects yourself ;)
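(For reference, a minimal sketch of what such a manual lock boils down to on a raw IDirect3DTexture9. It assumes the texture was created lockable, e.g. with D3DUSAGE_DYNAMIC, and is an illustration rather than DarkAr's actual c_D3DTEXTURE wrapper.)

#include <d3d9.h>

// Lock mip level 0 of a D3D9 texture and return the raw pointer + row pitch,
// so the emulated kernel can write into it directly.
bool LockTexture(IDirect3DTexture9 *tex, void **data, int *pitch)
{
    D3DLOCKED_RECT rect;
    if (FAILED(tex->LockRect(0, &rect, NULL, D3DLOCK_DISCARD)))
        return false;                 // lock failed
    *data  = rect.pBits;              // pointer passed to the emulated kernel
    *pitch = rect.Pitch;              // row pitch in bytes
    return true;
}

void UnlockTexture(IDirect3DTexture9 *tex)
{
    tex->UnlockRect(0);               // release the lock after the kernel runs
}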

Hello, thanks for your reply, but I'm still having trouble. Do you think you could tell me what rt is? I don't know which of my members has Lock(). Here's my code:

bool g_bDone = false;
IDirect3D9*       g_pD3D;        // Used to create the D3DDevice
IDirect3DDevice9* g_pD3DDevice;

const unsigned int g_WindowWidth  = 512;
const unsigned int g_WindowHeight = 512;

// Data structure for 2D texture shared between DX9 and CUDA
struct
{
    IDirect3DTexture9* pTexture;
    int width;
    int height;
} g_texture_2d;

// The CUDA kernel launchers that get called
extern "C"
{
    void runTest(void* surface, size_t width, size_t height, size_t pitch, float t);
}

//-----------------------------------------------------------------------------
// Forward declarations
//-----------------------------------------------------------------------------
HRESULT InitD3D( HWND hWnd );
HRESULT InitTextures();
void RunKernels();
void DrawScene();
void Cleanup();
void Render();
LRESULT WINAPI MsgProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam);

int main(int argc, char* argv[])
{
    // create window
    // Register the window class
    WNDCLASSEX wc = { sizeof(WNDCLASSEX), CS_CLASSDC, MsgProc, 0L, 0L,
                      GetModuleHandle(NULL), NULL, NULL, NULL, NULL,
                      "CUDA Raytracing Test", NULL };
    RegisterClassEx( &wc );

    // Create the application's window
    HWND hWnd = CreateWindow( wc.lpszClassName, "CUDA Raytracing Test",
                              WS_OVERLAPPEDWINDOW, 0, 0, g_WindowWidth, g_WindowHeight,
                              NULL, NULL, wc.hInstance, NULL );
    ShowWindow(hWnd, SW_SHOWDEFAULT);
    UpdateWindow(hWnd);

    // Initialize Direct3D
    if( SUCCEEDED( InitD3D(hWnd) ) && SUCCEEDED( InitTextures() ) )
    {
        // register the Direct3D resources that we'll use
        // we'll read to and write from g_texture_2d, so don't set any special map flags for it
#ifndef __DEVICE_EMULATION__
        cudaD3D9RegisterResource(g_texture_2d.pTexture, cudaD3D9RegisterFlagsNone);
        CUT_CHECK_ERROR("cudaD3D9RegisterResource (g_texture_2d) failed");

        // Initialize this texture to be black
        {
            cudaD3D9MapResources(1, (IDirect3DResource9 **)&g_texture_2d.pTexture);
            void* data;
            size_t size;
            cudaD3D9ResourceGetMappedPointer(&data, g_texture_2d.pTexture, 0, 0);
            cudaD3D9ResourceGetMappedSize(&size, g_texture_2d.pTexture, 0, 0);
            cudaMemset(data, 0, size);
            cudaD3D9UnmapResources(1, (IDirect3DResource9 **)&g_texture_2d.pTexture);
        }
#else
        void* data;
        int   pitch;
        // <-- this is where I'm stuck: nothing here has a Lock() method
        if(Lock(0, D3DLOCK_DISCARD, &data, &pitch) == FALSE)
            return 0;
#endif
    }

And then later on I do this: (I don’t know if this needs to be locked as well)

////////////////////////////////////////////////////////////////////////////////
//! Run the Cuda part of the computation
////////////////////////////////////////////////////////////////////////////////
void RunKernels()
{
    static float t = 0.0f;

    // populate the 2d texture
    {
        void*  pData;
        size_t pitch;
        cudaD3D9ResourceGetMappedPointer(&pData, g_texture_2d.pTexture, 0, 0);
        cudaD3D9ResourceGetMappedPitch(&pitch, NULL, g_texture_2d.pTexture, 0, 0);
        runTest(pData, g_texture_2d.width, g_texture_2d.height, pitch, t);
    }
    t += 0.01;
}

Oops, just saw the declaration of rt. But I would still be grateful if you could point out what I need to do to my code, lol, because it's a bit different from yours. Do I need to change it a bit so I can still lock the texture? It's still not working even though I have upgraded the SDK/toolkit and drivers.

Can anyone help me?