Access data contained in Direct3D9 Surface: DirectX Interoperability

WinXP 32-bit

GeForce 8600 GT

169.09 driver

CUDA 1.1

VC++ 2005 Express - SDK projects work

I am trying to transfer the data contained in a Direct3D9 surface into CUDA. I know that CUDA can only interface directly with vertex buffers, but I was wondering whether there is a workaround I could use, perhaps involving extra memory transfers.

This is possible if I route the copies through the CPU, but that pushes the CPU load beyond an acceptable level for my purposes. I want to keep the data in video memory at all stages. Can cudaMemcpy be used as a generic memcpy substitute that executes on the device? I am under the impression that the pointers it accepts reside on the host and point to video memory. Here is the sample code I have tried, which fails. g_pd3dSurface is a surface, while g_pVB is a vertex buffer.

D3DLOCKED_RECT lockedRect;
RECT rect1;
rect1.top = g_WindowHeight;
rect1.bottom = 0;
rect1.left = 0;
rect1.right = g_WindowWidth;

// Lock the surface for copying from
g_pd3dSurface->LockRect( &lockedRect, &rect1, D3DLOCK_READONLY );
printf( "%d\n", g_WindowHeight );    // Output is 320
printf( "%d\n", g_WindowWidth );     // Output is 200
printf( "%d\n", lockedRect.Pitch );  // Output is 0

// Lock the vertex buffer for copying to
void *ppDest;
g_pVB->Lock( 0, 0, &ppDest, 0 );

//CUDA_SAFE_CALL( cudaMemcpy( ppDest, lockedRect.pBits, 100 * 100,
//           cudaMemcpyDeviceToDevice ) );

This code is derived from ... The sizes of both g_pd3dSurface and g_pVB are g_WindowWidth*g_WindowHeight, in texels. I kept the cudaMemcpy size small as a test. With the CUDA_SAFE_CALL line commented out (so that no CUDA code runs), the code executes repeatedly and behaves as it should, but when the line is executed it fails with cudaError_enum at a memory location. Furthermore, once this error occurs, I must restart the computer before I can run the program again.
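The pitch of 0 printed in the snippet hints that LockRect may not have succeeded, in which case lockedRect.pBits would not be meaningful either. Below is a minimal sketch of two sanity checks that could run before any copy is attempted. It assumes the g_WindowHeight assignment in the snippet targets rect1.top. The Rect struct is a hypothetical stand-in for the Win32 RECT so that the sketch compiles without the Windows headers, and both helper names are invented for illustration. In real code, the first argument of lockLooksUsable would presumably come from testing the HRESULT returned by LockRect with SUCCEEDED.

```cpp
#include <cstddef>

// Hypothetical stand-in for the Win32 RECT structure (same field names).
struct Rect { long left; long top; long right; long bottom; };

// True when the rectangle is non-empty, follows the usual Win32
// convention (left < right, top < bottom), and lies inside the surface.
bool isValidLockRect(const Rect& r, long surfaceWidth, long surfaceHeight)
{
    return r.left >= 0 && r.top >= 0
        && r.left < r.right && r.top < r.bottom
        && r.right <= surfaceWidth && r.bottom <= surfaceHeight;
}

// True when a lock result looks usable: the call reported success, the
// data pointer is non-null, and the pitch covers at least one full row.
bool lockLooksUsable(bool lockSucceeded, const void* bits, int pitch, long rowBytes)
{
    return lockSucceeded && bits != NULL && pitch > 0 && pitch >= rowBytes;
}
```

With the values from the snippet (top = 320, bottom = 0), isValidLockRect returns false, so swapping top and bottom would be one thing to try before looking at the cudaMemcpy call itself.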

I am probably making a naive mistake, but is it at all possible to get at the data contained in a D3D9 surface from within CUDA without resorting to CPU memory transfers?