Poor performance on Vista: kernel runs slower on Vista than on WinXP

Hi,

I’ve written a kernel to calculate a hologram using the Gerchberg-Saxton algorithm (an iterative Fourier-transform algorithm). On Windows XP the kernel computes the hologram in about 4 ms, but on Windows Vista it takes 11 ms, almost 3 times slower! :thumbsdown:

Does anyone have any idea why this could be? Is it a problem with the driver, the compiler, or CUDA?

I’m using CUDA 2.1 beta on a GeForce 8800 GTS, using nVidia drivers 181.20 and compiling the program with VS 2008.

Can someone please help?

Are you including the cost of context creation? Other calls? Thanks to the Vista driver model, the overhead of a single kernel launch or system call is greater under Vista than under XP. If you want to send me a repro, I can investigate other potential causes.
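If you’re not sure whether context creation falls inside your timed region, you can force it beforehand; the context is created lazily on the first runtime call, so any throwaway call before starting the timer will do. A minimal sketch:

[codebox]// The CUDA context is created lazily on the first runtime call, which
// can take a significant amount of time; trigger it here so it does
// not pollute the measurement below.
CUDA_SAFE_CALL( cudaFree(0) );

// ... start the timer and launch kernels as usual ...[/codebox]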

It may also be a timing issue… are you sure you’re synchronizing, so you know the compute is complete before you stop the timer?

An EMPTY kernel which does no work usually takes about 10 us.
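If you want to check that number on your own setup, here is a quick sketch using the cutil timers from the SDK (emptyKernel is just a placeholder):

[codebox]// Average the round-trip cost of an empty launch (launch + synchronize)
// over many iterations; this approximates the per-launch overhead.
__global__ void emptyKernel() {}

unsigned int timer = 0;
CUT_SAFE_CALL( cutCreateTimer(&timer) );
for (int i = 0; i < 10000; i++) {
	CUT_SAFE_CALL( cutStartTimer(timer) );
	emptyKernel<<<1, 1>>>();
	CUDA_SAFE_CALL( cudaThreadSynchronize() );
	CUT_SAFE_CALL( cutStopTimer(timer) );
}
printf("Per-launch cost: %f ms\n", cutGetAverageTimerValue(timer));[/codebox]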

Thank you for your advice. Here is what I do:

[codebox]////////////////////////////////////////////////////////////////////////////////
// Launch the CUDA kernels to fill in the texture data
////////////////////////////////////////////////////////////////////////////////
void Render()
{
	// Draw the text window
	DrawTextWin();

	// Map the resources we've registered so we can access them in CUDA
	CUDA_SAFE_CALL( cudaD3D9MapResources(1, (IDirect3DResource9 **)&g_pTexture) );
	CUDA_SAFE_CALL( cudaD3D9MapResources(1, (IDirect3DResource9 **)&g_pText) );

	// Get the mapped pointers, pitches and size used by the kernel
	void *pCGH;
	CUDA_SAFE_CALL( cudaD3D9ResourceGetMappedPointer(&pCGH, g_pTexture, 0, 0) );
	size_t pitchCGH = 0;
	CUDA_SAFE_CALL( cudaD3D9ResourceGetMappedPitch(&pitchCGH, NULL, g_pTexture, 0, 0) );

	void *pText;
	CUDA_SAFE_CALL( cudaD3D9ResourceGetMappedPointer(&pText, g_pText, 0, 0) );
	size_t pitchText = 0, size = 0;
	CUDA_SAFE_CALL( cudaD3D9ResourceGetMappedPitch(&pitchText, NULL, g_pText, 0, 0) );
	CUDA_SAFE_CALL( cudaD3D9ResourceGetMappedSize(&size, g_pText, 0, 0) );

	// Start timer
	CUT_SAFE_CALL( cutResetTimer(g_handles.hTimer) );
	CUT_SAFE_CALL( cutStartTimer(g_handles.hTimer) );

	// Run the kernel which populates the texture with the CGH
	calcCGH(pText, size, pitchText, pCGH, pitchCGH);

	// Wait for the kernel to finish, then display the time elapsed
	CUDA_SAFE_CALL( cudaThreadSynchronize() );
	CUT_SAFE_CALL( cutStopTimer(g_handles.hTimer) );
	double gpuTime = cutGetTimerValue(g_handles.hTimer);
	fprintf(stderr, "GPU time: %f msecs.\n", gpuTime);

	// Unmap the resources
	CUDA_SAFE_CALL( cudaD3D9UnmapResources(1, (IDirect3DResource9 **)&g_pTexture) );
	CUDA_SAFE_CALL( cudaD3D9UnmapResources(1, (IDirect3DResource9 **)&g_pText) );

	// Draw the scene using them
	DrawCGH();

	// Compute FPS
	computeFPS();
}[/codebox]

First I draw a window with some text, then I map some D3D resources, then I start the timer and calculate the hologram. calcCGH is the CUDA kernel, which sits in an external .cu file. When that is finished, I stop the timer, unmap the D3D resources and draw the hologram.

The timings for WinXP are:

GPU time: 3.567568 msecs.

GPU time: 3.582057 msecs.

GPU time: 3.636186 msecs.

GPU time: 3.579191 msecs.

GPU time: 3.591956 msecs.

And for Vista:

GPU time: 9.400078 msecs.

GPU time: 9.366274 msecs.

GPU time: 9.629995 msecs.

GPU time: 9.704026 msecs.

GPU time: 9.291684 msecs.

Any advice?

After some tests I could determine that the culprits are cudaMalloc(), cudaMemcpy(), and cudaFree(). These functions caused my algorithm to run 3 times slower on Vista than on XP. After moving them out of the loop, the algorithm performed very similarly on both platforms, though still a bit slower on Vista (I guess this must be the difference in kernel-launch overhead).

I will do some more tests and report back with more details and timings.
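One thing I also plan to try for the cudaMemcpy() cost is page-locked host memory; if I understand the programming guide correctly, the driver can DMA directly from memory allocated with cudaMallocHost(), which should make each copy cheaper. A sketch of what I have in mind (h_image, d_image and size as in my code):

[codebox]// Allocate page-locked (pinned) host memory once, outside the loop;
// transfers from pinned memory skip an extra staging copy.
float *h_image = NULL, *d_image = NULL;
CUDA_SAFE_CALL( cudaMallocHost((void**)&h_image, sizeof(float) * size) );
CUDA_SAFE_CALL( cudaMalloc((void**)&d_image, sizeof(float) * size) );

// ... copy/kernel loop as before ...

CUDA_SAFE_CALL( cudaFree(d_image) );
CUDA_SAFE_CALL( cudaFreeHost(h_image) );[/codebox]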

Yeah, I’ve seen similar performance characteristics with Vista. Every call that has to go through the WDDM interface takes a lot longer than on XP or Linux. OSX has similar issues as well.
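One partial mitigation, if your algorithm allows it: kernel launches are asynchronous and the WDDM driver batches command submission, so queuing several launches before a single synchronize amortizes the per-call cost. A rough sketch (kernelStep, nIterations, grid, block and d_data are placeholders):

[codebox]// Queue a batch of launches back-to-back; launches are asynchronous,
// so the driver can submit them to the GPU together. Synchronize once
// at the end instead of after every launch.
for (int i = 0; i < nIterations; i++) {
	kernelStep<<<grid, block>>>(d_data);
}
CUDA_SAFE_CALL( cudaThreadSynchronize() );[/codebox]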

Thank you for the comment. Is this going to be addressed at some point, or is there no way around it?

I have written a test program (attached) to measure the timings of these functions. Compiling and running the code on XP, the execution takes 0.30 ms, while on Vista it takes 1.3 ms (~4 times more).

Taking cudaMalloc() and cudaFree() out of the loop like this:

[codebox]…
unsigned int timer = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer));
CUDA_SAFE_CALL( cudaMalloc((void**) &d_image, sizeof(float) * size));

for( int i = 0; i < 10000; i++ ) {
	CUT_SAFE_CALL( cutStartTimer( timer));
	CUDA_SAFE_CALL( cudaMemcpy(d_image, h_image, sizeof(float) * size, cudaMemcpyHostToDevice));
	compPixel<<<52, 128>>>(d_image, 1.f, 81, 81);
	CUT_CHECK_ERROR("compPixel() execution failed\n");
	CUDA_SAFE_CALL( cudaMemcpy(h_image, d_image, sizeof(float) * size, cudaMemcpyDeviceToHost));
	CUDA_SAFE_CALL( cudaThreadSynchronize() );
	CUT_SAFE_CALL( cutStopTimer( timer));
}

CUDA_SAFE_CALL( cudaFree(d_image));
printf("Processing time: %f (ms)\n", cutGetAverageTimerValue( timer));
…[/codebox]

the execution on XP takes 0.08 ms, while on Vista it takes 0.25 ms (~3 times more).

Taking cudaMemcpy() out of the loop like this:

[codebox]…
unsigned int timer = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer));
CUDA_SAFE_CALL( cudaMalloc((void**) &d_image, sizeof(float) * size));
CUDA_SAFE_CALL( cudaMemcpy(d_image, h_image, sizeof(float) * size, cudaMemcpyHostToDevice));

for( int i = 0; i < 10000; i++ ) {
	CUT_SAFE_CALL( cutStartTimer( timer));
	compPixel<<<52, 128>>>(d_image, 1.f, 81, 81);
	CUT_CHECK_ERROR("compPixel() execution failed\n");
	CUDA_SAFE_CALL( cudaThreadSynchronize() );
	CUT_SAFE_CALL( cutStopTimer( timer));
}

CUDA_SAFE_CALL( cudaMemcpy(h_image, d_image, sizeof(float) * size, cudaMemcpyDeviceToHost));
CUDA_SAFE_CALL( cudaFree(d_image));
printf("Processing time: %f (ms)\n", cutGetAverageTimerValue( timer));
…[/codebox]

the execution takes 0.03 ms on XP, and 0.06 ms on Vista (~2 times more). So even just launching a kernel takes about twice as long on Vista. I guess this is pure launch overhead, as the test kernel is very simple and the image is quite small (81 x 81).
sample.zip (1.23 KB)

This is how Vista was designed by Microsoft. There is no workaround if you intend to use Vista (or Windows 7).

DrPhoton,

Did you write the GS code yourself or did you find it somewhere? Are you willing to share it? I’d like to use it for some SLM hologram generation. I don’t want to write new code if I don’t have to. I would acknowledge the use of it.

Yes, I wrote the code myself, but I’m afraid I can’t share it, as it is copyrighted. However, it’s very easy to port the algorithm to CUDA using CUFFT.
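To give you an idea of the structure, here is a bare-bones sketch of one Gerchberg-Saxton iteration with CUFFT. This is not my actual code; all the names are placeholders, and a real implementation also needs FFT normalization and a convergence check:

[codebox]#include <cufft.h>

// Placeholder constraint kernel: keep the phase of each element but
// replace its magnitude with the desired amplitude.
__global__ void applyAmplitude(cufftComplex *field, const float *amp, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) {
		float phase = atan2f(field[i].y, field[i].x);
		field[i].x = amp[i] * cosf(phase);
		field[i].y = amp[i] * sinf(phase);
	}
}

// One Gerchberg-Saxton iteration on an n x n complex field, in place.
void gsIteration(cufftHandle plan, cufftComplex *d_field,
                 const float *d_targetAmp, const float *d_sourceAmp, int n)
{
	int nElems = n * n;
	dim3 block(128);
	dim3 grid((nElems + block.x - 1) / block.x);

	// Propagate to the image plane and impose the target amplitude
	cufftExecC2C(plan, d_field, d_field, CUFFT_FORWARD);
	applyAmplitude<<<grid, block>>>(d_field, d_targetAmp, nElems);

	// Propagate back to the hologram plane and impose the source amplitude
	cufftExecC2C(plan, d_field, d_field, CUFFT_INVERSE);
	applyAmplitude<<<grid, block>>>(d_field, d_sourceAmp, nElems);
}[/codebox]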

I may be able to give you a hand if you get stuck :thumbup: