Performance difference of CUDA in Windows and Linux

Using CUDA-Z, I found what looks like a substantial difference in ‘Device to Device’ memory copy speed between Windows 64-bit and Linux 64-bit. According to the screenshots below, Windows is nearly 6x faster at ‘Device to Device’ copies. What exactly is a ‘Device to Device’ copy? Is that just moving data within the graphics card's own memory?

Also, are there other substantial performance differences between CUDA running on Windows versus Linux?
Screenshot_CUDA_Z_0.5.95.png
win7_64cudaz.png

That Linux device-to-device speed is way too low for the card you have. Either the measurement is wrong, or you have some kind of driver problem. What driver version are you running?
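
One way to rule out a measurement problem is to time the copy yourself. A device-to-device copy is a `cudaMemcpy` with `cudaMemcpyDeviceToDevice`, where both buffers live in the card's own memory. The following is only a minimal sketch, not what CUDA-Z or the SDK actually run: the `CHECK` macro, the `d2dBandwidthMBs` helper and the iteration count are made up for illustration, and it assumes a toolkit recent enough to accept `nullptr`. It counts every byte twice (one read, one write), which is a convention choice; halve the result if you want to compare against a tool that counts each byte once.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Convert a timed run of copies into MB/s, with 1 MB = 2^20 bytes.
// Each byte is read once and written once, so traffic is counted twice.
double d2dBandwidthMBs(size_t bytes, int iterations, float elapsedMs) {
    double traffic = 2.0 * (double)bytes * iterations;
    return traffic / (1 << 20) / (elapsedMs / 1000.0);
}

int main() {
    const size_t bytes = 33554432;  // same transfer size as the runs in this thread
    const int iterations = 10;      // assumed repeat count, chosen arbitrarily

    void *src = nullptr, *dst = nullptr;
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMalloc(&dst, bytes));
    CHECK(cudaMemset(src, 0, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Time the copies on the GPU timeline and wait for the last one to finish.
    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < iterations; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float elapsedMs = 0.0f;
    CHECK(cudaEventElapsedTime(&elapsedMs, start, stop));
    std::printf("Device to Device: %zu bytes x %d copies, %.1f MB/s\n",
                bytes, iterations, d2dBandwidthMBs(bytes, iterations, elapsedMs));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    return 0;
}
```

Built with something like `nvcc d2d.cu -o d2d` and run on both operating systems, this gives a number that does not depend on how any particular tool does its timing. The `cudaEventSynchronize` call matters: without it, the stop time can be read before the copies have actually completed.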

I’m currently running the 195.36.15 drivers on Ubuntu 9.10 x86_64. They were downloaded from the main NVIDIA site, not the ‘Developer Drivers for Linux (195.36.15)’ package found on the CUDA website.

Here’s the output of the bandwidthTest program from the SDK, run on my Linux system:

Running on......
      device 0:GeForce 8800 GTS

Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1645.9

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1468.4

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               10080.2

&&&& Test PASSED

Press ENTER to exit...

Can you also run the deviceQuery application? I’m wondering if your memory clock is stuck in some kind of low power mode.

Sure.

CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "GeForce 8800 GTS"
  CUDA Driver Version:                           3.0
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         0
  Total amount of global memory:                 670367744 bytes
  Number of multiprocessors:                     14
  Number of cores:                               112
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.35 GHz
  Concurrent copy and execution:                 No
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       No
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit...

Does anyone have any idea what the bandwidth values should be?
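
A rough upper bound can be worked out from the memory clock and bus width, which deviceQuery above does not print. The sketch below is one hedged way to get at it: it assumes a toolkit that provides `cudaDeviceGetAttribute` with `cudaDevAttrMemoryClockRate` and `cudaDevAttrGlobalMemoryBusWidth` (much newer than the 2.3 runtime shown above), and it assumes double-data-rate memory, so the factor of 2 is an approximation rather than a property of every card. The `peakBandwidthGBs` helper is a made-up name.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Theoretical peak in GB/s (10^9 bytes/s), assuming double-data-rate memory:
// two transfers per clock, busWidthBits / 8 bytes per transfer.
double peakBandwidthGBs(int memClockKHz, int busWidthBits) {
    return 2.0 * memClockKHz * 1e3 * (busWidthBits / 8.0) / 1e9;
}

int main() {
    const int device = 0;  // first CUDA device, as in the listings in this thread
    int memClockKHz = 0;
    int busWidthBits = 0;

    // Peak memory clock is reported in kHz, bus width in bits.
    cudaError_t err = cudaDeviceGetAttribute(&memClockKHz,
                                             cudaDevAttrMemoryClockRate, device);
    if (err == cudaSuccess)
        err = cudaDeviceGetAttribute(&busWidthBits,
                                     cudaDevAttrGlobalMemoryBusWidth, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "attribute query failed: %s\n",
                     cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    std::printf("memory clock %d kHz, bus width %d bits, peak about %.1f GB/s\n",
                memClockKHz, busWidthBits,
                peakBandwidthGBs(memClockKHz, busWidthBits));
    return 0;
}
```

As plain arithmetic, an 800 MHz memory clock on a 320-bit bus works out to 2 x 800e6 x 40 bytes = 64 GB/s. A measured copy rate normally lands somewhat below the theoretical figure, so a result far under it, or one above it, is a hint that either the clocks or the timing method deserve a closer look.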

I have an 8800 GTX and get approximately the same low device-to-device bandwidth on Linux that you do.

(By the way, you should have both the driver and the runtime at 3.0.)

I also have a 480 GTX, and on that card the Windows and Linux rates match.

linuxcuda.png

Does anyone know why there’s such a substantial difference in device to device bandwidth between Windows and Linux?

I have a 260 GTX, and the Windows 64-bit and Linux 64-bit rates match.
CUDA_Z_0.5.95_linux64bit.png
CUDA_Z_0.5.95_windows64bit.jpg


Here are the CUDA-Z and bandwidthTest results for the same system running Windows XP 32-bit with the latest officially released drivers (197.45):

bandwidth test:

Running on......
      device 0:GeForce 8800 GTS

Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1586.6

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1595.0

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               10142.6

&&&& Test PASSED

Press ENTER to exit...

These results appear to match what CUDA-Z reports on Linux 64-bit, as shown in the first post, so I’m not sure what to make of these numbers. I doubt that both my Windows XP 32-bit system and my Linux 64-bit system suffer from the same device-to-device bandwidth problem, but I still can’t explain why the device-to-device bandwidth is 6x higher in Windows 7 64-bit.
cudaz_winxp_19745.PNG
