bandwidthTest anomaly!

Hi,

I found a potential anomaly in the numbers reported by the CUDA bandwidthTest application. This is running on my Mac Pro 8-core (previous generation, not Nehalem) with an 8800 GT card. The host-to-device bandwidth is roughly 8x slower than device-to-host (189 MB/s vs. 1552 MB/s)! Am I missing something? I thought PCIe bandwidth would be symmetric, or at least that the read-back would be the slower direction, not the other way around.

I am trying out the CUDA 2.3 SDK on OS X 10.5.7 with the 2.3 CUDA driver.

Regards

Sunil

[codebox]~/GPU Computing/C/bin/darwin/release $ ./bandwidthTest
Running on…
  device 0:GeForce 8800 GT

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 189.2

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 1551.7

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 47118.6

&&&& Test PASSED

Press ENTER to exit…
~/GPU Computing/C/bin/darwin/release $
[/codebox]

How do things look when you run:

./bandwidthTest --memory=pinned

Hi,

Interesting: with --memory=pinned I get much better results in both directions. Can you shed some more light on this?

[codebox]~/GPU Computing/C/bin/darwin/release $ ./bandwidthTest --memory=pinned
Running on…
  device 0:GeForce 8800 GT

Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 5588.7

Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 4374.4

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 47122.8

&&&& Test PASSED[/codebox]

Pinned memory is a chunk of host memory that the CUDA driver has marked as "unmovable" to the operating system. The OS is not allowed to relocate it in physical memory (virtual address translation normally makes this possible without invalidating pointers) or swap the block out to disk.

This is important because memory transfers between the host and the CUDA device are done with DMA transactions, which require the host memory block to have a fixed physical address. If you do not use pinned memory, the driver instead stages the transfer through a private pinned buffer of its own: each chunk (possibly smaller than your request) is copied between your pageable block and that staging buffer, and DMA moves it between the staging buffer and the device, repeating until the whole transfer is complete. The overhead of the extra copy makes non-pinned (aka "pageable") transfers much slower than pinned transfers on many systems, usually about half speed. (The one exception is the triple-channel Core i7 systems, which have so much memory bandwidth that you barely notice the difference between pinned and pageable memory at all.)

Your extremely poor host-to-device bandwidth in the pageable case suggests something is very wrong with that two-stage copy process on your system. I don't know what would cause it. However, if you use pinned memory as a workaround until you figure that out, you'll be fine. (And your memory transfers will be faster on other systems as well.)

The CUDA programming guide has more info on pinned (also called “page-locked”) memory.
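In code, the difference is just which allocator you use for the host buffer. Here is a minimal sketch (my own illustration, not taken from bandwidthTest; the 32 MB size and the lack of error checking are just for brevity) that allocates one pageable and one pinned host buffer and copies each to the device:

[codebox]#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 32 * 1024 * 1024;  /* 32 MB, the size bandwidthTest uses in quick mode */
    float *d_buf, *h_pageable, *h_pinned;

    cudaMalloc((void **)&d_buf, bytes);

    /* Pageable host memory: a plain malloc; the driver has to stage the
       transfer through its own internal pinned buffer. */
    h_pageable = (float *)malloc(bytes);

    /* Pinned (page-locked) host memory: the driver can DMA to/from it directly. */
    cudaMallocHost((void **)&h_pinned, bytes);

    /* The copy calls are identical; only the host allocation differs. */
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}[/codebox]

cudaFreeHost() releases the page-locked allocation. Keep in mind that pinned memory is a limited resource: pinning very large buffers takes physical pages away from the OS, so it is best reserved for the buffers you actually stream across PCIe.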

I'm getting the same thing on my Mac Pro with OS X 10.5.7, CUDA 2.3, and the 2.3 driver. This seems to be a bug, since my MacBook Pro (8600M) running CUDA 2.1 gets about 10x better host-to-device performance for pageable memory.

Here is what my MacBook Pro shows.

[codebox]Running on…
 device 0:GeForce 8600M GT

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 1132.4

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 1006.8

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 9646.9

&&&& Test PASSED

Press ENTER to exit…
[/codebox]