bandwidthTest anomaly!

Hi,

I found a potential anomaly in the numbers reported by the CUDA bandwidthTest application. This is running on my Mac Pro 8-core (previous generation, not Nehalem) with an 8800 GT card. The host-to-device bandwidth is roughly 8x slower than device-to-host (189 MB/s vs. 1552 MB/s)! Am I missing something? I thought PCIe bandwidth would be symmetric, or at least that the read-back would be the slower direction, not the other way around.

I am trying out the CUDA 2.3 SDK on OS X 10.5.7 with the 2.3 CUDA driver.

Regards

Sunil

[codebox]~/GPU Computing/C/bin/darwin/release $ ./bandwidthTest
Running on…
  device 0:GeForce 8800 GT

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 189.2

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 1551.7

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 47118.6

&&&& Test PASSED

Press ENTER to exit…
~/GPU Computing/C/bin/darwin/release $
[/codebox]

How do things look when you run:

./bandwidthTest --memory=pinned

Hi,

Interesting: with --memory=pinned I get much better results in both directions. Can you shed some more light on this?

[codebox]~/GPU Computing/C/bin/darwin/release $ ./bandwidthTest --memory=pinned
Running on…
  device 0:GeForce 8800 GT

Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 5588.7

Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 4374.4

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 47122.8

&&&& Test PASSED[/codebox]

Pinned memory is a chunk of host memory that the CUDA driver has marked as "unmovable" to the operating system. The OS is not allowed to relocate it in physical memory (virtual address translation normally makes this possible without invalidating pointers) or swap the block out to disk.

This is important because memory transfers between the host and the CUDA device are done with DMA transactions, which require the host memory block to have a fixed physical address. If you do not use pinned memory, the driver instead stages the transfer through a private pinned buffer of its own: each chunk (possibly smaller than your request) is copied between your pageable block and that staging buffer, and DMA moves it between the staging buffer and the device, repeating until the whole transfer is complete. The overhead of the extra copy makes non-pinned (aka "pageable") transfers much slower than pinned transfers on many systems, usually about half speed. (The one exception is the triple-channel Core i7 systems, which have so much memory bandwidth that you barely notice the difference between pinned and pageable memory at all.)

Your extremely poor host-to-device bandwidth in the pageable case suggests something is very wrong with that two-stage copy process on your system. I don't know what would cause it. However, if you use pinned memory as a workaround until you figure that out, you'll be fine. (And your memory transfers will be faster on other systems as well.)

The CUDA programming guide has more info on pinned (also called “page-locked”) memory.
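In code, the difference is just which allocator you use for the host buffer. Here is a minimal sketch (my own illustration, not taken from bandwidthTest; the 32 MB size and the lack of error checking are just for brevity) that allocates one pageable and one pinned host buffer and copies each to the device:

[codebox]#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 32 * 1024 * 1024;  /* 32 MB, the size bandwidthTest uses in quick mode */
    float *d_buf, *h_pageable, *h_pinned;

    cudaMalloc((void **)&d_buf, bytes);

    /* Pageable host memory: a plain malloc; the driver has to stage the
       transfer through its own internal pinned buffer. */
    h_pageable = (float *)malloc(bytes);

    /* Pinned (page-locked) host memory: the driver can DMA to/from it directly. */
    cudaMallocHost((void **)&h_pinned, bytes);

    /* The copy calls are identical; only the host allocation differs. */
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}[/codebox]

cudaFreeHost() releases the page-locked allocation. Keep in mind that pinned memory is a limited resource: pinning very large buffers takes physical pages away from the OS, so it is best reserved for the buffers you actually stream across PCIe.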

I'm getting the same thing on my Mac Pro with OS X 10.5.7, CUDA 2.3, and the 2.3 driver. This seems to be a bug, since my MacBook Pro (8600M) running CUDA 2.1 gets about 10x better host-to-device performance for pageable memory.

Here is what my MacBook Pro shows.

[codebox]Running on…
 device 0:GeForce 8600M GT

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 1132.4

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 1006.8

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 9646.9

&&&& Test PASSED

Press ENTER to exit…
[/codebox]