bandwidthTest results on an FX4600

I'm new to the CUDA scene and am trying to find out what memory data rates I can
achieve on an FX4600. I'm using the bandwidthTest example to do this, but seem to
get rather low rates. I'm logged in remotely to the test host, so the FX4600
is not being used graphically.

I’ve run the test with and without pinned memory, and in --range mode:

tsh> ./bandwidthTest --mode=range --start=2079512 --end=33554432 --increment=2079512
Range Mode
Host to Device Bandwidth for Pageable memory

Transfer Size (Bytes) Bandwidth(MB/s)
2079512 602.4
4159024 642.0
6238536 657.5
8318048 663.7
10397560 667.4
12477072 671.9
14556584 672.6
16636096 674.6
18715608 677.6
20795120 678.2
22874632 678.7
24954144 679.4
27033656 680.6
29113168 681.0
31192680 680.8
33272192 681.1

Range Mode
Device to Host Bandwidth for Pageable memory

Transfer Size (Bytes) Bandwidth(MB/s)
2079512 665.7
4159024 738.4
6238536 762.8
8318048 778.0
10397560 785.8
12477072 792.4
14556584 798.1
16636096 799.8
18715608 803.2
20795120 804.8
22874632 807.4
24954144 808.7
27033656 809.6
29113168 811.0
31192680 811.1
33272192 812.1

Range Mode
Device to Device Bandwidth

Transfer Size (Bytes) Bandwidth(MB/s)
2079512 24992.8
4159024 25490.7
6238536 26787.6
8318048 26986.6
10397560 27129.6
12477072 27201.0
14556584 26242.4
16636096 27325.9
18715608 27090.5
20795120 27422.3
22874632 27426.4
24954144 27439.3
27033656 27495.7
29113168 25253.1
31192680 27517.4
33272192 27480.9

&&&& Test PASSED

Press ENTER to exit…

tsh> ./bandwidthTest --memory=pinned --mode=range --start=2079512 --end=33554432 --increment=2079512
Range Mode
Host to Device Bandwidth for Pinned memory

Transfer Size (Bytes) Bandwidth(MB/s)
2079512 740.8 <<< 2MBytes (2*1024*1024)
4159024 745.9
6238536 747.4
8318048 748.0
10397560 748.7
12477072 749.0
14556584 749.3
16636096 749.4
18715608 749.5
20795120 749.6
22874632 749.8
24954144 749.8
27033656 749.9
29113168 749.9
31192680 750.0
33272192 750.1

Range Mode
Device to Host Bandwidth for Pinned memory

Transfer Size (Bytes) Bandwidth(MB/s)
2079512 848.9
4159024 855.3
6238536 857.6
8318048 858.5
10397560 859.1
12477072 859.5
14556584 859.9
16636096 860.1
18715608 860.2
20795120 860.4
22874632 860.5
24954144 860.6
27033656 860.6
29113168 860.8
31192680 860.8
33272192 860.9

Range Mode
Device to Device Bandwidth

Transfer Size (Bytes) Bandwidth(MB/s)
2079512 24992.8
4159024 25523.5
6238536 26769.5
8318048 26963.7
10397560 27100.0
12477072 27213.4
14556584 26252.3
16636096 27311.8
18715608 27074.1
20795120 27412.8
22874632 27416.0
24954144 27434.6
27033656 27483.9
29113168 25253.1
31192680 27509.7
33272192 27483.3

&&&& Test PASSED

Press ENTER to exit…

Why do I get such low rates between host and device?
The FX4600 is plugged into an x16 PCIe slot.
If I do a simple 200MByte memcpy on the same host I get 2GBytes/sec.
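
For reference, the host memcpy figure comes from a trivial timing loop along these lines (a minimal sketch rather than the exact program; the buffer handling and gettimeofday timing here are just illustrative):

/* Minimal sketch of the host-side memcpy bandwidth check
 * (illustrative only -- buffer size and timing method are assumptions). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define NBYTES (200 * 1000 * 1000)   /* ~200 MBytes */

int main(void)
{
    char *src = malloc(NBYTES);
    char *dst = malloc(NBYTES);
    if (!src || !dst) { perror("malloc"); return 1; }
    memset(src, 1, NBYTES);          /* touch the pages so they are resident */
    memset(dst, 0, NBYTES);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    memcpy(dst, src, NBYTES);
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    printf("memcpy: %.1f MB/s\n", NBYTES / secs / 1e6);
    free(src);
    free(dst);
    return 0;
}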

Furthermore, if I try to lock those 200 MBytes in memory in the memcpy program, it fails because I'm limited to 32 KBytes by the current resource limits. If I log on as root and raise the limit with 'ulimit', the lock then succeeds. So how is the bandwidthTest program managing to do the lock in user mode without hitting the 32K limit?
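
The failing lock attempt is essentially this (again just a sketch of what my test program does, not the exact code):

/* Sketch of the mlock test that fails under the default 32 KByte
 * RLIMIT_MEMLOCK, but succeeds as root or with the limit raised. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define NBYTES (200 * 1000 * 1000)   /* ~200 MBytes */

int main(void)
{
    struct rlimit rl;
    getrlimit(RLIMIT_MEMLOCK, &rl);
    printf("RLIMIT_MEMLOCK: %ld bytes\n", (long)rl.rlim_cur);

    char *buf = malloc(NBYTES);
    if (!buf) { perror("malloc"); return 1; }

    if (mlock(buf, NBYTES) != 0)
        perror("mlock source failed");   /* ENOMEM when over the limit */
    else
        printf("mlock succeeded\n");

    free(buf);
    return 0;
}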

Any help much appreciated.
Terry.

This seems to be the same problem I (and some other people) have. Our video cards seem to be plugged into a 16x port and linked with 16 PCIe lanes (it's important to check that), but only get 4x bandwidth.

We were discussing that problem there… If you have any news, I'm interested. If an Nvidia specialist has an idea, a little help would be appreciated (they're the PCIe specialists here, I guess :P )

What's your machine, by the way, tsh? Mine is a quite old Dell Precision 370, and another poster also has an "old" Dell machine.

Apologies for the late reply. My mobo is an Intel D915GHA, and the manual says that the PCIe x16 slot can transfer 'up to 8GB/sec simultaneously', which I take to mean in full-duplex mode. So I don't know why my host/device transfers are so slow, but I bet it is a PCIe issue; the back-of-the-envelope numbers below are how I read that 8GB/sec figure.
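
(A rough sketch of my arithmetic; the per-lane rate is my assumption of PCIe 1.x at 2.5 GT/s with 8b/10b encoding, i.e. about 250 MB/s per lane per direction, not a number from the D915 manual.)

/* Back-of-the-envelope PCIe 1.x bandwidth table (assumed figures). */
#include <stdio.h>

int main(void)
{
    const double mb_per_lane = 250.0;          /* MB/s, one direction */
    const int widths[] = { 1, 4, 8, 16 };
    int i;

    for (i = 0; i < 4; i++)
        printf("x%-2d: %5.0f MB/s per direction, %5.0f MB/s both ways\n",
               widths[i], widths[i] * mb_per_lane,
               2.0 * widths[i] * mb_per_lane);
    return 0;
}

On that reading, 8GB/sec is the x16 figure with both directions added together, so about 4GB/sec is the one-way ceiling, and the ~750MB/s I measure with pinned memory is much closer to the x4 row than the x16 one.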
Experiments on a new Dell box (I don't know its model - I log in remotely), also with an FX4600 GPU, give much better results:

lg7_tsh> ./bandwidthTest --mode=range --start=200000000 --end=200000000 --increment=1 --memory=pinned
Range Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
200000000 2918.1

Range Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
200000000 3077.4

Range Mode
Device to Device Bandwidth

Transfer Size (Bytes) Bandwidth(MB/s)
200000000 19972238.0

&&&& Test PASSED

So I think I'll abandon my D915 lash-up. However, I'm still puzzled how the bandwidthTest program manages to lock 200 MBytes into memory in user mode, when a simple test program using mlock fails because of the resource-limit settings (but works as root), e.g.:
lg7_tsh> limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize unlimited
coredumpsize 0 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 32 kbytes
maxproc 139264
lg7_tsh> ./a.out
mlock source failed: Cannot allocate memory

lg7_tsh> su root
Password:
[tsh@lg7 tesla]# ./a.out
[tsh@lg7 tesla]#

What exactly is cudaMallocHost doing that enables it to breach the above 32Kbyte limit?
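
For completeness, the pinned path I'm comparing against looks roughly like this (a sketch only; sizes, names and the event-based timing are my own choices, but it is essentially what the SDK's bandwidthTest does for the pinned case). The cudaMallocHost call succeeds for the full 200 MBytes even with 'memorylocked 32 kbytes' in the shell limits above:

/* Sketch: pinned host allocation via cudaMallocHost plus a timed
 * host-to-device copy.  Error checking trimmed for brevity. */
#include <stdio.h>
#include <cuda_runtime.h>

#define NBYTES (200 * 1000 * 1000)   /* ~200 MBytes */

int main(void)
{
    void *h_buf = NULL;
    void *d_buf = NULL;

    /* Page-locked (pinned) host memory -- no mlock() call in my code,
     * and no "Cannot allocate memory" failure under the 32 KByte limit. */
    if (cudaMallocHost(&h_buf, NBYTES) != cudaSuccess) {
        printf("cudaMallocHost failed\n");
        return 1;
    }
    cudaMalloc(&d_buf, NBYTES);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, NBYTES, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host->device (pinned): %.1f MB/s\n", NBYTES / (ms * 1e-3) / 1e6);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}

Whatever the driver is doing to pin those pages, it evidently doesn't go through the same accounting that my plain mlock call runs into.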

Achievable data rates from host to device are the #1 issue for me, so I need to be sure that I really understand what's going on everywhere…

Cheers,
Terry

Addendum to my previous message:
I just noticed that the device-to-device bandwidth on the lg7 machine above is shown as 19972238.0 MB/s, i.e. nearly 20 TBytes/sec. Can this possibly be correct???

No. :)