page-locked memory: alignment? reason: inconsistent results for memcopy

Hi folks!

I’m wondering what the requirements are to make (most) effective use of page-locked memory in terms of size/alignment of that memory.
Does the size of the (1-D) vector in question have to be a multiple of something (like the page size) in order to achieve maximum bandwidth?
Oh, and I’m talking about host-to-device transfers, NOT about accessing global device memory within a kernel.

Let me explain my question: I was looking at host->dev memory transfer rates and, of course, tried to get the most out of them by using page-locked memory. Both the bandwidthTest example and the actual code I’m working on yield transfer rates of up to 3.2 GB/s. That value, however, varies from run to run and even between individual memory sizes within a single “range mode” run of bandwidthTest. In fact, the saturated bandwidth (i.e. for memsize >= 10 MB) alternates between two distinct “levels” (3200 MB/s and 3020 MB/s). I attached a sample output of bandwidthTest (20e6…1000e6 bytes, delta = 20e6; see bw_pinned_singlerun.gif). If I repeat the test, it turns out that the locations of the “steps” in the curve are NOT systematic; they occur at different memsizes (or not at all).
My first assumption was that this is caused by the memory size not being a multiple of the page size (4096 bytes), but even when I make sure it is, I see the same behavior. I guess I’m still missing something. Does anybody have a helpful comment on that?
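For reference, a minimal sketch (not the actual bandwidthTest source) of how such a pinned host-to-device transfer can be timed, with the requested size rounded up to a multiple of the 4096-byte page size; the size and variable names are illustrative and error checking is omitted:

// Sketch: time one pinned host->device copy, size rounded up to a page multiple.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t pageSize = 4096;
    size_t nBytes = 20 * 1000 * 1000;                          // requested size (example)
    nBytes = ((nBytes + pageSize - 1) / pageSize) * pageSize;  // round up to a page multiple

    void *hPinned, *dBuf;
    cudaMallocHost(&hPinned, nBytes);   // page-locked host allocation
    cudaMalloc(&dBuf, nBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(dBuf, hPinned, nBytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%zu bytes: %.1f MB/s\n", nBytes,
           (nBytes / (1024.0 * 1024.0)) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dBuf);
    cudaFreeHost(hPinned);
    return 0;
}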
BTW: I’m using a C870 in an HP wx9400 running RHEL 4u5 64-bit, CUDA Toolkit 1.1, driver 169.09.

Thanks, Alex

Interesting results. I don’t have any answers for you, but one thought I had is that you might be seeing overhead from the windowing system (i.e. moving the mouse). Have you tried killing X windows and running your benchmark from the text console?

Actually, I ran the tests on a remote machine that I was logged into via ssh. The X server was idle (or at least not actively used) while I carried out my tests. I’ll try repeating the tests in runlevel 3.

Oh, and I think I have to shift the focus of my initial question a little: I was able to reproduce the bandwidth fluctuation with pageable memory as well. There I see ~1940 MB/s and ~1740 MB/s, respectively (again, non-systematic), so almost the same delta of 200 MB/s.

I.e., it can’t be the pinned memory alone. Then again, in the end copies from pageable memory use DMA transfers as well, after the data has been staged into page-locked memory, right? So might there be some alignment requirement on the DMA side?

Of course, it might be that my CUDA-based PCIe bus utilization conflicts with something else that was silently running on the machine (although it appeared to be idle), but then why would I observe only two discrete bandwidth values? Strange.
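To make the staging idea above concrete, here is a rough, hypothetical emulation of what a pageable copy presumably looks like internally (the staging buffer size and function name are made up; the real driver logic is of course not visible to us):

// Sketch: emulate a pageable host->device copy by staging through a pinned buffer.
#include <cstring>
#include <cuda_runtime.h>

void pageableCopyEmulated(void *dDst, const void *hPageable, size_t nBytes)
{
    const size_t chunk = 1 << 20;            // assumed 1 MB staging buffer
    void *hStaging;
    cudaMallocHost(&hStaging, chunk);        // pinned staging area

    for (size_t off = 0; off < nBytes; off += chunk) {
        size_t n = (nBytes - off < chunk) ? (nBytes - off) : chunk;
        memcpy(hStaging, (const char *)hPageable + off, n);                   // extra host copy
        cudaMemcpy((char *)dDst + off, hStaging, n, cudaMemcpyHostToDevice);  // DMA transfer
    }
    cudaFreeHost(hStaging);
}

That extra host-side memcpy is why pageable transfers top out well below the pinned numbers.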

Alex

The 9400 is an Opteron system.
How many processors do you have?
Try using “taskset -c 1 ./bandwidth” (or numactl) to pin the process to one processor; the OS may move your process around, and one of the Opterons has the fastest access to PCI-e.

Duh! Yes, that was the problem (and the solution). Binding the process to a specific CPU (CPU 0 in my case) yields a constant 3200 MB/s host->dev.
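In case it helps anyone else: the same binding can be done from inside the program instead of via taskset, e.g. with Linux’s sched_setaffinity (a minimal sketch; the helper name is made up):

// Sketch: pin the calling process to CPU 0, equivalent to "taskset -c 0 ./bandwidth".
#define _GNU_SOURCE
#include <sched.h>

void bindToCpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                        // CPU 0 is the one closest to PCI-e here
    sched_setaffinity(0, sizeof(set), &set); // pid 0 = calling process
}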

Thanks a lot!

Hrrm, that’s the problem with having too many resources (and not knowing how to use them properly).