page-locked memory: alignment? reason: inconsistent results for memcopy

Hi folks!

I’m wondering what the requirements are to make (most) effective use of page-locked memory in terms of size/alignment of that memory.
Does the size of the (1-D) vector in question have to be a multiple of something (like the page size) in order to achieve maximum bandwidth?
Oh, and I’m talking about host-to-device transfers, NOT about accessing global device memory within a kernel.

Let me explain my question: I was looking at host->dev memory transfer rates and, of course, tried to get the most out of them by using page-locked memory. Both the bandwidthTest example and the actual code I’m working on yield transfer rates of up to 3.2 GB/s. That value, however, varies from run to run and even between individual memory sizes within a single “range mode” run of bandwidthTest. In fact, the saturated bandwidth (i.e. for memsize >= 10 MB) alternates between two distinct “levels” (3200 MB/s and 3020 MB/s). I attached a sample output of bandwidthTest (20e6…1000e6 bytes, delta = 20e6; see bw_pinned_singlerun.gif). If I repeat the test, it turns out that the locations of the “steps” in the curve are NOT systematic; they occur at different memsizes (or not at all).
My first assumption was that this is caused by the memory size not being a multiple of the page size (4096 bytes), but even when I make sure it is, I see the same behavior. I guess I’m still missing something. Does anybody have a helpful comment on that?
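For reference, a minimal sketch (not the actual bandwidthTest source) of how such a pinned host-to-device transfer can be timed, with the requested size rounded up to a multiple of the 4096-byte page size; the size and variable names are illustrative and error checking is omitted:

// Sketch: time one pinned host->device copy, size rounded up to a page multiple.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t pageSize = 4096;
    size_t nBytes = 20 * 1000 * 1000;                          // requested size (example)
    nBytes = ((nBytes + pageSize - 1) / pageSize) * pageSize;  // round up to a page multiple

    void *hPinned, *dBuf;
    cudaMallocHost(&hPinned, nBytes);   // page-locked host allocation
    cudaMalloc(&dBuf, nBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(dBuf, hPinned, nBytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%zu bytes: %.1f MB/s\n", nBytes,
           (nBytes / (1024.0 * 1024.0)) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dBuf);
    cudaFreeHost(hPinned);
    return 0;
}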
BTW: I’m using a C870 in an HP wx9400 running RHEL 4u5 64-bit, CUDA Toolkit 1.1, driver 169.09.

Thanks, Alex

Interesting results. I don’t have any answers for you, but one thought I had is that you might be seeing overhead from the windowing system (i.e. moving the mouse). Have you tried killing X windows and running your benchmark from the text console?

Actually, I ran the tests on a remote machine that I was logged into via ssh. The X server was idle (or at least not actively used) while I carried out my tests. I’ll try repeating the tests in runlevel 3.

Oh, and I think I have to shift the focus of my initial question a little: I was able to reproduce the bandwidth fluctuation with pageable memory as well. There I see ~1940 MB/s and ~1740 MB/s, respectively (again, non-systematic), so almost the same delta of 200 MB/s.

I.e., it can’t be the pinned memory alone. Then again, in the end copies from pageable memory use DMA transfers as well, after the data has been staged into page-locked memory, right? So might there be some alignment requirement on the DMA side?

Of course, it might be that my CUDA-based PCIe bus utilization conflicts with something else that was silently running on the machine (although it appeared to be idle), but then why would I observe only two discrete bandwidth values? Strange.
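To make the staging idea above concrete, here is a rough, hypothetical emulation of what a pageable copy presumably looks like internally (the staging buffer size and function name are made up; the real driver logic is of course not visible to us):

// Sketch: emulate a pageable host->device copy by staging through a pinned buffer.
#include <cstring>
#include <cuda_runtime.h>

void pageableCopyEmulated(void *dDst, const void *hPageable, size_t nBytes)
{
    const size_t chunk = 1 << 20;            // assumed 1 MB staging buffer
    void *hStaging;
    cudaMallocHost(&hStaging, chunk);        // pinned staging area

    for (size_t off = 0; off < nBytes; off += chunk) {
        size_t n = (nBytes - off < chunk) ? (nBytes - off) : chunk;
        memcpy(hStaging, (const char *)hPageable + off, n);                   // extra host copy
        cudaMemcpy((char *)dDst + off, hStaging, n, cudaMemcpyHostToDevice);  // DMA transfer
    }
    cudaFreeHost(hStaging);
}

That extra host-side memcpy is why pageable transfers top out well below the pinned numbers.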

Alex

The 9400 is an Opteron system.
How many processors do you have?
Try using “taskset -c 1 ./bandwidth” (or numactl) to pin the process to one processor; the OS may move your process around, and one of the Opterons has the fastest access to PCI-e.

Duh! Yes, that was the problem (and the solution). Binding the process to a specific CPU (CPU 0 in my case) yields a constant 3200 MB/s host->dev.
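In case it helps anyone else: the same binding can be done from inside the program instead of via taskset, e.g. with Linux’s sched_setaffinity (a minimal sketch; the helper name is made up):

// Sketch: pin the calling process to CPU 0, equivalent to "taskset -c 0 ./bandwidth".
#define _GNU_SOURCE
#include <sched.h>

void bindToCpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                        // CPU 0 is the one closest to PCI-e here
    sched_setaffinity(0, sizeof(set), &set); // pid 0 = calling process
}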

Thanks a lot!

Hrrm, that’s the problem with having too many resources (and not knowing how to use them properly).