I’m wondering what the requirements are for making the most effective use of page-locked memory, in terms of the size/alignment of that memory.
Does the size of the (1-D) vector in question have to be a multiple of something (like the page size) in order to achieve maximum bandwidth?
Oh, and I’m talking about host-to-device transfers, NOT about accessing global device memory within the kernel.
Let me explain my question: I was looking at host->dev memory transfer rates and, of course, tried to get the most out of them by using page-locked memory. Both the bandwidthTest example and the actual code I’m working on yield transfer rates of up to 3.2 GB/s. That value, however, varies from run to run and even between individual memory sizes within a single “range-mode” run of bandwidthTest. In fact, the saturated bandwidth (i.e. for memsize >= 10 MB) varies between two distinct “levels” (3200 MB/s and 3020 MB/s). I attached a sample output of bandwidthTest (20e6…1000e6 bytes, delta = 20e6; see bw_pinned_singlerun.gif). If I repeat the test, it turns out that the locations of the “steps” in the curve are NOT systematic; they occur at different memsizes (or not at all).
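In case it matters, here is roughly what my own measurement loop looks like (a minimal sketch, not the exact bandwidthTest code; buffer size, repetition count, and the MB/s conversion are just placeholders I picked for illustration):

```
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 100 * 1000 * 1000;   // example transfer size
    const int    reps  = 10;                  // average over several copies

    float *h_src = NULL;
    float *d_dst = NULL;
    cudaMallocHost((void**)&h_src, bytes);    // page-locked host buffer
    cudaMalloc((void**)&d_dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f MB/s\n",
           (double)bytes * reps / (1024.0 * 1024.0) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_dst);
    cudaFreeHost(h_src);
    return 0;
}
```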
My first assumption was that this is caused by the memory size not being a multiple of the page size (4096 bytes), but even when I make sure the size is a multiple of 4096, I see the same behavior. I guess I’m still missing something. Does anybody have a helpful comment on that?
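For reference, this is the kind of rounding I tried before allocating (the helper name is just mine for illustration, nothing from the SDK):

```
// Hypothetical helper: pad the requested transfer size up to the next
// multiple of the 4096-byte host page size before calling cudaMallocHost().
#include <stddef.h>

static size_t round_up_to_page(size_t bytes)
{
    const size_t page = 4096;                   // host page size on this system
    return ((bytes + page - 1) / page) * page;  // next multiple of 4096
}

// usage: cudaMallocHost((void**)&h_src, round_up_to_page(requested_bytes));
```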
BTW: I’m using a C870 in an HP wx9400 running RHEL 4u5 (64-bit), CUDA Toolkit 1.1, driver 169.09.