I’m seeing some very strange behavior with cuMemcpy2D/ cuMemcpy3D when copying elements with a pitch that spans unmapped pages in the virtual address space.
I have a two dimensional buffer that spans a virtual address range of width * height = 2 * 3blocks, where each block has the size of one page (i.e., CU_MEM_ALLOC_GRANULARITY_MINIMUM, usually 2 MiB, I think). The address range is partially backed by physical allocations, shown as green above.
I now want to copy parts of said buffer somewhere (doesn’t really matter where, to the host, or another buffer). The copied parts are shown with a thick black outline. Now there’s several different cases:
Case 1 copies the blocks (0, 0) and (0, 1), which aren’t contiguous in memory, however they lie within a span of addresses that is fully mapped to physical allocations. This works as expected.
Cases 3 and 4 copy blocks (0, 0) / (0, 1) and (0, 1) / (0, 2), respectively. While the pitch between blocks is the same as in case 1, there is one block between the two that is not allocated. This does not work; the call to cuMemcpy2D simply fails with CUDA_ERROR_INVALID_VALUE.
Case 2 is where it gets strange: Here we again copy two blocks that are separated by one unmapped page. However it works, as long as block (1, 0) (which is not in the copied range!) is allocated. This makes no sense to me.
You can find a small example program that demonstrates this here.
Given the behavior of cases 1, 3 and 4, I might have assumed this to be a limitation of the memcpy APIs, in that the pitch may not span across unmapped pages (although I hope it is not, as this would severely limit the usefullness of CUDA virtual memory in my opinion). Given that I can seemingly cross an unmapped page however in certain situations, I frankly don’t understand at all what is going on.
PS: On a related note, I’ve also found that the various copy functions will fail with CUDA_ERROR_INVALID_VALUE if the address at srcDevice or dstDevice is not mapped. This can be worked around by manually computing the actual starting pointer of the copy and setting the offsets (srcXInBytes, srcY etc) to zero.
What made you think this should work? Could you point to relevant chapter & verse of the documentation? My expectation would be that such attempts would fail, including your case 2. But maybe I am not up to speed on the latest specifications.
Off-hand, I cannot explain the difference in behavior with regard to case 2. There may be a bug in the functions’ plausibility check, actual page sizes may not be what you think they are, or allocation / mapping may be handled with a granularity differing from page size.
@Robert_Crovella may have better insights into the driver API, so I would suggest waiting for his feedback. At present, I would say filing a bug looks like a good idea. This may at minimum result in clarification being added to the documentation, and could possibly lead to enhancements to the functions’ plausibility check to avoid surprises due to a (seeming) lack of consistency.
Obviously I have no idea how the copy engine works in practice, but I would assume that at some level (be it the driver, hardware scheduler, etc.) a 2D/3D copy boils down to a loop of repeated 128-bit load/store instructions over all addresses computed to be in range. Why would it matter that some intermediate addresses – which are never passed to such an instruction – have an entry in the page table or not?
For the “strange” case 2, if I modify the test so that the base pointer contains the offset that would normally be indicated by copy_src_y and set the srcY parameter to 0 instead, I convert that passing case to a failing one (“invalid argument”).
Just pure conjecture: The library function call attempts to determine if the range of space covered by the copy operation is allocated, and if not it returns the invalid argument error. The determination is not taking into account the y starting offset.
If that conjecture turned out to be the case, then I think it would be a bug to not take into account the y starting offset.
Otherwise, I have no comment on the underlying topic (which I think is probably the important one from OP’s perspective): Is it necessary to have a fully contiguous allocation spanning the entire range of the indicated copy, even if the hopscotch nature of the 2D copy would not actually touch the unallocated space?
I don’t know the answer to that. I agree with njuffa that you may wish to file a bug if you want to get beyond these rudimentary observations and guesses.
(As a further test case for copy_src_y being ignored in the validity check, if I take the passing test 2 case, and change copy_src_y to 2 instead of 1, I do not get an “invalid argument” error but instead get an unspecified launch failure from the cuMemcpy2D).
No update on the bug report yet, but in case anyone stumbles upon this, we’ve found somewhat of a workaround that enables this kind of copy: Simply alias all of the unallocated pages to a single physical memory allocation.
As it turns out these mappings can’t use CU_MEM_ACCESS_FLAGS_PROT_NONE as CUDA will still complain otherwise. This again seems strange to me and might be yet another indicator that the feasibility check is being too overzealous here.
If I may offer some advice. You might wish to be fairly specific about your desired outcome in the bug (You can always add more comments to the bug.) Here is what I mean. Based on your posting here and what I see in the bug (as well as some additional internal comments in the bug by our dev team that you can’t see) it appears that you have primarily asked for an explanation of case 2. It’s possible that the dev team might “fix” the validity checking and convert case 2 from a passing case to a failing case, and mark the bug as fixed.
If I had to guess (and I made this comment above already), what you’d really like is for cases 3 and 4 to start working. Case 2 is just an observation.
If that is the case, you might want to state that clearly, just as I did in a comment above, to wit:
Then follow that with:
“what I would really like is for cases 3 and 4 to work”
You can do as you wish of course, and I may have misinterpreted your intentions.
You are right, I do want cases 3 and 4 to work, and I was being vague about it (although I did say “I want to copy parts of said buffer […]” and “I hope [this is a bug and not a limitation]”).
I initially came to the forums mostly in hope to get clarification on the intended behavior of the API, as the documentation and the catch-all “invalid argument” error really leave you guessing here.
Thanks for the suggestion and the insight on the internal comments though! I will state my desired outcome in the bug report.