Single cudaMemcpy across multiple allocations

Hi there,

I came across an interesting behaviour recently that I wanted to ask about.

In our code we make use of jagged arrays quite extensively. We have some dedicated code that deals with the allocation, management and copies of such jagged arrays of objects.

The way we put such jagged arrays into memory, the “inner elements” of the arrays may or may not be forced to sit side-by-side in memory. (In some situations we implement jagged arrays simply as std::vector<std::vector<T> > objects in our host code, in which case it’s generally not guaranteed how the memory would be laid out. But in some cases we use explicit memory handling for ensuring that we would have everything in a single blob. It’s a bit of a longer story…) Because of this, the code that we wrote for copying jagged arrays between host and device memory tries to minimise the number of copies that it would have to do. With “ideal” memory layout it would only have to do one cudaMemcpy(...) call to perform the copy. But the code is implemented in a bit more of a general way. As it iterates over the “inner elements” of the array, it checks if they are continuous in memory, and performs the minimum number of copies that it can.

Now… Even when we use an std::vector<std::vector<T> > object in host code to allocate the memory for our jagged array, by chance it may happen that two neighboring “inner elements” would end up side-by-side in memory. For very particular “inner element” sizes this has a high chance of happening. (Basically when the allocation size is divisible by the 0x400 alignment of CUDA allocations.) In this case our copy code would decide to issue a single cudaMemcpy(...) command that would “cover” both allocations. (The endpoint of these copies is always inside of a big blob of allocation inside of device memory. Again, it’s a bit of a longer story…)

And it turns out that cudaMemcpy(...) doesn’t like this. We get an “invalid argument” error from it for that copy operation.

Is this expected? Is the following expected to fail?

// CUDA include(s).
#include <cuda_runtime.h>

// System include(s).
#include <iostream>

// Helper macro for checking the CUDA call return values.
#define CHECK_CUDA(EXP)                                              \
    do {                                                             \
        cudaError_t ec = EXP;                                        \
        if( ec != cudaSuccess ) {                                    \
            std::cerr << __FILE__ << ":" << __LINE__                 \
                      << " Failed to execute \"" << #EXP << "\" ("   \
                      << cudaGetErrorString( ec ) << ")"             \
                      << std::endl;                                  \
            return ec;                                               \
        }                                                            \
    } while( false )

int main() {

    int *host_a = nullptr, *host_b = nullptr, *device = nullptr;
    CHECK_CUDA( cudaMallocHost( &host_a, 0x400 ) );
    CHECK_CUDA( cudaMallocHost( &host_b, 0x400 ) );
    if( ( reinterpret_cast< char* >( host_a ) + 0x400 ) !=
          reinterpret_cast< char* >( host_b ) ) {
        std::cout << "The two allocations are not neighbours in "
                  << "memory. :-/" << std::endl;
        return 0;
    CHECK_CUDA( cudaMalloc( &device, 0x800 ) );
    CHECK_CUDA( cudaMemcpy( device, host_a, 0x800,
                            cudaMemcpyHostToDevice ) );

    return 0;

Because it seems to reliably fail for me.

[bash][thor]:build > make
[ 50%] Building CXX object CMakeFiles/cudaMultiAlloc.dir/cudaMultiAlloc.cxx.o
[100%] Linking CXX executable cudaMultiAlloc
[100%] Built target cudaMultiAlloc
[bash][thor]:build > ./cudaMultiAlloc 
/home/krasznaa/Development/cuda/CUDAMultiAlloc/cudaMultiAlloc.cxx:33 Failed to execute "cudaMemcpy( device, host_a, 0x800, cudaMemcpyHostToDevice )" (invalid argument)
[bash][thor]:build >

In the end I can easily accept that this is just how the CUDA runtime behaves. I just wanted to double-check that I’m interpreting things correctly here…


P.S. If we allocate host code using malloc(...), such issues never happen. Though that may be just because malloc(...) is less prone to placing two separate allocations side-by-side in memory. :-/

It’s not allowed. You should only pass arguments to cudaMemcpy which are valid. The size of the transfer must be consistent with the sizes of the allocated pointers that you passed. The fact that another allocated pointer happens to be nearby is not accounted for by the runtime API.

Yeah, that’s what I gathered.

But just to clarify: The “copy code”, as I described above, doesn’t know if the pointers that it sees are coming from a single cudaMalloc(...) call or from two different ones. (Since in our code we use single big allocations in some cases.)

So I guess we’ll need to propagate this info in some way to the “copy code” whether it is allowed to look for contiguous memory areas on a particular copy operation or not…

You can always file a bug if you wish (I would call it a request for enhancement, RFE, because I’m fairly certain the behavior you describe is, for that API call, “working as designed”).

I don’t believe the detailed behavior here is documented, but what I have observed is what I stated. cudaMemcpy certainly checks any pointers you pass that were already registered with the CUDA runtime, for validity (1. Is the pointer in the right space for the transfer kind 2. Does the pointer + transfer length define a region that is included in a single call to an allocation API such as cudaHostAlloc or cudaMalloc).

I’m fairly sure that sort of error checking is considered useful in a large variety of places and end-users, and I must say from my own personal experience, this process of looking for adjacency in independent allocations is not something I have come across. It strikes me as unusual.

So I doubt the CUDA designers would be ready to just drop the error checking that I believe is in place, and see what happens. However I’m sure there is a way to inspect all this, and follow some decision logic to see if every byte of a requested transfer can be found in some allocation request somewhere. You can imagine pathological situations, I think.

Another case is where the host side allocation is created with an API call (such as new, malloc, std::vector, etc.) that the runtime has no “ordinary” visibility into,. I don’t know what it does in this case, and in any event, as I stated already, I believe all of these details are undocumented.

In the current setting, the safe/expected thing is for each requested transfer region in its entirety to belong to a single allocation request that can be associated with that pointer.