Hi there,
I came across an interesting behaviour recently that I wanted to ask about.
In our code we make use of jagged arrays quite extensively. We have some dedicated code that deals with the allocation, management, and copying of such jagged arrays of objects.
The way we put such jagged arrays into memory, the “inner elements” of the arrays may or may not be forced to sit side-by-side in memory. (In some situations we implement jagged arrays simply as std::vector<std::vector<T> > objects in our host code, in which case the memory layout is generally not guaranteed. But in some cases we use explicit memory handling to ensure that everything ends up in a single blob. It’s a bit of a longer story…) Because of this, the code that we wrote for copying jagged arrays between host and device memory tries to minimise the number of copies that it has to do. With an “ideal” memory layout it only has to make one cudaMemcpy(...) call to perform the copy. But the code is implemented in a bit more general way: as it iterates over the “inner elements” of the array, it checks whether they are contiguous in memory, and performs the minimum number of copies that it can.
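To illustrate, the coalescing logic looks conceptually like this. (A simplified sketch with a hypothetical coalesce(...) helper, not our actual code; each returned region would then correspond to one cudaMemcpy(...) call.)

```cpp
// System include(s).
#include <cstddef>
#include <utility>
#include <vector>

/// Collect the minimal set of (pointer, byte-count) regions covering a
/// jagged array, merging "inner elements" that are adjacent in memory.
template< typename T >
std::vector< std::pair< const T*, std::size_t > >
coalesce( const std::vector< std::vector< T > >& jagged ) {

   std::vector< std::pair< const T*, std::size_t > > regions;
   for( const std::vector< T >& inner : jagged ) {
      if( inner.empty() ) {
         continue;
      }
      const T* begin = inner.data();
      const std::size_t bytes = inner.size() * sizeof( T );
      // If this inner array starts exactly where the previous region
      // ends, extend that region instead of starting a new one.
      if( ( ! regions.empty() ) &&
          ( reinterpret_cast< const char* >( regions.back().first ) +
            regions.back().second ==
            reinterpret_cast< const char* >( begin ) ) ) {
         regions.back().second += bytes;
      } else {
         regions.push_back( { begin, bytes } );
      }
   }
   return regions;
}
```

With an “ideal” layout every inner array is adjacent to the previous one, so this collapses everything into a single region.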
Now… Even when we use an std::vector<std::vector<T> > object in host code to allocate the memory for our jagged array, by chance it may happen that two neighbouring “inner elements” end up side-by-side in memory. For very particular “inner element” sizes this has a high chance of happening. (Basically when the allocation size is divisible by the 0x400 alignment of CUDA allocations.) In this case our copy code decides to issue a single cudaMemcpy(...) command that “covers” both allocations. (The destination of these copies is always inside one big blob allocated in device memory. Again, it’s a bit of a longer story…)
And it turns out that cudaMemcpy(...) doesn’t like this. We get an “invalid argument” error from it for that copy operation.
Is this expected? Is the following expected to fail?
// CUDA include(s).
#include <cuda_runtime.h>

// System include(s).
#include <iostream>

// Helper macro for checking the CUDA call return values.
#define CHECK_CUDA(EXP)                                         \
   do {                                                         \
      cudaError_t ec = EXP;                                     \
      if( ec != cudaSuccess ) {                                 \
         std::cerr << __FILE__ << ":" << __LINE__               \
                   << " Failed to execute \"" << #EXP << "\" (" \
                   << cudaGetErrorString( ec ) << ")"           \
                   << std::endl;                                \
         return ec;                                             \
      }                                                         \
   } while( false )

int main() {

   // Make two pinned host allocations, and check whether they happen to
   // have ended up side-by-side in memory.
   int *host_a = nullptr, *host_b = nullptr, *device = nullptr;
   CHECK_CUDA( cudaMallocHost( &host_a, 0x400 ) );
   CHECK_CUDA( cudaMallocHost( &host_b, 0x400 ) );
   if( ( reinterpret_cast< char* >( host_a ) + 0x400 ) !=
       reinterpret_cast< char* >( host_b ) ) {
      std::cout << "The two allocations are not neighbours in "
                << "memory. :-/" << std::endl;
      return 0;
   }

   // Try to copy both allocations to the device with a single call.
   CHECK_CUDA( cudaMalloc( &device, 0x800 ) );
   CHECK_CUDA( cudaMemcpy( device, host_a, 0x800,
                           cudaMemcpyHostToDevice ) );
   return 0;
}
Because it seems to reliably fail for me.
[bash][thor]:build > make
[ 50%] Building CXX object CMakeFiles/cudaMultiAlloc.dir/cudaMultiAlloc.cxx.o
[100%] Linking CXX executable cudaMultiAlloc
[100%] Built target cudaMultiAlloc
[bash][thor]:build > ./cudaMultiAlloc
/home/krasznaa/Development/cuda/CUDAMultiAlloc/cudaMultiAlloc.cxx:33 Failed to execute "cudaMemcpy( device, host_a, 0x800, cudaMemcpyHostToDevice )" (invalid argument)
[bash][thor]:build >
In the end I can easily accept that this is just how the CUDA runtime behaves. I just wanted to double-check that I’m interpreting things correctly here…
Cheers,
Attila
P.S. If we allocate the host memory using malloc(...), such issues never happen. Though that may be just because malloc(...) is less prone to placing two separate allocations side-by-side in memory. :-/