Error 11 in cudaMemcpyAsync (DtH) if *dst is allocated with cudaMallocHost

If I try to use memory allocated with cudaMallocHost(), the program crashes at cudaMemcpyAsync() with error 11 (invalid argument):

void* fOutBuffers[2];
cudaMallocHost((void**)&(fOutBuffers[0]), fNetwork->outputBufferSize(0) * sizeof(float));
cudaMallocHost((void**)&(fOutBuffers[1]), fNetwork->outputBufferSize(1) * sizeof(float));
// . . .
// error 11 here:
cudaMemcpyAsync((void*)(fOutBuffers[0]), bindings[1], fNetwork->outputBufferSize(0) * sizeof(float), cudaMemcpyDeviceToHost, fStream);
cudaMemcpyAsync((void*)(fOutBuffers[1]), bindings[2], fNetwork->outputBufferSize(1) * sizeof(float), cudaMemcpyDeviceToHost, fStream);

But!
If I allocate fOutBuffers without cudaMallocHost() everything works:

float* fOutBuffers[2];
fOutBuffers[0] = new float[fNetwork->outputBufferSize(0)];
fOutBuffers[1] = new float[fNetwork->outputBufferSize(1)];
// . . .
// works:
cudaMemcpyAsync((void*)(fOutBuffers[0]), bindings[1], fNetwork->outputBufferSize(0) * sizeof(float), cudaMemcpyDeviceToHost, fStream);
cudaMemcpyAsync((void*)(fOutBuffers[1]), bindings[2], fNetwork->outputBufferSize(1) * sizeof(float), cudaMemcpyDeviceToHost, fStream);

So what am I doing wrong?

Suggest checking the return value from cudaMallocHost as it may provide a reason for the failure. For debugging, I would initialize fOutBuffers[2] = { NULL } and then see if they are being changed.
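Something along these lines (just a sketch, using your names):

void* fOutBuffers[2] = { NULL, NULL };  // start NULL so a failed allocation is obvious
cudaError_t err = cudaMallocHost((void**)&fOutBuffers[0],
                                 fNetwork->outputBufferSize(0) * sizeof(float));
if (err != cudaSuccess)
    printf("cudaMallocHost[0] failed: %s\n", cudaGetErrorString(err));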

I use cudaMallocHost to create pinned memory shared between the host and device. Once created, both sides can access it directly, without helper functions like cudaMemcpyAsync. I'd imagine cudaMemcpyAsync would still work, so I suspect the issue is during allocation. If you resolve the allocation issue, I suspect you can simplify your subsequent usage. Good luck.


Return value is cudaSuccess for both cudaMallocHost calls.

I tried that, and the app crashes with an exception in nvcuda.dll before cudaMemcpyAsync returns any cudaError_t value.

To be honest, I didn’t understand that part. As far as I found, I have to use page-locked memory to increase the performance of cudaMemcpyAsync. Are you saying I shouldn’t use it and should keep the second variant of the code, without page-locking, as long as it works?

Ahh-- your bindings[1/2] must be previously allocated via cudaMalloc()? This is different from my case, where I use cudaMallocHost as a single shared buffer between host and device. My code is something like:

// example results structure
typedef struct {
  uint32_t volatile cnt;
  uint32_t gid[255];
} result_t;

// device side
__constant__ result_t *d_results;   // pointer to the pinned host buffer

// host side
result_t *results;
cudaMallocHost(&results, sizeof(result_t));                // allocate one result_t in pinned memory
cudaMemcpyToSymbol(d_results, &results, sizeof(results));  // copy the host pointer into the device symbol

// prior to kernel launch, clear result count on host
results->cnt = 0;

// on device, add a result (proper version uses atomicInc)
d_results->gid[0] = 12345;
d_results->cnt = 1;

// after kernel completion, look for results
if (results->cnt > 0)
  do_something_with(results->gid[0]);   // gid[0] now holds something useful

cudaMallocHost/cudaMemcpyToSymbol happen once, and then I just repeat cnt = 0 (host), populating results (device), and checking for cnt > 0 (host) after each kernel completes.
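A rough sketch of that per-launch loop (searchKernel, handleHit, and the launch parameters are placeholder names):

for (int pass = 0; pass < numPasses; pass++) {
    results->cnt = 0;                      // host clears the count in pinned memory
    searchKernel<<<blocks, threads>>>();   // device appends hits through d_results
    cudaDeviceSynchronize();               // make sure the kernel is done before reading
    if (results->cnt > 0)
        handleHit(results->gid[0]);        // consume whatever the kernel stored
}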

Have some other code that uses cudaMemcpy to return device memory. Will allocate the return buffer with cudaMallocHost and see what happens. Seems closer to your use case.


Correct. I forgot to mention that.

Weird thing №1: in sampleOnnxMNIST.cpp and its common\buffers.h (which I used as an example for my code), host memory is allocated only with plain malloc, even though the documentation says it is necessary to allocate page-locked memory for cudaMemcpyAsync.
Weird thing №2: I get cudaErrorInvalidValue only on the second cudaMemcpyAsync call; the first one returns cudaSuccess.
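I check the return of each call, roughly like this (just a sketch):

#define CHECK(call)                                                           \
    do {                                                                      \
        cudaError_t e = (call);                                               \
        if (e != cudaSuccess)                                                 \
            printf("%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e)); \
    } while (0)

CHECK(cudaMemcpyAsync(fOutBuffers[1], bindings[2],
                      fNetwork->outputBufferSize(1) * sizeof(float),
                      cudaMemcpyDeviceToHost, fStream));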

Updated some code that was fairly close to your second case (except using cudaMemcpy). Changed the destination buffer from local stack to cudaMallocHost allocated memory. Worked fine and gave the proper results.

My code is not structured for async operation at that point, but changed to cudaMemcpyAsync(stream=0) to see whether it would duplicate your crash. No crash for me, though incorrect results because my code attempted to use the data before the async completed.

Have you validated that fStream is correct? Maybe try passing zero to see if the default case works, something like the sketch below. That is all I can think of at this point.
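// rule out the stream: the legacy default stream (0) is always valid
cudaMemcpyAsync(fOutBuffers[0], bindings[1],
                fNetwork->outputBufferSize(0) * sizeof(float),
                cudaMemcpyDeviceToHost, 0);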


fStream is used for TensorRT execution with sync cudaMemcpy calls earlier; it is 100% correct.

And I think I’ve now found my mistake.
First, I messed up the index in the second cudaMallocHost call, allocating a size of fNetwork->outputBufferSize(0) for both buffers (the sizes are different; for [1] it is 4 times bigger). I found and fixed it in the code and in this topic, but only after I’d…
Messed up a second time by forgetting to change the values back in the second cudaMemcpyAsync call, after trying different approaches earlier to understand the mistake.
Now both places have the proper indexes and sizes, and it finally works.
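For clarity, the working version pairs each allocation with the matching copy:

cudaMallocHost((void**)&fOutBuffers[0], fNetwork->outputBufferSize(0) * sizeof(float));
cudaMallocHost((void**)&fOutBuffers[1], fNetwork->outputBufferSize(1) * sizeof(float));
// ...
cudaMemcpyAsync(fOutBuffers[0], bindings[1], fNetwork->outputBufferSize(0) * sizeof(float),
                cudaMemcpyDeviceToHost, fStream);
cudaMemcpyAsync(fOutBuffers[1], bindings[2], fNetwork->outputBufferSize(1) * sizeof(float),
                cudaMemcpyDeviceToHost, fStream);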

Big thanks for your time and sorry you had to spend it on my stupidity.

As one final test, I tried cudaMemcpyAsync with a stack-allocated destination. While the documentation says it should return an error (pageable destination), it worked for me. Looks like it reverts to synchronous cudaMemcpy (at least with the drivers I am running).

Random guess-- are bindings[1/2] supposed to be 0/1? If a typo, could see it making the second one fail.

They are correct. bindings[0] is the input buffer for the neural network; [1] and [2] are the two output buffers.
Also, see my previous message.

Excellent-- glad you got it resolved. Good luck with the rest of your project.

Thanks again. It is almost done: I had to fix one big issue and change the sync CUDA/TensorRT calls to async.
The first I did this Wednesday, and now I’ve finished the second.
Now I only have to test the project on a Jetson instead of my desktop, and if all’s good, I’m finally done.