Error 11 in cudaMemcpyAsync (DtH) if *dst is allocated with cudaMallocHost

If I try to use memory allocated with cudaMallocHost(), the program crashes at cudaMemcpyAsync() with error 11 (invalid argument):

void* fOutBuffers[2];
cudaMallocHost((void**)&(fOutBuffers[0]), fNetwork->outputBufferSize(0) * sizeof(float));
cudaMallocHost((void**)&(fOutBuffers[1]), fNetwork->outputBufferSize(1) * sizeof(float));
// . . .
// error 11 here:
cudaMemcpyAsync((void*)(fOutBuffers[0]), bindings[1], fNetwork->outputBufferSize(0) * sizeof(float), cudaMemcpyDeviceToHost, fStream);
cudaMemcpyAsync((void*)(fOutBuffers[1]), bindings[2], fNetwork->outputBufferSize(1) * sizeof(float), cudaMemcpyDeviceToHost, fStream);

But!
If I allocate fOutBuffers without cudaMallocHost() everything works:

float* fOutBuffers[2];
fOutBuffers[0] = new float[fNetwork->outputBufferSize(0)];
fOutBuffers[1] = new float[fNetwork->outputBufferSize(1)];
// . . .
// works:
cudaMemcpyAsync((void*)(fOutBuffers[0]), bindings[1], fNetwork->outputBufferSize(0) * sizeof(float), cudaMemcpyDeviceToHost, fStream);
cudaMemcpyAsync((void*)(fOutBuffers[1]), bindings[2], fNetwork->outputBufferSize(1) * sizeof(float), cudaMemcpyDeviceToHost, fStream);

So what am I doing wrong?

Suggest checking the return value from cudaMallocHost as it may provide a reason for the failure. For debugging, I would initialize fOutBuffers[2] = { NULL } and then see if they are being changed.
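Something along these lines (just a sketch, using your names):

void* fOutBuffers[2] = { NULL, NULL };  // start NULL so a failed allocation is obvious
cudaError_t err = cudaMallocHost((void**)&fOutBuffers[0],
                                 fNetwork->outputBufferSize(0) * sizeof(float));
if (err != cudaSuccess)
    printf("cudaMallocHost[0] failed: %s\n", cudaGetErrorString(err));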

I use cudaMallocHost to create pinned memory shared between the host and device. Once created, both sides can access it directly, without helper functions like cudaMemcpyAsync. I'd imagine cudaMemcpyAsync would still work, so I suspect the issue is during allocation. If you resolve the allocation issue, I suspect you can simplify your subsequent usage. Good luck.


Return value is cudaSuccess for both cudaMallocHost calls.

I tried that, and the app crashes with an exception in nvcuda.dll before cudaMemcpyAsync returns any cudaError_t value.

To be honest, I didn’t understand that part. As far as I found, I have to use page-locked memory to increase the performance of cudaMemcpyAsync. Are you saying I shouldn’t use it and should keep the second variant of the code, without page-locking, as long as it works?

Ahh-- your bindings[1/2] must be previously allocated via cudaMalloc()? This is different from my case, where I use cudaMallocHost as a single shared buffer between host and device. My code is something like:

// example results structure
typedef struct {
  uint32_t volatile cnt;
  uint32_t gid[255];
} result_t;

// device side
__constant__ result_t *d_results;   // pointer to the pinned host buffer

// host side
result_t *results;
cudaMallocHost(&results, sizeof(result_t));                // allocate one result_t in pinned memory
cudaMemcpyToSymbol(d_results, &results, sizeof(results));  // copy the host pointer into the device symbol

// prior to kernel launch, clear result count on host
results->cnt = 0;

// on device, add a result (proper version uses atomicInc)
d_results->gid[0] = 12345;
d_results->cnt = 1;

// after kernel completion, look for results
if (results->cnt > 0)
  do_something_with(results->gid[0]);   // gid[0] now holds something useful

cudaMallocHost/cudaMemcpyToSymbol happen once, and then I just repeat cnt = 0 (host), populating results (device), and checking for cnt > 0 (host) after each kernel completes.
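A rough sketch of that per-launch loop (searchKernel, handleHit, and the launch parameters are placeholder names):

for (int pass = 0; pass < numPasses; pass++) {
    results->cnt = 0;                      // host clears the count in pinned memory
    searchKernel<<<blocks, threads>>>();   // device appends hits through d_results
    cudaDeviceSynchronize();               // make sure the kernel is done before reading
    if (results->cnt > 0)
        handleHit(results->gid[0]);        // consume whatever the kernel stored
}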

Have some other code that uses cudaMemcpy to return device memory. Will allocate the return buffer with cudaMallocHost and see what happens. Seems closer to your use case.


Correct. I forgot to mention that.

Weird thing №1: in sampleOnnxMNIST.cpp and its common\buffers.h (which I used as an example for my code), host memory is allocated only with plain malloc, even though the documentation says it is necessary to allocate page-locked memory for cudaMemcpyAsync.
Weird thing №2: I get cudaErrorInvalidValue only on the second cudaMemcpyAsync call; the first one returns cudaSuccess.
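I check the return of each call, roughly like this (just a sketch):

#define CHECK(call)                                                           \
    do {                                                                      \
        cudaError_t e = (call);                                               \
        if (e != cudaSuccess)                                                 \
            printf("%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e)); \
    } while (0)

CHECK(cudaMemcpyAsync(fOutBuffers[1], bindings[2],
                      fNetwork->outputBufferSize(1) * sizeof(float),
                      cudaMemcpyDeviceToHost, fStream));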

Updated some code that was fairly close to your second case (except using cudaMemcpy). Changed the destination buffer from local stack to cudaMallocHost allocated memory. Worked fine and gave the proper results.

My code is not structured for async operation at that point, but changed to cudaMemcpyAsync(stream=0) to see whether it would duplicate your crash. No crash for me, though incorrect results because my code attempted to use the data before the async completed.

Have you validated that fStream is correct? Maybe try passing zero to see if the default case works, something like the sketch below. That is all I can think of at this point.
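// rule out the stream: the legacy default stream (0) is always valid
cudaMemcpyAsync(fOutBuffers[0], bindings[1],
                fNetwork->outputBufferSize(0) * sizeof(float),
                cudaMemcpyDeviceToHost, 0);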


fStream is used for TensorRT execution with sync cudaMemcpy calls earlier; it is 100% correct.

And I think I’ve now found my mistake.
First, I messed up the index in the second cudaMallocHost call, allocating a size of fNetwork->outputBufferSize(0) for both buffers (the sizes are different; for [1] it is 4 times bigger). I found and fixed it in the code and in this topic, but only after I’d…
Messed up a second time by forgetting to change the values back in the second cudaMemcpyAsync call, after trying different approaches earlier to understand the mistake.
Now both places have the proper indexes and sizes, and it finally works.
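For clarity, the working version pairs each allocation with the matching copy:

cudaMallocHost((void**)&fOutBuffers[0], fNetwork->outputBufferSize(0) * sizeof(float));
cudaMallocHost((void**)&fOutBuffers[1], fNetwork->outputBufferSize(1) * sizeof(float));
// ...
cudaMemcpyAsync(fOutBuffers[0], bindings[1], fNetwork->outputBufferSize(0) * sizeof(float),
                cudaMemcpyDeviceToHost, fStream);
cudaMemcpyAsync(fOutBuffers[1], bindings[2], fNetwork->outputBufferSize(1) * sizeof(float),
                cudaMemcpyDeviceToHost, fStream);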

Big thanks for your time and sorry you had to spend it on my stupidity.

As one final test, I tried cudaMemcpyAsync with a stack-allocated destination. While the documentation says it should return an error (pageable destination), it worked for me. Looks like it reverts to synchronous cudaMemcpy (at least with the drivers I am running).

Random guess-- are bindings[1/2] supposed to be 0/1? If a typo, could see it making the second one fail.

They are correct. bindings[0] is the input buffer for the neural network; [1] and [2] are the two output buffers.
Also, see my previous message.

Excellent-- glad you got it resolved. Good luck with the rest of your project.

Thanks again. It is almost done: I had to fix one big issue and change the sync CUDA/TensorRT calls to async.
The first I did this Wednesday, and now I’ve finished the second.
Now I only have to test the project on a Jetson instead of my desktop, and if all’s good, I’m finally done.