cudaSafeCall() Runtime API error 11

faabiioo · March 8, 2012, 10:43am

Hi,

I've been stuck on this error for almost 24 hours and can't solve it.

here is the situation:

page-locked host memory allocation

bytesS = numOfStreams * size;
o_bytesS = numOfStreams * otherSize;

T* h_idata = NULL; // h_input
T* h_odata = NULL; // h_output

cutilSafeCall( cudaMallocHost((void**) &h_idata, bytesS) );
cutilSafeCall( cudaMallocHost((void**) &h_odata, o_bytesS) );

device memory allocation

T* d_idata = NULL; // d_input
T* d_odata = NULL; // d_output

cutilSafeCallNoSync( cudaMalloc((void**) &d_idata, bytesS) );
cutilSafeCallNoSync( cudaMalloc((void**) &d_odata, o_bytesS) );

streams creation

cudaStream_t *stream = (cudaStream_t*) malloc( numOfStreams*sizeof(cudaStream_t) );

for(int i=0; i<numOfStreams; i++){
    cutilSafeCall( cudaStreamCreate(&(stream[i])) );
}

running…

for(int i=0; i<numOfStreams; i++){
    // here I get the error
    cutilSafeCall( cudaMemcpyAsync(d_idata + i*bytesS, h_idata + i*bytesS, bytesS, cudaMemcpyHostToDevice, stream[i]) );

    .... // launch kernel, etc.
}

The structure looks correct to me, and it follows the instructions given in the programming guide for stream handling and async copies, so I don’t really understand what is wrong here.

Device: GeForce GTX 285

Could somebody explain what the problem is?

Gilles_C · March 8, 2012, 10:54am

Hi,
Just out of curiosity, did you check which value i holds when you encounter the error?
Could it be due to a previously launched kernel inside your loop?

faabiioo · March 8, 2012, 11:33am

uhm, what do you mean by “which value it holds”?

this is how I “initialize” the input data:

for(int i=0; i<size; i++) {
    // Keep the numbers small
    h_idata[i] = (rand() & 0xFF) / (T)RAND_MAX;
}

‘size’ is the number of values upon which I have to operate

no other kernel is launched inside the loop. anyway, to avoid problems I already put a cudaDeviceSynchronize() right before the loop.

Gilles_C · March 8, 2012, 12:57pm

I was referring to the loop counter “i”. What is its value when you get the error?
Basically, what I want to know is if it’s at the first attempt to call cudaMemcpyAsync() that the error is detected, or if it’s during a subsequent attempt.
In the latter case, the error might not be due to cudaMemcpyAsync() itself, but rather to a previous, not yet detected error. For example, one of your kernels (launched inside the loop right after the call to cudaMemcpyAsync()) could have triggered the error.
Does that make sense?

faabiioo · March 8, 2012, 3:04pm

ok, I totally misunderstood your first reply!

anyway, this is a slightly modified version, which aims to allow each stream to process its portion of input data (recall bytes = size * sizeof(T);)

for(int i=0; i<numOfStreams; i++){
    cudaStreamSynchronize(stream[i]);

    printf("i_i: %d\n;", i);

    // here I get the error
    cutilSafeCall( cudaMemcpyAsync(d_idata + i*bytes, h_idata + i*bytes, bytes, cudaMemcpyHostToDevice, stream[i]) );

    // ARE dimensions right? recall I allocated 'numOfStreams*bytes' space. so copying 'bytes' amount of data for
    // each stream should do the work. isn't it?

    .... // launch kernel, etc.
}

adding some prints, it turns out that the first attempt works fine (prints “i_i: 0”); the error is detected at the second attempt, because it prints “i_i: 1;” and then reports the error.

the sync on the stream doesn’t change much!
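For the copy flagged in this post, it may help to work out the byte range each iteration touches. The sketch below is host-only arithmetic, not CUDA code: `copyFitsInAllocation` and `fitsWithByteOffset` are hypothetical helpers, and `float`, `size = 1024` and `numOfStreams = 2` are assumed values, since the thread does not state them. It treats `i*bytes` the way the compiler does when it is added to a `T*`, that is, as an element offset.

```cpp
#include <cstddef>

// Hypothetical helper (not part of CUDA): does a copy that starts
// offsetElems elements past the base of a T* buffer and moves countBytes
// bytes stay inside an allocation of allocBytes bytes?
template <typename T>
bool copyFitsInAllocation(std::size_t offsetElems, std::size_t countBytes,
                          std::size_t allocBytes) {
    // Pointer arithmetic on T* advances by sizeof(T) bytes per element.
    const std::size_t firstByte = offsetElems * sizeof(T);
    return firstByte + countBytes <= allocBytes;
}

// Assumed values for illustration; the thread does not state them.
constexpr std::size_t kSize = 1024;                        // elements per stream
constexpr std::size_t kNumOfStreams = 2;
constexpr std::size_t kBytes = kSize * sizeof(float);      // bytes per stream
constexpr std::size_t kAllocBytes = kNumOfStreams * kBytes;

// Offsets as written in the post: i * bytes, applied to a T* pointer.
inline bool fitsWithByteOffset(std::size_t i) {
    return copyFitsInAllocation<float>(i * kBytes, kBytes, kAllocBytes);
}
```

Under these assumptions the range fits for i == 0 and runs past the end of the allocation for i == 1, which would be consistent with the first copy succeeding and the second being rejected. The thread does not confirm the actual values, so this is one plausible reading rather than a diagnosis.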

Gilles_C · March 8, 2012, 3:18pm

So now that you know that i==1 when the error is detected, you might check whether the error is already set prior to calling cudaMemcpyAsync().
Just call cudaGetLastError() right before it and check its result. This way you’ll know if the error actually comes from your call to cudaMemcpyAsync().
Does that make sense to you?

faabiioo · March 8, 2012, 4:39pm

uhm, that’s the reason why I use the ‘cudaSafeCall’ wrapper: it tells me exactly that at row X (the row above) there’s an error.

Precisely, it says:

“file.cpp(321) : cudaSafeCall() Runtime API error 11: invalid argument.”

so I know that the error comes from there.

and actually, error 11 is a “cudaErrorInvalidValue”, which means that “…one or more of the parameters passed to the API call is not within an acceptable range of values.”

I do understand what that error means, but still cannot figure out where the mistake is.

btw, thanks for your concern.

tera · March 8, 2012, 4:51pm

Not necessarily, as Gilles_C has been trying to point out in the past few posts. If you look at the documentation for cudaMemcpyAsync(), it says “Note that this function may also return error codes from previous, asynchronous launches.”

faabiioo · March 8, 2012, 5:09pm

Ok, you’re right.

I tried with cudaGetLastError, indeed, but it did not report anything.

faabiioo · March 9, 2012, 4:25pm

ok, now that’s even more annoying because I solved it by trial and error but I’m not really aware of what I’ve done. Or better, I know what I’ve done but I’m not sure I understand what the problem was.

so, to recall something:

// host side memory size
unsigned int bytes = size * sizeof(T);
unsigned int bytesS = numOfStreams * bytes;

// device memory size
unsigned int o_bytes = numBlocks*sizeof(T);
unsigned int o_bytesS = numOfStreams*(o_bytes);

// allocate page-locked host memory
T* h_idata = NULL;
T* h_odata = NULL;
cutilSafeCall( cudaMallocHost((void**) &h_idata, bytesS) );
cutilSafeCall( cudaMallocHost((void**) &h_odata, o_bytesS) );

// allocate device memory and data
T* d_idata = NULL;
T* d_odata = NULL;
cutilSafeCall( cudaMalloc((void**) &d_idata, bytesS) );
cutilSafeCall( cudaMalloc((void**) &d_odata, o_bytesS) );

// array of streams handles
cudaStream_t *stream = (cudaStream_t*) malloc(numOfStreams*sizeof(cudaStream_t));

for(int i=0; i<numOfStreams; i++){
    cutilSafeCall( cudaStreamCreate(&(stream[i])) );
}

// run kernels on streams - NOW WORKING!
for(int i=0; i<numOfStreams; i++){
    cudaStreamSynchronize(stream[i]);

    cutilSafeCall( cudaMemcpyAsync(d_idata + i * size, h_idata + i * size, size, cudaMemcpyHostToDevice, stream[i]) );
    cutilSafeCall( cudaMemcpyAsync(d_odata + i * numBlocks, h_odata + i * numBlocks, numBlocks, cudaMemcpyHostToDevice, stream[i]) );

    reduceS<T>(size, numThreads, numBlocks, d_idata, d_odata, stream[i]);

    .... // rest of the code
}

basically what I changed is the quantity of bytes copied and the offsets, which are now ruled by ‘size’, that is the number of elements upon which I work. I don’t understand why the previous quantities were wrong and caused the problem!

maybe it’s a dumb question but…

Gilles_C · March 9, 2012, 7:44pm

The problem is that you are mixing up the size of the data to transfer in bytes and the number of elements to transfer.

The pointer arithmetic should be expressed in number of elements, and the argument giving the size to transfer should be expressed in bytes.

The line should become (if I read the code correctly, which I’m not sure of since the various names look misleading to me):

cutilSafeCall( cudaMemcpyAsync(d_idata + i * size, h_idata +i * size, bytes, cudaMemcpyHostToDevice, stream[i]) );
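The rule in this reply (offsets in elements, count in bytes) can be checked on the host without a GPU. In the sketch below, `std::memcpy` is only a stand-in for `cudaMemcpyAsync`, and `copyChunk` and `allChunksCopied` are hypothetical helpers that do not appear in the thread; the chunk layout (one contiguous run of `size` elements per stream) is assumed from the posted code.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only stand-in for the per-stream copy: std::memcpy replaces
// cudaMemcpyAsync so the indexing can be checked without a GPU.
// copyChunk is a hypothetical helper, not a CUDA API.
template <typename T>
void copyChunk(T* dst, const T* src, std::size_t chunk, std::size_t size) {
    const std::size_t bytes = size * sizeof(T);  // count argument: bytes
    // Offsets are in elements: T* arithmetic already scales by sizeof(T).
    std::memcpy(dst + chunk * size, src + chunk * size, bytes);
}

// Copies numOfStreams chunks of `size` floats each and reports whether
// the destination ends up identical to the source.
inline bool allChunksCopied(std::size_t numOfStreams, std::size_t size) {
    std::vector<float> src(numOfStreams * size);
    std::vector<float> dst(numOfStreams * size, -1.0f);
    for (std::size_t k = 0; k < src.size(); ++k) {
        src[k] = static_cast<float>(k);
    }
    for (std::size_t i = 0; i < numOfStreams; ++i) {
        copyChunk(dst.data(), src.data(), i, size);
    }
    return src == dst;
}
```

With this indexing the chunks tile the buffer exactly: chunk i covers elements [i*size, (i+1)*size), so nothing is skipped and nothing overlaps.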

faabiioo · March 9, 2012, 9:28pm


actually it works either way, but I do understand what you say. and that could also be the solution for another problem I encountered.

I mean it works fine either with size as ‘count’ parameter or with bytes

thanks a lot

tera · March 9, 2012, 9:39pm

It certainly doesn’t, unless you happen to test with a type whose size happens to be one byte (‘char’ comes to mind).
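A small host-side experiment makes this concrete. It is a sketch only: `std::memcpy` stands in for `cudaMemcpyAsync`, and `elementsCopied` is a hypothetical helper that counts how many leading elements of a `size`-element chunk actually arrive when a given value is passed as the byte count.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Counts how many leading elements of a `size`-element chunk are actually
// transferred when countBytes is passed as the byte count. std::memcpy is a
// host-only stand-in for cudaMemcpyAsync; the helper name is hypothetical.
// Precondition: countBytes <= size * sizeof(T).
template <typename T>
std::size_t elementsCopied(std::size_t size, std::size_t countBytes) {
    std::vector<T> src(size, T(1));  // source filled with ones
    std::vector<T> dst(size, T(0));  // destination starts as zeros
    std::memcpy(dst.data(), src.data(), countBytes);
    std::size_t n = 0;
    while (n < size && dst[n] == T(1)) {
        ++n;
    }
    return n;
}
```

For a 4-byte `float`, passing `size` as the count moves only a quarter of the chunk, while `size * sizeof(float)` moves all of it; for `char` the two are the same. The rest of the destination keeps whatever it held before, so a run can appear to work when the results are not checked.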
