Error in cudaMalloc before kernel launch with multiple threads in CPU

I have a function to launch kernell calls in a multithreaded application (with ONE single GPU). This is the function:

void Launch_Test(unsigned char *img_u,int width, int height, float threshold){
unsigned char *devimg;
unsigned char *devimgCopy;


size_t size=(1600 * 1200)*sizeof(unsigned char);

 checkCudaErrors(cudaMalloc((void **)&devimg, size));
 checkCudaErrors(cudaMalloc((void **)&devimgCopy, size));

 checkCudaErrors(cudaMemcpy(devimg, img_u, size, cudaMemcpyHostToDevice));
 checkCudaErrors(cudaMemcpy(devimgCopy, devimg, size, cudaMemcpyDeviceToDevice));

Test_kernel<<< dim3(100,75,1), dim3(16,16,1) >>>(1600, 1200, devimg, devimgCopy, threshold);
Test_Dilate<<< dim3(100,75,1), dim3(16,16,1) >>>(1600, 1200, devimg, devimgCopy); 

checkCudaErrors(cudaMemcpy(img_u, devimg,size, cudaMemcpyDeviceToHost));

It is used to process one 1600*1200 image and return the result after the two kernels operations.

When this function is called using one single CPU thread, it works fine.
When the function is called from two or more threads, the behaviour is unpredictable. It can fail when 2 or 6 or 7,… threads are used. The error is code=4(cudaErrorLaunchFailure) “cudaMalloc((void**)&devimg,size))”. It is possible to see in VS2012 that, when the program crashes, devimg is a null pointer. Making explicit inicialization of devimg inside the function (=new unsigned char[1600*1200;]) the result is the same.

Is this because while one thread is trying to access &devimg, other thread has cudaFree the pointer???

If this is the case, ¿How I should manage calls from multiple threads to this function?. Altought it seems that calling cudaSetDevice(0), the multithread calls should be managed, I have made test also using streams and cudaMemcpyAsync, and using context(API), but with no results. In some cases obtaining error in cudaMemcpy functions instead… I have also test with “–default-stream per-thread” command line option with the same result.

Visual profiler gives the error code=11(cudaErrorInvalidValue) "cudaMemcpy(devimg, img_u, size, cudaMemcpyHostToDevice). This error disappears when I define and use a hostimage pointer inside the function instead of img_u parameter

using GTX 970 and CUDA 7.0. I dont know what else to do…

The error code 4 is a kernel launch failure. It is a “sticky” cuda-context-corrupting error, which means that all subsequent cuda API calls in the same process will return that error. The reason devimg is NULL is not because the cudaMalloc operation failed but because the context was already corrupted when you called cudaMalloc. As a result, no allocation was made, and the pointer was NULL (and the cudaMalloc operation returned the “sticky” error, which actually originated somewhere else.)

The fact that the pointer was NULL is a red herring (not related to the cause of the issue).

You have a kernel launch that is failing. You’ll need to figure out why that is in the multi-threaded case. For example, you indicate you are running on windows on a GTX970 GPU, which is certainly in WDDM mode. Have you disabled the WDDM TDR mechanism?

No, this is not happening because one thread is trying to access devimg while another thread is trying to free the pointer. The pointers in question are local to the function, which is local to a thread. Therefore freeing devimg in one thread has nothing to do with the devimg pointer in another thread.

Anyway, your problem cannot be accurately diagnosed until you identify the reason for the kernel launch failure, and there is not enough information in your question to do that.

Initializing devimg this way:

=new unsigned char[1600*1200];

certainly cannot help. That is creating a host allocation. The usage of devimg in your kernels demands a device allocation.

Now Im testing the application in a GTX660 and CUDA 7.0. I have disabled now WDDM TDR and the first error I get is: code=77(cudaErrorIllegalAddress) “cudaMalloc((void**)&devimg, size)”. I dont know if it could tell you something. I can see again a null pointer in devimg so, I think that is the same problem, as you have indicated.
Could it be because there is an error inside the kernel itself (for one thread allways runs fine)??, If I comment the two lines with the kernels calls everything(cudaMallocs and cudaMemcpys) runs ok with 8 threads.

…or is it due to the thread acces…Do you know if is it necessary to make any kind of explicit bloking for the threads in order to call this function sequentially and not in parallel??.

Thanks a lot for your response.

Yes, there is an error inside the kernel itself. the code 77 cudaErrorIllegalAddress means exactly that.

You can use a method similar to what is discussed in the answer here:

to localize the illegal address error to a single line of your kernel that is failing.

Thank you very much for such an accurate diagnosis and for the information. Indeed it was a “hidden” out of bounds acces in img_u pointer inside the kernel.

Now the application runs ok with 8 threads accesing the kernel.

If I run cuda_memcheck I still get code=11(cudaErrorInvalidValue) “cudaMemcpy(devimg, img_u, size, cudaMemcpyHostToDevice)” ========= Program hit cudaErrorInvalidValue (error 11) due to “invalid argument”
on CUDA API call to cudaMemcpy, both using img_u as parameter or a hostimage pointer defined inside the function. But the application seems to run ok…

Thank you again.

Your kernel invocations as posted do not pass img_u to the kernel, so an “out of bounds acces in img_u pointer inside the kernel” is a very puzzling statement.

My guess would be that your application is still broken, then.

sorry, devimg pointer inside the kernel, not img_u.

The error code 11 using cuda-memcheck is obtained also even the two kernel calls are commented

There is probably something wrong with your img_u pointer.

Perhaps it is not allocated to the correct size. Or perhaps you are passing a host pointer instead of a device pointer. Or perhaps it has no allocation at all.

Who can say? You haven’t provided a code that would answer these questions or that anyone else could test.

This is the calling function, that receives an OpenCV Mat and pases its data to the functions that cals the kernels:

int With_CUDA(cv::Mat img){

int Tthreshold=22;
uchar *img_u=new uchar[1600*1200];;


return 0;


There are a couple problems that I can see. Without a complete test case, it’s not really clear how signficant those problems are.

Incidentally your cross-posted question on SO has been down-voted and close-voted I suspect for the same reason: you haven’t provided a complete case that someone else could work with.

Nobody wants to see your whole code, but if you can’t or won’t provide a stripped-down reduced test case that just shows the issue, then it’s often not possible to provide a crisp answer.

This is creating an allocation:

uchar *img_u=new uchar[1600*1200];

That means a pointer value has been assigned to img_u which points to the new allocation created of 1600x1200 uchar.

Now this:;

REPLACES the pointer value created by new, with a different one that is derived from the cv:Mat img. (Perhaps you think that line copies data from one place to another. It does not. You’ll need to review basic C programming concepts.) This has a couple implications:

  1. If the cv:Mat img isn’t a 1600x1200 uchar allocation, then the pointer you are now passing to your Launch_Test function is not pointing to a data region of appropriate size for your cudaMemcpy operation. This could trigger the invalid argument error mentioned previously.

  2. You mentioned this is a multithreaded application, so presumably you are calling this function repeatedly. Your allocation created by new in this function is now “lost”, and you would have no way to access it or free it (with delete). Therefore your program has a memory leak. How serious this leak is, is difficult to say, since you haven’t provided a complete case that someone else could analyze.

I’m not going to continue this process, where you dribble out a few pieces of information at a time. Perhaps someone else can help you if you want to operate this way. A more efficient use of the time for those who are willing to help you, is if you provide a complete test case. That isn’t a request for your complete application. It is a request for a simplified version that just shows the problem in question.

txbob, I really appreciate your help. Your responses have been very accurate and helpful (even with those “few pieces of information”)…