x64, VC2013, Win 8.1: cudaMallocPitch blocks forever

When I call cudaMallocPitch, the API call blocks the program forever.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

using namespace cv;

int main(int argc, char *argv[])
{
    cv::VideoCapture capture("video.avi");

    cv::Mat image_h1(720, 480, CV_32FC3);
    cv::Mat image_h2(720, 480, CV_32FC1);

    gpu::GpuMat device_image;
    // Hangs here inside cudaMallocPitch. Why? Can anyone help me?
    gpu::GpuMat device_edge(image_h2);

    for (;;) {
        printf("that ok\r\n");
        capture >> image_h1;

        // Stop when the video has no more frames.
        if (image_h1.empty())
            break;

        cvtColor(image_h1, image_h2, CV_RGB2GRAY);
        // Upload the grayscale frame so that Canny has input on the device.
        device_image.upload(image_h2);
        gpu::Canny(device_image, device_edge, 50., 100.);

        cv::imshow("canny", image_h2);
    }

    // image_d.release();
    // image_h1.release();
    // capture.release();

    return 0;
}


Maybe your system just isn’t working properly. Can you run other CUDA codes, such as the CUDA vectorAdd sample?

I am having the same problem in my application (Visual Studio 2008, CUDA Toolkit 5.0, 64-bit).

On my Maxwell card (GTX 960), ‘cudaMallocPitch’ never returns (hangs forever); on my Kepler card (GTX 770), it works perfectly.

I had a similar problem with the cudaMemcpy2D function as well. There, it worked perfectly on both cards with Visual Studio 2013 and CUDA Toolkit 7.0.

My conclusion: older toolkits on newer cards might be a problem, at least with GeForce cards.

In our application, all GPU functions are assigned to a ‘GPUWorker’ object (which handles one GPU) that executes them exclusively. If I remember correctly, it worked when I called the API functions directly, but we have to go through the worker in order to stay CPU-thread-safe.

While it is best to use a CUDA version no older than the GPU architecture used, applications generated with older CUDA versions should work as long as the driver stack is up to date and includes support for the GPU in question, provided a PTX version of each CUDA kernel is embedded in the app that the driver can JIT compile to generate machine code for the new-architecture GPU.

If you have self-contained repro code that demonstrates these allocation call hangs, it would probably be a good idea to file a bug with NVIDIA, using the reporting form linked from the registered developer website. Make sure that cuda-memcheck shows no other errors when running the code.

I’ve seen this before, when I was using the CUDPP library for device-function-level prefix sums.

The problem was that the CUDPP library was compiled for SM 2.0 and SM 3.0, while my EXE was compiled only for SM 5.0.

I filed a bug report and got an excellent explanation:

Is your OpenCV compiled for the same platforms as your EXE?

This is an excellent working hypothesis, especially if the hanging cudaMalloc* call is the first in the application, which triggers CUDA context initialization and thus potentially JIT compilation. I assumed that the people stating that the API calls hang “forever” had actually tried a reasonable approximation of “forever”, such as letting the app sit overnight; but maybe not.

OK, interesting. Actually, we also have a link dependency on the CUDPP library.

Hanging ‘forever’ in my case meant one minute or more; I can’t remember exactly. So ‘forever’ is an exaggeration.

For CUDA Toolkit 5.0, we compile only for 2.0 and 3.0 (both PTX and binary code).

Does JIT compilation cover only the CUDA-accelerated library itself, or also all dependent libraries (like CUDPP) that contain CUDA kernels?

Next question: can I somehow detect during a ‘hang’ that JIT compilation is currently in progress? It should be visible in the Task Manager; the JIT compiler should show up there, shouldn’t it?

Can I ‘force’ the JIT compilation to occur at a certain point (e.g. when starting the application) via a CUDA runtime (or driver) API function call?

Any GPU kernels that are not compiled for the runtime detected architecture will need to be JIT-compiled. So for a library like CUDPP, if that library was compiled for cc2.0 and cc3.0, and you run it on a cc5.0 device, JIT-compiling will occur.
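To make that rule concrete, here is a minimal sketch of the check. It is an illustration only, not driver code: `needs_jit` is a hypothetical helper, compute capabilities are encoded as integers (30 for 3.0, 52 for 5.2), and it assumes the documented binary-compatibility rule that machine code built for X.y runs only on devices of capability X.z with z >= y.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not part of any CUDA API.
// Returns true when none of the embedded machine-code (SASS) targets can run
// natively on the device, so the driver would have to JIT-compile embedded PTX.
bool needs_jit(const std::vector<int>& sass_targets, int device_cc) {
    for (int target : sass_targets) {
        const bool same_major = (target / 10) == (device_cc / 10);
        const bool minor_ok = (target % 10) <= (device_cc % 10);
        if (same_major && minor_ok) {
            return false;  // a native binary exists for this device
        }
    }
    return true;  // no compatible SASS: fall back to JIT from PTX
}
```

For the CUDPP case described above, `needs_jit({20, 30}, 50)` is true, while `needs_jit({20, 30}, 35)` is false because the cc3.0 binary also runs on a cc3.5 device.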

The JIT-compile should occur, all at once, at the point of runtime initialization. Thus you should be able to “force” this with a call to cudaFree(0); at the beginning of your application.

However, runtime initialization is a kind of “lazy” initialization or “fire and forget”. So it’s not necessarily all complete by the time control is returned to the CPU thread for processing of the next line of your code after the call to cudaFree(0); following the above example. Certainly by the time of your first kernel call (and probably before) the runtime initialization should be complete. At that point, all JIT-compiling would be complete; the JIT-compile process does not smear itself out over your entire application.
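A rough way to tell a long JIT compile from a real hang is to time the first context-creating call at startup. The sketch below deliberately keeps CUDA out of the picture so that it compiles without the toolkit: `timed_ms` and `fake_context_init` are hypothetical names, and the sleep merely stands in for a call such as cudaFree(0).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: runs `first_call` once and returns the elapsed
// wall-clock time in milliseconds.
double timed_ms(const std::function<void()>& first_call) {
    const auto start = std::chrono::steady_clock::now();
    first_call();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Stand-in (assumption) for the first context-creating CUDA call.
// In a CUDA build this body would be replaced by cudaFree(0).
void fake_context_init() {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
}
```

In a real application one would pass a lambda that calls cudaFree(0) and log the result. A large value on a new-architecture card and a small one on an older card would be consistent with JIT compilation, although, given the lazy initialization described above, the measurement may not capture all of the work.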

OK, I see. I ask because I definitely issue a few CUDA API calls (cudaSetDevice, cudaGetDeviceProperties) before the call (cudaMemcpy host <-> device, and I think also cudaMallocPitch) that then hangs on Maxwell.

Maybe there are CUDA API functions which do not have to wait for JIT compilation to finish (like cudaSetDevice), and other CUDA API functions which have to wait for JIT compilation to finish.

Regarding this, I would really appreciate more insight into the JIT process, e.g. through an environment variable for a ‘verbose mode’ which, when set, lets the JIT compiler write logging information to a file.

I see I was sloppy above. I should have referred specifically to CUDA runtime context creation. The CUDA Programming Guide states (emphasis mine): “There is no explicit initialization function for the runtime; it initializes the first time a runtime function is called (more specifically any function other than functions from the device and version management sections of the reference manual).[…] As part of this context creation, the device code is just-in-time compiled if necessary (see Just-in-Time Compilation) and loaded into device memory.” This is why a call to cudaSetDevice() does not trigger JIT compilation.

You can have finer control over JIT compilation by using the CUDA driver interface instead of the CUDA runtime, as the loading and JIT compilation of code requires specific API calls at that level, i.e. it is not something that happens under the hood. This approach is typically used by apps that dynamically compile code at application run time.

In general, I would advise avoiding JIT compilation and instead building all code as a fat binary with SASS (machine code) for all currently shipping/supported architectures (sm_20 through sm_52), plus PTX only for the latest architecture, to be JIT compiled on future GPU architectures.
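As a sketch of what that rule produces on the nvcc command line, the hypothetical helper below (not part of any toolkit) emits one SASS `-gencode` entry per architecture and a single PTX entry for the last, newest one. It assumes the list is sorted in ascending order and that each architecture uses the virtual architecture of the same number, so a case like sm_21, which is built from compute_20, would need special handling.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: builds nvcc -gencode flags with SASS for every listed
// architecture and PTX only for the newest one (the last entry).
std::string gencode_flags(const std::vector<int>& archs) {
    std::string flags;
    for (int a : archs) {
        const std::string n = std::to_string(a);
        // SASS (machine code) for this real architecture.
        flags += "-gencode arch=compute_" + n + ",code=sm_" + n + " ";
    }
    if (!archs.empty()) {
        const std::string n = std::to_string(archs.back());
        // PTX for the newest architecture, for JIT on future GPUs.
        flags += "-gencode arch=compute_" + n + ",code=compute_" + n;
    }
    return flags;
}
```

For example, `gencode_flags({50, 52})` yields SASS entries for sm_50 and sm_52 followed by `-gencode arch=compute_52,code=compute_52`.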

OK, so that means:

For CUDA Toolkit 5.0:
-gencode arch=compute_11,code=sm_11
-gencode arch=compute_13,code=sm_13
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_30,code=sm_30
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_30,code=compute_30

For CUDA Toolkit 7.0:
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_30,code=sm_30
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_52,code=sm_52
-gencode arch=compute_50,code=compute_50

(excluding the Tegra targets with compute capability 3.2)

Is that right? Or did I miss something?

Unfortunately, I suppose that compiling the OpenCV GPU module and CUDPP that way will make the DLLs even bigger. They are already hundreds of megabytes.

Note there is a similar thread (CUDPP related) at

Thanks to everyone, especially to uncle joe.