x64, VC2013, Win8.1: cudaMallocPitch blocks forever

Chenghaibo · June 24, 2015, 3:37am

When I call cudaMallocPitch, the call never returns and the program blocks forever.

Chenghaibo · June 25, 2015, 3:34am

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

using namespace cv;

int main(int argc, char *argv[])
{
    cudaSetDevice(0);
    cv::namedWindow("canny");

    cv::VideoCapture capture("video.avi");

    cv::Mat image_h1(720, 480, CV_32FC3);
    cv::Mat image_h2(720, 480, CV_32FC1);

    gpu::GpuMat device_image;
    // The program hangs here, inside cudaMallocPitch. Why? Can anyone help me?
    gpu::GpuMat device_edge(image_h2);

    for (;;) {
        printf("that ok\r\n");
        capture >> image_h1;

        // Stop when the video has no more frames.
        if (image_h1.empty())
            break;

        cvtColor(image_h1, image_h2, CV_RGB2GRAY);
        device_image.upload(image_h2);
        gpu::Canny(device_image, device_edge, 50., 100.);
        // Download the edge map (the original downloaded the input image).
        device_edge.download(image_h2);

        cv::imshow("canny", image_h2);
        cv::waitKey(30);
    }

    // image_d.release();
    // image_e.release();
    // image_h1.release();
    // capture.release();

    return 0;
}

Robert_Crovella · June 25, 2015, 4:06am

Maybe your system just isn’t working properly. Can you run other CUDA codes, such as the CUDA vectorAdd sample code?

HannesF99 · July 17, 2015, 5:29pm

I am just having the same problem in my application (Visual Studio 2008, Cuda Toolkit 5.0, 64-bit).

On my Maxwell card (GTX 960), ‘cudaMallocPitch’ never returns (hangs forever); on my Kepler card (GTX 770), it works perfectly.

I had a similar problem with a cudaMemcpy2D call. That one worked perfectly on both cards with Visual Studio 2013 and Cuda Toolkit 7.0.

My conclusion: older toolkits on newer cards might be a problem, at least with GeForce cards.

In our application, all GPU functions are assigned to a ‘GPUWorker’ object, which handles one GPU and executes them exclusively. As far as I remember, it worked when I called the API functions directly, but we have to go through the worker in order to stay CPU-thread-safe.

njuffa · July 17, 2015, 6:31pm

While it is best to use a CUDA version no older than the GPU architecture used, applications built with older CUDA versions should still work as long as the driver stack is up to date and includes support for the GPU in question. This requires that a PTX version of each CUDA kernel is embedded in the app, so that the driver can JIT compile it into machine code for the new-architecture GPU.

If you have self-contained repro code that demonstrates these allocation call hangs, it would probably be a good idea to file a bug with NVIDIA, using the reporting form linked from the registered developer website. Make sure that cuda-memcheck shows no other errors when running the code.

Uncle_Joe · July 17, 2015, 11:11pm

I’ve seen this before, when I was using the CUDPP library for device-function-level prefix sums.

The problem was that the CUDPP library was compiled for SM 2.0 and SM 3.0, while my EXE was only compiled for SM 5.0.

I filed a bug report and got an excellent explanation:

From 4/18/2014

Here is a summary of this issue that updated from our developer team:

The CUDPP library is maintained externally, so its build configuration is not updated to compile for a new architecture whenever a new GPU or GPU architecture is released.

If the CUDPP library is not compiled for the architecture of the GPU on which the CUDPP-linked app runs, the CUDA driver does not find an ELF image of the right architecture version to load.

In such cases, failure to find the right ELF forces JIT compilation of the CUDPP library.

This JIT compilation [which happens inside cudaHostAlloc()] of the entire library, which is huge, appears to hang inside the JIT PTX compiler, but it is just taking a long time to compile and IS NOT a hang.

This is not a bug in cuda runtime nor cuda compiler but is what is expected to happen.

Is your OpenCV compiled for the same platforms as your EXE?
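
To make the explanation above concrete, here is a minimal sketch of the selection rule it describes: use embedded machine code when it matches the GPU, otherwise fall back to JIT-compiling embedded PTX. The names Arch, LoadPath and predictLoadPath are made up for illustration and are not part of any CUDA API. The compatibility rules are as I understand them from the CUDA Programming Guide: SASS is usable only within the same major compute capability at an equal or higher minor version, and PTX is usable on any equal or higher compute capability.

```cpp
#include <vector>

// Compute capability, e.g. {5, 2} for cc5.2.
struct Arch {
    int ccMajor;
    int ccMinor;
};

enum class LoadPath { NativeSass, JitFromPtx, NoUsableCode };

// Hypothetical helper (not a CUDA API): predicts how the driver would obtain
// machine code for `device`, given the targets embedded in a binary.
LoadPath predictLoadPath(Arch device,
                         const std::vector<Arch>& sassTargets,
                         const std::vector<Arch>& ptxTargets) {
    // SASS is binary-compatible only within the same major version, and only
    // if it was built for a minor version <= the device's minor version.
    for (const Arch& s : sassTargets) {
        if (s.ccMajor == device.ccMajor && s.ccMinor <= device.ccMinor)
            return LoadPath::NativeSass;
    }
    // PTX can be JIT-compiled for any device of equal or higher capability.
    for (const Arch& p : ptxTargets) {
        bool lowerMajor = p.ccMajor < device.ccMajor;
        bool sameMajorLowerMinor =
            p.ccMajor == device.ccMajor && p.ccMinor <= device.ccMinor;
        if (lowerMajor || sameMajorLowerMinor)
            return LoadPath::JitFromPtx;
    }
    // Neither matching SASS nor usable PTX: kernels cannot be loaded at all.
    return LoadPath::NoUsableCode;
}
```

For a library built for cc2.0 and cc3.0 (SASS and PTX), this gives JitFromPtx on a cc5.2 Maxwell card such as the GTX 960 and NativeSass on a cc3.0 Kepler card such as the GTX 770, which would match the behaviour reported earlier in this thread.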

njuffa · July 17, 2015, 11:29pm

This is an excellent working hypothesis, especially if the hanging cudaMalloc* call is the first in the application, which triggers CUDA context initialization and thus potentially JIT compilation. I assumed that the people stating that the API calls hang “forever” had actually tried a reasonable approximation of “forever”, such as letting the app sit overnight; but maybe not.

HannesF99 · July 30, 2015, 2:08pm

OK, interesting. Actually, we also have a link dependency on the CUDPP library.

In my case, hanging ‘forever’ meant one minute or more; I can’t remember exactly. So ‘forever’ is exaggerated.

For Cuda Toolkit 5.0, we compile only for 2.0 and 3.0 (both PTX and binary code).

Does the JIT compilation cover only the CUDA-accelerated library itself, or also all dependent libraries (like CUDPP) that contain CUDA kernels?

Next question: can I somehow detect during a ‘hang’ that a JIT compilation is in progress? It should be visible in the Task Manager; the JIT compiler should show up there, shouldn’t it?

Can I ‘force’ the JIT compilation to occur at a certain point (e.g. when starting the application) via a CUDA runtime or driver API call?

Robert_Crovella · July 30, 2015, 3:46pm

Any GPU kernels that are not compiled for the runtime detected architecture will need to be JIT-compiled. So for a library like CUDPP, if that library was compiled for cc2.0 and cc3.0, and you run it on a cc5.0 device, JIT-compiling will occur.

The JIT-compile should occur, all at once, at the point of runtime initialization. Thus you should be able to “force” this with a call to cudaFree(0); at the beginning of your application.

However, runtime initialization is a kind of “lazy” initialization or “fire and forget”. So it’s not necessarily all complete by the time control is returned to the CPU thread for processing of the next line of your code after the call to cudaFree(0); following the above example. Certainly by the time of your first kernel call (and probably before) the runtime initialization should be complete. At that point, all JIT-compiling would be complete; the JIT-compile process does not smear itself out over your entire application.
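
A rough sketch of the cudaFree(0) warm-up suggested above, with a timer around it so that a long JIT compile shows up as a number rather than as an apparent hang. The helper names elapsedMs and warmUpCudaRuntime are invented for this example. The __has_include guard is only there so the timing helper still compiles on a machine without the CUDA toolkit headers. Given the caveat above about lazy initialization, treat the reported time as a lower bound.

```cpp
#include <chrono>
#include <cstdio>

// Pull in the CUDA runtime only when its header is available.
#if defined(__has_include)
#  if __has_include(<cuda_runtime.h>)
#    include <cuda_runtime.h>
#    define HAVE_CUDA_RUNTIME 1
#  endif
#endif

// Runs `fn` once and returns the elapsed wall-clock time in milliseconds.
template <typename Fn>
double elapsedMs(Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical helper: triggers CUDA runtime initialization (and with it any
// JIT compilation) via cudaFree(0) and reports how long the call took.
// Returns a negative value if CUDA is unavailable or the call failed.
double warmUpCudaRuntime() {
#ifdef HAVE_CUDA_RUNTIME
    cudaError_t err = cudaSuccess;
    double ms = elapsedMs([&] { err = cudaFree(0); });
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    std::printf("CUDA runtime initialization took %.1f ms\n", ms);
    return ms;
#else
    return -1.0;  // built without the CUDA toolkit headers
#endif
}
```

Calling warmUpCudaRuntime() as the first statement of main(), before any OpenCV GPU call, should move the delay to a known place and make it measurable.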

HannesF99 · July 30, 2015, 5:17pm

OK, I see. I ask because I definitely issue a few CUDA API calls (cudaSetDevice, cudaGetDeviceProperties) before the call that then hangs on Maxwell (a cudaMemcpy between host and device, and I think also cudaMallocPitch).

Maybe there are CUDA API functions which do not have to wait for JIT compilation to finish (like cudaSetDevice), and other CUDA API functions which do have to wait for it.

Regarding this, I would really appreciate more insight into the JIT process, e.g. an environment variable for a ‘verbose mode’ which, when set, makes the JIT compiler write logging information to a file.

njuffa · July 30, 2015, 5:33pm

I see I was sloppy above. I should have referred specifically to CUDA runtime context creation. The CUDA Programming Guide states (emphasis mine): “There is no explicit initialization function for the runtime; it initializes the first time a runtime function is called (more specifically any function other than functions from the device and version management sections of the reference manual).[…] As part of this context creation, the device code is just-in-time compiled if necessary (see Just-in-Time Compilation) and loaded into device memory.” This is why a call to cudaSetDevice() does not trigger JIT compilation.

You can have finer control over JIT compilation by using the CUDA driver interface instead of the CUDA runtime, as the loading and JIT compilation of code requires specific API calls at that level, i.e. it is not something that happens under the hood. This approach is typically used by apps that dynamically compile code at application run time.

In general, I would advise avoiding JIT compilation and instead building all code as a fat binary with SASS (machine code) for all currently shipping/supported architectures (sm_20 through sm_52), plus PTX only for the latest architecture so that it can be JIT compiled on future GPU architectures.

HannesF99 · July 31, 2015, 8:04am

OK, so that means

for Cuda Toolkit 5.0
-gencode arch=compute_11,code=sm_11
-gencode arch=compute_13,code=sm_13
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_30,code=sm_30
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_30,code=compute_30

for Cuda Toolkit 7.0
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_30,code=sm_30
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_52,code=sm_52
-gencode arch=compute_50,code=compute_50

(excluding Tegra, with compute capability 3.2)

Is that right? Or did I miss something?

Unfortunately, I suppose that compiling the OpenCV GPU module and CUDPP that way will make the DLLs even bigger. They are already hundreds of megabytes.

Note there is a similar thread (CUDPP related) at
https://devtalk.nvidia.com/default/topic/546956/cudamalloc-hangs-for-several-minutes-on-titans-on-centos5_x64/
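
For anyone generating such flag lists in a build script, here is a small sketch that assembles them from a table. SassTarget and buildGencodeFlags are hypothetical names for this example, not part of nvcc or any build system. It follows the "SASS for every architecture, PTX only for the latest" advice literally, so the single PTX entry uses whichever virtual architecture is passed as ptxArch. Read that way, the advice would suggest compute_35 for Toolkit 5.0 and compute_52 for Toolkit 7.0, rather than the compute_30 and compute_50 lines in the lists above.

```cpp
#include <string>
#include <vector>

// One real (SASS) target and the virtual architecture it is compiled from.
// sm_21 is built from compute_20, as in the flag lists above.
struct SassTarget {
    std::string virtualArch;  // e.g. "20" for compute_20
    std::string realArch;     // e.g. "21" for sm_21
};

// Hypothetical helper: builds nvcc -gencode flags with SASS for every listed
// target and PTX for exactly one virtual architecture, `ptxArch`.
std::vector<std::string> buildGencodeFlags(const std::vector<SassTarget>& sass,
                                           const std::string& ptxArch) {
    std::vector<std::string> flags;
    for (const SassTarget& t : sass) {
        flags.push_back("-gencode arch=compute_" + t.virtualArch +
                        ",code=sm_" + t.realArch);
    }
    // code=compute_XX embeds PTX, which the driver can JIT on newer GPUs.
    flags.push_back("-gencode arch=compute_" + ptxArch +
                    ",code=compute_" + ptxArch);
    return flags;
}
```

Passing the six Toolkit 7.0 SASS targets from the list above together with ptxArch "52" yields seven flags, the last one being -gencode arch=compute_52,code=compute_52.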

Chenghaibo · March 9, 2016, 4:05am

Thanks to everyone, especially Uncle_Joe.
