I am developing an application for the NVIDIA Drive PX 2 AutoChauffeur platform (Tegra X2 with a discrete GP106 GPU, running Drive Linux 220.127.116.11b). A portion of this application is shown below:
<some CPP code> <custom CUDA kernel> <some CPP code> <NPP-based CUDA kernel> <some CPP code>
The application runs the above sequence in a loop over multiple images.
When I execute the application by itself on the target, it runs without any problems. But when I profile the same executable with nvprof, the CUDA kernel that uses NPP (NVIDIA Performance Primitives) fails with a runtime error in the first loop iteration, and in subsequent iterations both the custom CUDA kernel and the NPP-based kernel fail with errors related to device memory allocation.
The NPP-based kernel looks somewhat like this:
...
cudaMalloc();
cudaMemcpy(Host2Device);
cudaMalloc();
cudaMemcpy(Host2Device);
status = npp_kernel();
cudaDeviceSynchronize();
cudaMemcpy(Device2Host);
cudaFree();
cudaFree();
...
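For reference, here is a minimal sketch of that sequence with explicit error checking added. The actual NPP primitive, buffer names, and sizes are placeholders (the real kernel is not shown above), and checkCuda() is a helper I wrote for illustration:

```cpp
#include <cuda_runtime.h>
#include <npp.h>    // for NppStatus / NPP_SUCCESS
#include <cstdio>

// Illustrative helper: print the failing call and its error string.
static void checkCuda(cudaError_t err, const char *what) {
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
}

void processImage(const unsigned char *hSrc, unsigned char *hDst, size_t bytes) {
    unsigned char *dSrc = nullptr, *dDst = nullptr;
    checkCuda(cudaMalloc(&dSrc, bytes), "cudaMalloc(dSrc)");
    checkCuda(cudaMalloc(&dDst, bytes), "cudaMalloc(dDst)");
    checkCuda(cudaMemcpy(dSrc, hSrc, bytes, cudaMemcpyHostToDevice), "H2D copy");

    // npp_kernel() stands in for whatever NPP primitive is actually called.
    NppStatus status = /* npp_kernel(dSrc, dDst, ...) */ NPP_SUCCESS;
    if (status != NPP_SUCCESS)
        fprintf(stderr, "NPP call returned status %d\n", (int)status);

    // Under nvprof, this is where cudaErrorLaunchFailure (4) is reported.
    checkCuda(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    checkCuda(cudaMemcpy(hDst, dDst, bytes, cudaMemcpyDeviceToHost), "D2H copy");
    checkCuda(cudaFree(dSrc), "cudaFree(dSrc)");
    checkCuda(cudaFree(dDst), "cudaFree(dDst)");
}
```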
Initially, the call to cudaDeviceSynchronize() was returning cudaErrorLaunchFailure (error code 4). When I removed the synchronization call, the cudaMemcpy(Device2Host) started failing with a runtime error instead. Additionally, nvprof shows the following output:
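One detail that may matter here: cudaErrorLaunchFailure is a "sticky" error, so once a kernel launch has failed, every subsequent CUDA runtime call in the process (including cudaMalloc and cudaMemcpy) keeps returning an error, which would explain the allocation failures in later loop iterations. Removing cudaDeviceSynchronize() does not avoid the failure; it only moves the point where the error is first observed to the next synchronizing call (the D2H cudaMemcpy). A sketch of how I try to localize the first failing launch (the CUDA_CHECK macro is my own, not from any library):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative helper: report file/line of the first failing runtime call.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess)                                      \
            fprintf(stderr, "%s:%d: %s -> %s\n", __FILE__, __LINE__,  \
                    #call, cudaGetErrorString(err_));                 \
    } while (0)

// Placed immediately after the NPP call, before any other runtime call
// can obscure where the error originated:
//   CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution
```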
nvrm_gpu: Bug 200215060 workaround enabled.
==<PID>== NVPROF is profiling process <PID>, command: ./<myapp> <testinput> <supportfiles>
nvrm_gpu: Bug 200215060 workaround enabled.
==<PID>== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
...
==<PID>== Profiling application: ./<myapp> <testinput> <supportfiles>
==<PID>== Warning: Found X invalid records in the result.
==<PID>== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==<PID>== Profiling result:
...
Has anyone faced this kind of problem before? Any help would be appreciated.