Hello,
I am developing an application for the NVIDIA Drive PX 2 AutoChauffeur platform (Tegra X2 with a GP106 GPU, running Drive Linux 5.0.5.0b). A portion of the application is shown below:
<some CPP code>
<custom CUDA kernel>
<some CPP code>
<NPP-based CUDA kernel>
<some CPP code>
The application runs the above sequence in a loop over multiple images.
When I run the application on its own on the target, it works without any problems. However, when I profile the executable with nvprof, the CUDA kernel written with NPP (NVIDIA Performance Primitives) fails with a runtime error in the first loop iteration. In subsequent iterations, both the custom CUDA kernel and the NPP-based kernel fail with errors related to device memory allocation.
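Since the later iterations fail during device memory allocation, one way to narrow this down is to log free device memory at the start of every loop iteration, once with and once without nvprof. This is a minimal sketch using the CUDA runtime call cudaMemGetInfo(); the helper name log_device_memory() is hypothetical and not part of the original application.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: log free/total device memory for one loop iteration.
// Returns false if the query itself fails (for example, after an earlier
// kernel error has left the CUDA context unusable).
bool log_device_memory(int iteration) {
    size_t free_bytes = 0;
    size_t total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "iteration %d: cudaMemGetInfo failed: %s\n",
                     iteration, cudaGetErrorName(err));
        return false;
    }
    // Shift by 20 bits to print mebibytes.
    std::printf("iteration %d: %zu MiB free of %zu MiB\n", iteration,
                free_bytes >> 20, total_bytes >> 20);
    return true;
}
```

If free memory shrinks from one iteration to the next, some allocation is probably not being released; if cudaMemGetInfo() itself starts failing right after the first iteration, that would instead be consistent with the first-iteration kernel error having broken the context, rather than genuine memory exhaustion.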
The NPP-based kernel looks somewhat like this:
...
cudaMalloc();
cudaMemcpy(Host2Device);
cudaMalloc();
cudaMemcpy(Host2Device);
status = npp_kernel();
cudaDeviceSynchronize();
cudaMemcpy(Device2Host);
cudaFree();
cudaFree();
...
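To see which call fails first under nvprof, the sequence above can be written with every return value checked. This is a minimal sketch, assuming float data: nppsAdd_32f_I() (which computes pSrcDst[i] += pSrc[i]) is used only as a stand-in for the unspecified npp_kernel(), and npp_step() and check() are hypothetical names. It assumes building with nvcc and linking the NPP signal and core libraries (assumed here to be -lnpps -lnppc, as in CUDA 9.x-era toolkits).

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <npps.h>  // NPP signal primitives (assumed: link with -lnpps -lnppc)

// Print a failing CUDA runtime call by name; returns true on success.
static bool check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s (%d): %s\n", what,
                     cudaGetErrorName(err), static_cast<int>(err),
                     cudaGetErrorString(err));
        return false;
    }
    return true;
}

// One iteration of: cudaMalloc, H2D copy, cudaMalloc, H2D copy, NPP call,
// cudaDeviceSynchronize, D2H copy, cudaFree, cudaFree.
// nppsAdd_32f_I (h_srcdst[i] += h_src[i]) is a stand-in for npp_kernel().
bool npp_step(const float* h_src, float* h_srcdst, int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(Npp32f);
    Npp32f* d_src = nullptr;
    Npp32f* d_srcdst = nullptr;
    bool ok = false;

    // && stops at the first failing call, so later steps are skipped.
    if (check(cudaMalloc(&d_src, bytes), "cudaMalloc(d_src)") &&
        check(cudaMemcpy(d_src, h_src, bytes, cudaMemcpyHostToDevice),
              "cudaMemcpy H2D (d_src)") &&
        check(cudaMalloc(&d_srcdst, bytes), "cudaMalloc(d_srcdst)") &&
        check(cudaMemcpy(d_srcdst, h_srcdst, bytes, cudaMemcpyHostToDevice),
              "cudaMemcpy H2D (d_srcdst)")) {
        NppStatus st = nppsAdd_32f_I(d_src, d_srcdst, n);
        if (st != NPP_SUCCESS) {  // any non-success status, warnings included
            std::fprintf(stderr, "nppsAdd_32f_I returned NppStatus %d\n",
                         static_cast<int>(st));
        } else if (check(cudaGetLastError(), "kernel launch inside NPP") &&
                   check(cudaDeviceSynchronize(), "cudaDeviceSynchronize") &&
                   check(cudaMemcpy(h_srcdst, d_srcdst, bytes,
                                    cudaMemcpyDeviceToHost),
                         "cudaMemcpy D2H")) {
            ok = true;
        }
    }

    // Always release the buffers, also on the error path;
    // cudaFree(nullptr) is a no-op.
    check(cudaFree(d_srcdst), "cudaFree(d_srcdst)");
    check(cudaFree(d_src), "cudaFree(d_src)");
    return ok;
}
```

Calling npp_step() for each image in place of the original sequence prints the first failing call by name; checking cudaGetLastError() right after the NPP call helps separate launch errors from errors that only appear once the GPU is synchronized.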
Initially, the call to cudaDeviceSynchronize() returned the error code cudaErrorLaunchFailure (4). When I removed the synchronization call, cudaMemcpy(Device2Host) started returning a runtime error instead. In addition, nvprof shows the following output:
nvrm_gpu: Bug 200215060 workaround enabled.
==<PID>== NVPROF is profiling process <PID>, command: ./<myapp> <testinput> <supportfiles>
nvrm_gpu: Bug 200215060 workaround enabled.
==<PID>== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
...
==<PID>== Profiling application: ./<myapp> <testinput> <supportfiles>
==<PID>== Warning: Found X invalid records in the result.
==<PID>== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==<PID>== Profiling result:
...
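The way the error moved from cudaDeviceSynchronize() to cudaMemcpy(Device2Host) is consistent with how CUDA reports kernel errors: launches are asynchronous, so a failure that happens while a kernel runs is returned by the next call that waits for the GPU, and a launch failure is sticky, so later unrelated calls return it as well. This is a minimal sketch illustrating that behavior with a deliberately failing kernel. faulting_kernel(), report() and demo_deferred_error() are hypothetical names, __trap() is used only to force a failure, and the exact error name printed may vary between CUDA versions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that aborts on purpose so it fails at run time.
__global__ void faulting_kernel() { __trap(); }

static void report(const char* what, cudaError_t err) {
    std::printf("%-26s -> %s\n", what, cudaGetErrorName(err));
}

// Returns the error reported by cudaDeviceSynchronize() after the launch.
cudaError_t demo_deferred_error() {
    int* d_buf = nullptr;
    int h_val = 0;
    cudaError_t err = cudaMalloc(&d_buf, sizeof(int));
    if (err != cudaSuccess) return err;  // setup failed before the demo started

    faulting_kernel<<<1, 1>>>();
    // The launch is normally accepted; this mainly catches configuration errors.
    report("cudaGetLastError", cudaGetLastError());

    // The execution failure shows up at the first call that waits for the GPU.
    cudaError_t sync = cudaDeviceSynchronize();
    report("cudaDeviceSynchronize", sync);

    // The error is sticky: a later, unrelated copy fails with it as well.
    // Without the synchronize above, this copy would be the first to report it.
    report("cudaMemcpy D2H",
           cudaMemcpy(&h_val, d_buf, sizeof(int), cudaMemcpyDeviceToHost));
    report("cudaFree", cudaFree(d_buf));
    return sync;
}
```

Printing cudaGetErrorName() at each step is more robust than comparing against numeric codes such as 4, since the numeric values of the cudaError_t enum are not guaranteed to stay the same across CUDA releases.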
Has anyone ever faced this kind of problem before? If so, please help me out.
Thank you.