I am developing an application for the NVIDIA Drive PX 2 AutoChauffeur platform (Tegra X2 with a discrete GP106 GPU, running Drive Linux 220.127.116.11b). A portion of this application is shown below:
<some CPP code> <custom CUDA kernel> <some CPP code> <NPP-based CUDA kernel> <some CPP code>
The application runs the above sequence in a loop over multiple images.
When I execute the application by itself on the target, it runs without any problems. But when I profile the same executable with nvprof, the CUDA kernel that uses NPP (NVIDIA Performance Primitives) fails with a runtime error in the first loop iteration, and in subsequent iterations both the custom CUDA kernel and the NPP-based kernel fail with errors related to device memory allocation.
The NPP-based kernel looks somewhat like this:
...
cudaMalloc();
cudaMemcpy(Host2Device);
cudaMalloc();
cudaMemcpy(Host2Device);
status = npp_kernel();
cudaDeviceSynchronize();
cudaMemcpy(Device2Host);
cudaFree();
cudaFree();
...
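For reference, here is a minimal sketch of that sequence with explicit error checking added. The actual NPP primitive, buffer names, and sizes are placeholders (the real kernel is not shown above), and checkCuda() is a helper I wrote for illustration:

```cpp
#include <cuda_runtime.h>
#include <npp.h>    // for NppStatus / NPP_SUCCESS
#include <cstdio>

// Illustrative helper: print the failing call and its error string.
static void checkCuda(cudaError_t err, const char *what) {
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
}

void processImage(const unsigned char *hSrc, unsigned char *hDst, size_t bytes) {
    unsigned char *dSrc = nullptr, *dDst = nullptr;
    checkCuda(cudaMalloc(&dSrc, bytes), "cudaMalloc(dSrc)");
    checkCuda(cudaMalloc(&dDst, bytes), "cudaMalloc(dDst)");
    checkCuda(cudaMemcpy(dSrc, hSrc, bytes, cudaMemcpyHostToDevice), "H2D copy");

    // npp_kernel() stands in for whatever NPP primitive is actually called.
    NppStatus status = /* npp_kernel(dSrc, dDst, ...) */ NPP_SUCCESS;
    if (status != NPP_SUCCESS)
        fprintf(stderr, "NPP call returned status %d\n", (int)status);

    // Under nvprof, this is where cudaErrorLaunchFailure (4) is reported.
    checkCuda(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    checkCuda(cudaMemcpy(hDst, dDst, bytes, cudaMemcpyDeviceToHost), "D2H copy");
    checkCuda(cudaFree(dSrc), "cudaFree(dSrc)");
    checkCuda(cudaFree(dDst), "cudaFree(dDst)");
}
```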
Initially, the call to cudaDeviceSynchronize() was returning cudaErrorLaunchFailure (error code 4). When I removed the synchronization call, the cudaMemcpy(Device2Host) started failing with a runtime error instead. Additionally, nvprof shows the following output:
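One detail that may matter here: cudaErrorLaunchFailure is a "sticky" error, so once a kernel launch has failed, every subsequent CUDA runtime call in the process (including cudaMalloc and cudaMemcpy) keeps returning an error, which would explain the allocation failures in later loop iterations. Removing cudaDeviceSynchronize() does not avoid the failure; it only moves the point where the error is first observed to the next synchronizing call (the D2H cudaMemcpy). A sketch of how I try to localize the first failing launch (the CUDA_CHECK macro is my own, not from any library):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative helper: report file/line of the first failing runtime call.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess)                                      \
            fprintf(stderr, "%s:%d: %s -> %s\n", __FILE__, __LINE__,  \
                    #call, cudaGetErrorString(err_));                 \
    } while (0)

// Placed immediately after the NPP call, before any other runtime call
// can obscure where the error originated:
//   CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution
```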
nvrm_gpu: Bug 200215060 workaround enabled.
==<PID>== NVPROF is profiling process <PID>, command: ./<myapp> <testinput> <supportfiles>
nvrm_gpu: Bug 200215060 workaround enabled.
==<PID>== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
...
==<PID>== Profiling application: ./<myapp> <testinput> <supportfiles>
==<PID>== Warning: Found X invalid records in the result.
==<PID>== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==<PID>== Profiling result:
...
Has anyone faced this kind of problem before? Any help would be appreciated.