Hi, I am profiling a simple kernel that takes an object by value (through a copy constructor; the object has cudaMalloc'ed pointer members). The cudaFree() call inside the destructor that runs when the kernel exits appears to take as long as the kernel itself (it shows the same duration as the kernel when profiling with “Profile CUDA Application”), while the other cudaFree() calls are measured correctly. Is this a known issue, or is it normal, language-specific behavior? Thanks!
I am using VS2010 with Nsight 3.0 and CUDA 5.0.
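
For reference, here is a minimal sketch of the kind of setup described above. It is not the actual code: the names `DeviceBuffer` and `scale`, the element type, and the sizes are all made up for illustration, and it assumes the copy constructor makes a deep copy with cudaMalloc(). The cudaFree() in question is the one in the destructor of the by-value copy created for the kernel launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical wrapper around a device allocation (illustration only).
struct DeviceBuffer {
    float* data;
    size_t n;

    explicit DeviceBuffer(size_t count) : data(0), n(count) {
        cudaMalloc((void**)&data, n * sizeof(float));
    }

    // Assumed deep copy: allocates a new device buffer and copies the contents.
    DeviceBuffer(const DeviceBuffer& other) : data(0), n(other.n) {
        cudaMalloc((void**)&data, n * sizeof(float));
        cudaMemcpy(data, other.data, n * sizeof(float), cudaMemcpyDeviceToDevice);
    }

    // Runs on the host; for the by-value kernel argument this is the
    // cudaFree() whose duration looks like the kernel's duration.
    ~DeviceBuffer() { cudaFree(data); }
};

// The object is passed by value, so the host builds a temporary copy
// for the launch and destroys it afterwards.
__global__ void scale(DeviceBuffer buf, float factor) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < buf.n) buf.data[i] *= factor;
}

int main() {
    const size_t n = 1 << 20;
    DeviceBuffer buf(n);
    cudaMemset(buf.data, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    scale<<<blocks, threads>>>(buf, 2.0f);  // copy constructor, launch, destructor

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));
    return 0;  // buf's own destructor calls cudaFree() here
}
```

In this sketch there are two cudaFree() calls per run: one for the temporary copy passed to the kernel and one for `buf` at the end of `main`. Only the first one shows the odd timing in the profiler.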