Hi all, I am working on some DL code that uses CUDA. I profiled it with Nsight Systems and found an unexpected Memcpy HtoD operation, but I can't find any explicit call in my code that would issue this memcpy.
How can I figure out what is causing it? Can you give me some ideas? Thanks a lot!
The profiler is usually a good starting point. An unknown memcpy could be the result of a library call. For example, most modern DL software stacks running on CUDA GPUs will be using the cuDNN library at some level. The cuDNN library (or any other GPU library) could be making a memcpy call under the hood as part of one of its library calls.
If you really want to isolate where it is coming from, you can use NVTX ranges to mark successively smaller regions of your code, until you have a range so small that you can identify which call the memcpy falls inside. Here is a simple example/starting point.
Thanks for your reply, but I'm working on CUDA-level code now.
I found that the memcpy happens when a CUDA kernel defined in the code is called. I think it might be caused by parameter passing, because there is no explicit memcpy call in the CUDA code.
Is that possible in CUDA? It seems odd to me because other kernel calls don't show this. Thanks a lot!
Typically I would say no. A kernel call does not usually trigger an H2D memcpy operation as reported by the profiler. Parameter passing associated with a kernel call is not visible as a memcpy operation.
There might be something like this if you were in a unified memory (UM) environment where the concurrentManagedAccess property is false, but that would be unusual in a typical DL software stack/setting. It would also be equally visible on other kernel calls, and the profiler would report it as a data migration operation rather than as a memcpy operation.
If you have the behavior narrowed down to a single kernel call, you should be able to fairly quickly use divide and conquer to identify what is going on. I’m afraid I probably won’t have further guesses for you.
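As a rough illustration of the divide-and-conquer idea, the sketch below bisects an ordered list of steps to find the first one whose inclusion makes the memcpy appear. It assumes the memcpy shows up in every run that includes the culprit step and that it does show up when all steps run. The function `first_triggering_step`, the step names, and the `triggers` predicate are hypothetical; in practice `triggers` would be you rerunning under the profiler with only that prefix of steps enabled and checking the timeline.

```python
def first_triggering_step(steps, triggers):
    """Return the index of the first step whose inclusion makes triggers() true.

    steps:    ordered list of step names
    triggers: callable taking a prefix of steps, returning True if the
              unexpected memcpy appears when only that prefix is run
    """
    lo, hi = 0, len(steps) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if triggers(steps[: mid + 1]):
            hi = mid       # culprit is at mid or earlier
        else:
            lo = mid + 1   # culprit is after mid
    return lo


# Stand-in predicate for illustration: pretend "custom_kernel" is the culprit.
steps = ["load_batch", "embed", "custom_kernel", "loss", "optimizer"]
culprit = first_triggering_step(steps, lambda prefix: "custom_kernel" in prefix)
print(steps[culprit])
```

With N steps this needs roughly log2(N) profiling runs instead of N, which is the main reason to bisect rather than check each step in turn.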
Thank you very much! Your reply was very helpful to me. I run the code in an NGC PyTorch Docker container, and I will investigate further.