A cudaMemcpy operation doesn’t have an inherent association to a kernel, and the bytes it moves are not allocated by cudaMemcpy. The difference is that one counter tracks bytes moved by memcpy operations, while the other tracks bytes read or written by the GPU itself. For example, when a CUDA kernel is reading and writing data by executing CUDA C++ device code, any resulting movement of data across the DRAM bus will be tracked by the DRAM metric, but not by that CUPTI memcpy counter.
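As a minimal sketch of the distinction (kernel and variable names are illustrative, error checking omitted): the two cudaMemcpy calls below would show up in the memcpy byte counters, while the loads and stores issued by the kernel’s device code are what the DRAM transaction metrics can capture, if those accesses actually reach DRAM.

```cuda
// Illustrative kernel: its reads of `in` and writes of `out` are device-side
// memory traffic, counted (when they miss the caches) by
// dram_read_transactions / dram_write_transactions, not by memcpy counters.
__global__ void scale(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

// Host side (hypothetical sizes/pointers):
//   cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);   // memcpy counter
//   scale<<<(n + 255) / 256, 256>>>(d_out, d_in, n);                     // DRAM metrics
//   cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost); // memcpy counter
```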
Thanks for the answer! So, for example, if I copy two vectors of 1000 floats to the GPU and bring one back using cudaMemcpy, I will move 1000 x 3 x 4 bytes = 12,000 bytes.
If instead, after moving that data, I allocate and use two other temporary vectors of 1000 floats on the GPU (which I will not copy back to main memory), will the data movement reported by (dram_write_transactions + dram_read_transactions) * 32 be 1000 x 5 x 4 bytes = 20,000 bytes?
I’m not sure what that means. “Allocate two other vectors” on the GPU? Normally I would expect allocations to be done with e.g. cudaMalloc.
I don’t use CUPTI much. You can ask detailed profiler questions on the forums dedicated to them.
This question may be of interest for general understanding of what the profiler metrics (e.g. dram_read_transactions) measure.
Thanks, I found the link really useful. I am also wondering if there is a command similar to `--track-memory-allocations` to see the memory allocated with cudaMalloc, but using ncu from the command line instead of nvprof. Any chance? Thanks
Probably best to ask that on the profiler forum. I would recommend checking Nsight Systems (nsys), not ncu.