I would like to profile the CPU and GPU utilization of a Python MPI application.
I need to measure at least:
- the initialization time
- the data transfer time from CPU to GPU
- the time needed to allocate memory on GPU
- the overlap between transfer and compute (if the computation starts before all the data has been sent to the GPU)
- the time needed to free memory at the end of the execution
- the memory used by the CPU and the GPU
From what I have read, nsys (NVIDIA Nsight Systems) seems to be the best tool, but I cannot extract these metrics with it.
Can you tell me which tool I should use?