Following the MIG guide and this tutorial on best practices for using MIG, I am testing on an A100 40GB SXM.
I am noticing some increase in latency per workload (Rodinia myocyte) when I run multiple workloads (batch size = 50) together on 7 MIG instances (each with GI profile 1g.5gb) on an A100 -
as compared to running the same workload either on the full GPU (7g.40gb) -
or just running a single workload (batch size = 1) on one of the 7 MIG slices as above (GI profile 1g.5gb) -
That is, the time per workload increases as shown when all 7 MIG slices are concurrently running a workload. How do I determine (Nsight Compute or Nsight Systems?) exactly what is causing this increase? As far as I understand, MIG distributes compute, memory, and bandwidth resources equally between MIG instances with the same profile. Thanks.
You can find tutorials on how to use Nsight Compute and Nsight Systems. The usual methodology is to start with Nsight Systems, get a timeline for the workload, then look for differences between the various cases. If you get down to observing that a specific kernel takes longer to run, then Nsight Compute would be the tool to study that.
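As a rough sketch of that methodology (the binary name `./myocyte`, its argument, and the kernel name are placeholders for your actual workload):

```shell
# Capture a system-wide timeline (CUDA API calls, kernels, memcpys)
# with Nsight Systems; run this once per test scenario.
nsys profile -o mig_7slice --trace=cuda,osrt ./myocyte 50

# Once a specific kernel stands out in the timeline, drill into it
# with Nsight Compute (replace kernel_name with the real kernel name).
ncu --kernel-name kernel_name --launch-count 1 ./myocyte 50
```

Comparing the resulting `.nsys-rep` timelines side by side in the Nsight Systems GUI is usually the fastest way to spot where the extra time goes.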
You may already have some “navigational” info in your print-outs. I’m not sure I understand your description perfectly, but it seems that maybe the first and last print-outs are most comparable. If that is the case, we see that the D->H copy, for example, is almost the same in terms of reported duration. But the “ALLOCATE CPU MEMORY AND GPU MEMORY” section takes about 7x longer in the first case vs. the last, and seems to account for nearly all of the difference in total duration reported for those two cases.
The kernel execution is also longer in the first case vs. the last, but do note that as a percentage of the total runtime or even the change in runtime between the two cases, it is only a tiny contributor.
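One way to quantify this split (assuming timelines were already captured with `nsys profile`; `report.nsys-rep` is a placeholder filename, and the report names vary slightly between nsys versions) is to compare the CUDA API and kernel summaries for the two runs:

```shell
# Summarize time spent in each CUDA API call (allocations such as
# cudaMalloc/cudaHostAlloc, memcpys, launches) for a captured report.
nsys stats --report cuda_api_sum report.nsys-rep

# Summarize per-kernel GPU execution time for the same report.
nsys stats --report cuda_gpu_kern_sum report.nsys-rep
```

Diffing these summaries between the 7-concurrent-workload run and the single-workload run should show whether the allocation calls or the kernels account for the extra time.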
Got it. I will start with Nsight Systems first then. Thanks!
As for my description -
- The first screenshot is when an A100 is divided into 7 MIG slices (1g.5gb) and all the slices are running the workload concurrently. Here, if I observe the runtime of the workload on any of the slices, the per-workload completion time is around 4.8 seconds.
- The second screenshot is when I am running the same workload but in the full GPU configuration - that is 7g.40gb. So the workload has the entire GPU to run on.
- The third screenshot is on the same configuration as the first - the A100 is divided into 7 MIG slices (each 1g.5gb) - but only 1 slice is running the workload, which completes in 3.65 seconds.
The only difference between scenarios 1 and 3 is that in the first test scenario there are 7 workloads running concurrently, while in the third only 1 workload is running. Since MIG slices are independent of each other (full hardware separation), I was wondering why running multiple workloads causes the runtime of each workload to increase (from 3.65 s to 4.8 s).
Is there any documentation describing the various metrics provided by Nsight Compute?
Thanks.
Nsight Compute documentation is here. There is a metrics guide here, which includes a metrics reference section at the end. There is a separate forum for Nsight Compute-specific questions here, and the Nsight Systems forum is here. There are also various blogs and YouTube videos available. Here is an Nsight Compute blog series, and here is an example of an Nsight Compute tutorial video.
Nsight Compute is mostly focused on kernel profiling. I would suggest starting with Nsight Systems unless the thing you want to explore is the difference in reported kernel execution time. Based on a Pareto view of your print-outs, that is not where I would suggest starting.
I’ll add that back in then, just in case.
Do you have documentation/tutorials for Nsight Systems similar to the ones you linked above for Nsight Compute? I’m checking this out currently, but was wondering if there are any other helpful presentations available.
Thank you. This should be enough for me to get started.