Yes, the network fully runs on the DLA. See my other thread for that whole process.
The other thing I forgot to mention is that when I run the program under nvprof, all of the NVTX ranges are observed as expected. It’s only when I profile remotely with Nsight Systems that things don’t work right.
At the moment, we’re not so concerned with profiling the DLA performance. We just want to verify that the GPU is idle while the network is running on DLA in hopes of pipelining the process.
The two reports have a completely different structure. When running on the GPU, the report has CPU, Threads, CUDA, and NVTX at the top level. When running on the DLA, it has CPU, Processes, and iGPU at the top level, and NVTX only shows up under Processes -> process name.
Also, when running on the DLA, the diagnostics summary has warnings about “Not all NVTX/CUDA events might have been collected” that don’t appear when running on the GPU.
Actually, looking again, it seems the run on the DLA stops after 100s whereas the GPU version stops at 200s. It’s like the profiler stops gathering events as soon as the DLA starts running? Or maybe running it under the profiler is causing the process to abort early?

