Can't use Nsight Systems 2020.3.1.72 (Windows): can't load profiles

With version 2020.3, whether with older or the newest drivers for Quadro GPUs, after profiling for 60 seconds Nsight Systems gets stuck trying to load the newly generated profile.

Doing the same thing (same server, same drivers, etc.) with version 2020.2 works, and the profile does not take long to open.

We are currently using version 2020.2, since it works quite smoothly. It's not blocking us, since we can stay on 2020.2, but it would be great if someone could take a look at the issue before we are forced onto a newer version of Nsight Systems to profile Ampere GPUs.

Windows 10 Version 1607 Build 14393.3866
CPU: AMD EPYC 7401P
64 GB of DDR4 RAM
3 GPUs, all of them Quadro RTX 4000

Thanks!

Sorry it took so long to get back to you. There was a performance regression in the 2020.3 release for specific kinds of runs, which I think you ran into.

I suggest you download and install 2020.4, which is available now via developer.nvidia.com.

Hi! We are having the same problems with the 2021 versions.

Is anyone else complaining?

Are you running via CLI or GUI?

Can you give me the command line/options you were using?

How big is the results file (.qdrep)?

We are using the GUI.

In the command line we simply call our software, a Windows executable.

When selecting "Start profiling manually", it does not generate a results file; it opens a window with this message:

ApplicationExitedBeforeProfilingStarted (1104) {
RuntimeError (120) {
OriginalExceptionClass: class boost::exception_detail::clone_impl
OriginalFile: D:\TC\20a3cfcd1c25021d\QuadD\Host\Analysis\Clients\AnalysisHelper\AnalysisStatus.cpp
OriginalLine: 80
OriginalFunction: class Nvidia::QuadD::Analysis::Data::AnalysisStatusInfo __cdecl QuadDAnalysis::AnalysisHelper::AnalysisStatus::MakeFromErrorString(enum Nvidia::QuadD::Analysis::Data::AnalysisStatus,enum Nvidia::QuadD::Analysis::Data::AnalysisErrorType::Type,const class std::basic_string<char,struct std::char_traits,class std::allocator > &,const class boost::intrusive_ptr &)
ErrorText: The target application exited before profiling started.
}
}

When not starting manually, it generates a .qdrep file, but it crashes before our software reaches the interesting part to be profiled. How can I share the .qdrep file?

The issue happens when our software starts allocating a lot of memory (both GPU and CPU memory).

It does not crash when not using Nsight, and it works with older Nsight versions, although those crash eventually too; with the older versions we simply have enough time to get some profiling done.

The system memory is 64 GB of RAM and we are using around 30 GB. The GPUs are RTX 4000 cards, we use 3 of them, and none of them goes higher than 6 GB of memory usage. That is, without Nsight Systems.

How big is the .qdrep file?

It is 29.4MB

You should be able to zip it up and upload it here. Or if that is a problem, you can email it to me (hwilper@nvidia.com)

Here you have:
Report 19.zip (22.6 MB)

Does your application spawn child processes to do the actual work? Do you have "trace process tree" turned on?

You can see the config we use in the picture.

And to be precise, since a process != a thread: no, we don't use many processes. We have a single process that creates many CPU threads. Several of these CPU threads do CUDA work on the same or different GPUs. Each thread works with only one GPU (unless there is some HtoD, DtoH, or peer DtoD transfer involved), since having threads switch CUDA contexts on Windows is expensive.

I don't see an option that says "process tree" anywhere in the interface… does it have another name?

Hi! Any news?

We have an NVIDIA driver crash reproducer that reproduced a bug in NVIDIA drivers prior to 461.09. Now we are modifying it so that it also reproduces another crash we are facing with drivers starting at 461.09 (bug 3221330).

The reproducer does not reproduce the second crash yet, but it does reproduce the issue with Nsight Systems 2021.2.1 mentioned in this post.

It's a very simple program, less than 200 lines of C++, with a CMake build, and in our case compiled with Visual Studio 2017.

When trying to profile this code, which balances the load across the GPUs, we hit the same issue as with our complete software.

Do you want me to share this code with you?

Thanks!

Yes, please share this with us.

@liuyis, can you take a look at this?

Here you have it.

In the end it is 4 files: 2 for a dummy kernel, 1 for the logger system, and the actual main .cpp file.

NSightSystems2021.2.1IssuesReproducer.zip (6.3 KB)

Thanks for sharing, I’ll take a look.

I played a bit with the static variables at the beginning of CUDAErrorReproducer.cpp, and if you lower the values enough, profiling may work and capture something.

The values I left there generate about 65% compute usage and about 32% PCIe usage on a Quadro RTX 4000. With this load, I'm already unable to profile with Nsight Systems 2021.2.1 and driver 461.40.

Thanks!

Same issue with:

Windows 10 Enterprise 21H1
Driver 471.11
NSightSystems 2021.2.4

A single NVIDIA RTX A4000

I can reproduce the crash successfully with the reproducer you shared. Just to confirm, this crash only happens when CUDA trace is enabled, right?

An initial investigation indicates the crash happens in the CUPTI library (which we rely on to get CUDA trace data). I will follow up with the CUPTI team.

Yes, it happens when trying to capture CUDA trace.

There is a bug opened for this issue. Are you working on it? Bug number: 200746470

If not, could you sync, or ask the CUPTI team to sync, with me and the people involved in that bug? I have access to the bug through the partners portal, so we can communicate there.

Thanks!