matrixMul crashes PC with Titan XP when using the nvprof --metrics all switch

Four weeks ago I described the original rebooting problem in another note to this forum. No one from NVIDIA responded (probably because I couldn’t make the code available). So, I took the time to find code in the NVIDIA SDK that DOES have the problem. I did this so that you could run tests.

You said in your response that you have no log to check, yet you haven’t asked me for a log file. What does your log file say?

No, I didn’t run cuda-memcheck. You have access to the same code that I do. Did you run cuda-memcheck on the SDK
matrix multiplication sample that I reported to you?
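For reference, running it on that sample is a one-liner (the executable name below is an assumption based on a default build of the CUDA 9.0 matrixMul sample; adjust the path for your install):

```shell
# Run the CUDA memory checker on the built matrixMul sample.
# The executable name/path is assumed from a default SDK build.
cuda-memcheck ./matrixMul
```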

Yes, it does block my work, and has for 4 weeks now. I already explained in another bug I posted to this forum that my app has the same problem.

I have no idea what you are saying or asking with this sentence …

– I think the current WAR is not using --metrics all , but only collect the metrics you want. –

Hi bz,

First of all, I am extremely sorry that you have to go through all this trouble.

So, the issue that we could reproduce on our end, on a GTX 1070, turned out to be due to TDR (Timeout Detection and Recovery). After increasing the TDR delay we no longer see that issue.

But since you are not driving the display through the Titan XP, it is unlikely that increasing TDR will solve your issue; still, you can try increasing the TDR delay to a very high value. You can refer to the following link for how to change it: https://docs.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
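For illustration, the TdrDelay value from that page can be set by importing a small .reg file (a sketch only; 3600 seconds is just an example value, and a reboot is needed afterwards):

```reg
Windows Registry Editor Version 5.00

; Illustrative only: TdrDelay is the number of seconds the GPU scheduler
; waits before declaring a timeout; 0xe10 = 3600 seconds.
; Reboot after importing for the change to take effect.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:00000e10
```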

Quoting your sentence
[i]I have no idea what you are saying or asking with this sentence …

– I think the current WAR is not using --metrics all , but only collect the metrics you want. --[/i]

We tried to reproduce the issue with the exact same setup, driver, GPU, and command, but we couldn’t reproduce it locally, so we are finding it difficult to make progress.

Can you tell us which specific metrics you are looking to profile? If the issue does not occur while profiling only those metrics, that would unblock you.
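As a sketch of that workaround, nvprof accepts a comma-separated list of metric names instead of all. The metric names below are just examples of standard nvprof metrics, not a recommendation, and the executable name is assumed:

```shell
# Collect a named subset of metrics rather than --metrics all.
# achieved_occupancy, gld_efficiency, and sm_efficiency are example
# nvprof metric names; substitute the ones you actually need.
nvprof --metrics achieved_occupancy,gld_efficiency,sm_efficiency ./matrixMul
```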

I am interested in all 113 performance counters for the Titan XP. That’s why I use the --metrics all switch.

If you review this thread from the beginning, you will see that I have the problem on each of 2 Titan XP boards.

If you look at my initial contact with you I described the configuration of my machine.

“I am running VS 2015 SP3, Win7/64 SP1, cuda 9.0, and dev driver 388.59.
I have a Quadro K620 and Titan XP in my machine.
I am using the matrixMul sample without modification.”

So, to be clear, you have set up a PC with Win7/64 SP1, CUDA 9.0, dev driver 388.59, a K620, and a Titan XP?

Why are you testing on a GTX 1070? A GTX 1070 isn’t a Titan XP.

Hi bz,

First of all, I have gone through the whole thread and am well aware of your configuration.

We set up the exact same configuration on our end to reproduce the issue.

Configuration: VS 2015 SP3, Win7/64 SP1, CUDA 9.0, and dev driver 388.59, with a K620 and a Titan XP

But we still couldn’t reproduce the issue on our end.

Have you tried increasing the TDR delay to a high value?

I apologize for misinterpreting your statement. I thought your earlier response indicated you ran your tests on the GTX 1070 and not the Titan XP.

As for the TDR experiment, I ran it 3 times. I added the TdrDelay value to the registry, then tried the values 60, 300, and 3600. As I understand it, the unit of time is seconds. In all 3 cases my machine rebooted after about 20 seconds of run time.

Was that the correct key? My registry already had the TdrLevel = 0 key.

–Bob

Bob,

It is unfortunate that you can still reproduce the issue even after increasing TDR. There is one more thing you can do to unblock yourself: can you profile your app on a Linux platform? That way you could profile all of the performance metrics. In the meantime, we will also keep trying to reproduce the issue on our end. As mentioned earlier, since we don’t have a local repro of your issue, we are finding it difficult to proceed further.

Bob,

Sorry to trouble you by asking you to try so many experiments. You can also try changing the Titan XP driver mode to TCC.

Here is a link that explains how to do this:
http://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/tesla_compute_cluster.htm

To change the TCC mode, use the NVIDIA SMI utility, located by default at C:\Program Files\NVIDIA Corporation\NVSMI. Use the following syntax to change the mode:

nvidia-smi -g {GPU_ID} -dm {0|1}
  0 = WDDM
  1 = TCC
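For example, to find the GPU_ID of the Titan XP and switch it (a sketch; the index 1 below is an assumption, so check the -L output first, and note that the change needs admin rights and a reboot to take effect):

```shell
# List GPUs with their indices to find the Titan XP's GPU_ID.
nvidia-smi -L
# Switch the GPU at index 1 (assumed) from WDDM to TCC mode.
nvidia-smi -g 1 -dm 1
```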

Hi, I don’t have a Linux machine to profile on.
I did change it to TCC mode. It still crashes.