HDMI flickers on demo kit when running a GPU stress test

bill_tu · April 15, 2020, 10:50am

Hi,
We are running CUDA on the Nano demo kit and find that the HDMI output flickers.

Our test steps:
1. Install phoronix-test-suite_9.4.
2. Open a terminal and run “export TOTAL_LOOP_TIME=99999999999999999999”.
3. Run “phoronix-test-suite stress-run pts/cuda-mini-nbody”.

Bill

carolyuu · April 16, 2020, 2:53am

Hi Bill,

I followed your steps to run the stress test, but got the errors below:

$ phoronix-test-suite stress-run pts/cuda-mini-nbody
STRESS-RUN ENVIRONMENT VARIABLES:
PTS_CONCURRENT_TEST_RUNS: Set the PTS_CONCURRENT_TEST_RUNS environment variable to specify how many tests should be run concurrently during the stress-run process. If not specified, defaults to 2.
TOTAL_LOOP_TIME set; running tests for 99999999999999999999 minutes
[PROBLEM] pts/cuda-mini-nbody-1.1.1 is not installed.

We don’t have experience using cuda-mini-nbody on Jetson platforms.
Please share how we can install it. Thanks!

HuiW · April 16, 2020, 8:05am

Hi,

$ phoronix-test-suite install pts/cuda-mini-nbody-1.1.1

works.

carolyuu · April 16, 2020, 9:49am

Thanks HuiW.

Hi Bill,

After running the command below, I was able to install cuda-mini-nbody-1.1.1.

$ phoronix-test-suite install pts/cuda-mini-nbody-1.1.1

Then I ran:

$ phoronix-test-suite stress-run pts/cuda-mini-nbody

STRESS-RUN ENVIRONMENT VARIABLES:

PTS_CONCURRENT_TEST_RUNS: Set the PTS_CONCURRENT_TEST_RUNS environment variable to specify how many tests should be run concurrently during the stress-run process. If not specified, defaults to 2.

TOTAL_LOOP_TIME set; running tests for 99999999999999999999 minutes

CUDA Mini-Nbody 2015-11-10:
pts/cuda-mini-nbody-1.1.1
Graphics Test Configuration
1: Original
2: SOA Data Layout
3: Flush Denormals To Zero
4: Cache Blocking
5: Loop Unrolling
6: Test All Options
** Multiple items can be selected, delimit by a comma. **

I selected item 6 (Test All Options) and ran it for about 1 hour, but did not see the HDMI flicker issue.
Did you enable max performance mode before running?
$ sudo nvpmodel -m 0
$ sudo jetson_clocks

bill_tu · April 17, 2020, 1:19am

Dear Carolyuu,

With options “1,2” selected, the HDMI output starts to flicker after running for 1~2 days.

Bill

WayneWWW · April 17, 2020, 6:21am

Hi,

Please share your tegrastats when you run this test.
Also, which release are you using? Do you use sdcard image or sdkmanager?
Can you reproduce this issue on more than 2 jetson nano modules?

bill_tu · April 17, 2020, 6:55am

Dear WayneWWW,

Please share your tegrastats when you run this test.
–> Test result updated.
Also, which release are you using?
–> We use JetPack 4.3 (R32.3.1).
Do you use sdcard image or sdkmanager?
–> We tested the SD card version on the demo kit and an eMMC Nano on our own carrier board; the same issue happens on both.
Can you reproduce this issue on more than 2 jetson nano modules?
–> We have tested three Nano modules and found the same issue.

WayneWWW · April 17, 2020, 7:01am

Hi bill_tu,

We can see the black line on your screenshot. But could you share the tegrastats as a file instead of a picture?
Please always share the log as a text file instead of a picture unless you cannot share it as text.

Also, does this always need to run 1~2 days to see the issue?
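
One way to capture tegrastats as text rather than a screenshot is sketched below. This is only a rough sketch: `stamp_lines` is a hypothetical helper name (not part of tegrastats or L4T), and the log file name is just an example. It prefixes each line with the local time so a sample can later be matched to the moment the flicker appears.

```shell
#!/bin/sh
# Hypothetical helper (not part of L4T): prefix each line read from stdin
# with the local date and time.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Example use on the Jetson (the file name is an assumption):
#   tegrastats | stamp_lines > tegrastats.log
```

The resulting tegrastats.log can be attached to a post as-is.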

carolyuu · April 20, 2020, 1:02am

Hi Bill,

We left a Nano B01 with r32.3.1 running over the weekend (2 days), but still can’t reproduce the flicker issue.

tegrastats:
RAM 2099/3956MB (lfb 105x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [35%@1479,100%@1479,0%@1479,0%@1479] EMC_FREQ 3%@1600 GR3D_FREQ 99%@921 APE 25 PLL@34C CPU@39.5C PMIC@100C GPU@36C AO@42.5C thermal@38C POM_5V_IN 6239/6311 POM_5V_GPU 3298/3299 POM_5V_CPU 1033/1066

RAM 2099/3956MB (lfb 105x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1479,100%@1479,0%@1479,0%@1479] EMC_FREQ 3%@1600 GR3D_FREQ 99%@921 APE 25 PLL@34C CPU@39.5C PMIC@100C GPU@36C AO@42.5C thermal@37.75C POM_5V_IN 6279/6309 POM_5V_GPU 3298/3299 POM_5V_CPU 1033/1064

I have set max performance mode before running.
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
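
For comparing runs between boards, the GPU load and temperatures in the tegrastats lines above can be pulled out with standard tools. A minimal sketch, assuming the field layout matches the samples in this post; `gr3d_load` and `sensor_temp` are hypothetical helper names, and the sample string is a shortened copy of the first line above.

```shell
#!/bin/sh
# Hypothetical helpers; they assume the tegrastats field layout shown above.

# Print the GPU (GR3D) load in percent from one tegrastats line ($1).
gr3d_load() {
    printf '%s\n' "$1" | sed -n 's/.*GR3D_FREQ \([0-9][0-9]*\)%.*/\1/p'
}

# Print the temperature in degrees C of the sensor named $1 from the line $2.
sensor_temp() {
    printf '%s\n' "$2" | sed -n "s/.* $1@\([0-9.][0-9.]*\)C.*/\1/p"
}

# Shortened sample taken from the first tegrastats line in this post.
sample='EMC_FREQ 3%@1600 GR3D_FREQ 99%@921 APE 25 PLL@34C CPU@39.5C PMIC@100C GPU@36C AO@42.5C thermal@38C'

gr3d_load "$sample"        # prints 99
sensor_temp GPU "$sample"  # prints 36
```

Running the same helpers over a whole log file, one line at a time, gives a quick view of whether load or temperature changes around the time the flicker starts.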

bill_tu · April 20, 2020, 1:34am

Hi Carol.

This week we ran the CUDE load again on the Nano demo kit (SD card version).
The flicker happened and the system hung (e.g. keyboard and mouse not working). We had to unplug the power adapter and plug it back in to restart the system.

Our test config:
1. use max performance mode.
2. use 25 temp .
3. run CUDE loading.
4. use the Nano demo kit, SD card version.

Bill

WayneWWW · April 20, 2020, 3:30am

Hi,

use 25 temp .

run CUDE loading.

What are 25 temp and CUDE?

bill_tu · April 20, 2020, 4:05am

Dear Wayne,

“25 temp” means room temperature (25 °C).
“CUDE loading” means installing phoronix-test-suite and keeping "phoronix-test-suite stress-run pts/cuda-mini-nbody" running.

Bill

WayneWWW · April 20, 2020, 4:10am

Hi Bill,

Yes, that is how we test.
We cannot reproduce your case.

How many devices have you tried? Does this issue happen only on a specific Nano?
And please do reply to my question above in #6.

HuiW · April 20, 2020, 9:18am

Hi WayneWWW,

I see the issue too.
I have uploaded the tegrastats and log output from my case (see attached).

The settings:
Nano B01 EVB board
JetPack 4.3 (R32.3.1)
flashed with sdkmanager
tested on two Nanos

Thank you,

display_start_stats.log (87.4 KB)
dispaly_error_3.log (128.4 KB)

WayneWWW · April 20, 2020, 9:28am

Hi HuiW,
Thanks for sharing the log. We are trying to reproduce this issue again.

According to the video file in your mail, you are using the terminal on screen, while our test was run over remote access with ssh. The tegrastats lines showing up on your screen cause more GPU rendering in your case than in ours. I guess that is why we didn’t reproduce the error.

To make this bug clearer: it is a rendering issue that occurs when the GPU is under a stress test (99% load).

WayneWWW · April 22, 2020, 2:56am

Hi HuiW,

This error log indicates the problem comes from the userspace tool rather than the nvgpu driver. Could you try another tool to run the GPU stress test?

[ 9378.548115] nvgpu: 57000000.gpu gk20a_fifo_tsg_unbind_channel_verify_status:2200 [ERR]  Channel 491 to be removed from TSG 12 has NEXT set!
[ 9378.560736] nvgpu: 57000000.gpu          gk20a_tsg_unbind_channel:164  [ERR]  Channel 491 unbind failed, tearing down TSG 12
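
To check whether another unit or another stress tool logs the same nvgpu errors, the kernel log can be filtered for them. A minimal sketch, assuming the message format matches the two lines above; `nvgpu_errors` is a hypothetical helper name, not an NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper: from a kernel log read on stdin, keep only the
# nvgpu lines that are tagged [ERR].
nvgpu_errors() {
    grep 'nvgpu:.*\[ERR\]'
}

# Example use on the Jetson:
#   dmesg | nvgpu_errors
#   dmesg | nvgpu_errors | grep -c 'tearing down TSG'   # count TSG teardowns
```

An empty result after a long run with a different stress tool would be consistent with the errors coming from the userspace side, though it would not prove it.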

bill_tu · April 22, 2020, 3:43am

Dear Wayne,

We also tried running a CPU load and the same issue happens.
CPU load commands:
1. export TOTAL_LOOP_TIME=99999999999999999999
2. phoronix-test-suite stress-run pts/stress-ng

Bill

WayneWWW · April 22, 2020, 4:58am

Our team is still checking the code of phoronix. Will update later.

HuiW · May 4, 2020, 3:33am

Hi WayneWWW,

Thank you for your great support.
Is there any update from your team?

Thanks

WayneWWW · May 4, 2020, 5:48am

Hi HuiW,

Sorry, we have some experimental patches, but they have not fixed this issue yet.
