Problem with incorrect inference results unless DVFS is disabled

I'm having an issue with my Jetson Nano devkit and was instructed to post here after submitting a support ticket (case # 211016-000454). The problem is that inference produces the wrong result unless Dynamic Voltage and Frequency Scaling (DVFS) is disabled by running jetson_clocks. I am using the official SD card image from NVIDIA, have wiped and reflashed the SD card multiple times, and verified the checksum after copying the image to the card to confirm the transfer was accurate and complete. When running the unmodified examples from the dusty-nv jetson-inference getting-started repo on GitHub, the ImageNet examples fail to classify properly unless DVFS is disabled. I have tried with and without updates installed, and with both a barrel jack power supply and a micro USB power supply. With the barrel jack supply, monitoring /sys/bus/i2c/drivers/ina3221x/6-0040/iio:device0/in_voltage0_input during the example ImageNet inference runs showed the voltage staying within range (min 4984 mV, max 5112 mV), so a power supply issue seems unlikely.
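For reference, here is roughly how I verified the image and sampled the input rail during the runs (just a sketch; the sysfs path is the one on my devkit and may differ on other carrier board or L4T revisions):

# verify the downloaded image before flashing
sha1sum jetson-nano-jp46-sd-card-image.zip

# sample the 5V input rail every 100 ms while an inference example runs
while true; do
    cat /sys/bus/i2c/drivers/ina3221x/6-0040/iio:device0/in_voltage0_input
    sleep 0.1
done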

If I manually disable DVFS with jetson_clocks, the ImageNet example will ALWAYS produce an accurate result with the expected image match probability. When DVFS is not disabled (the default setting), the result is nearly always wrong: the Python example normally fails with an exception about there being no inference results (see the GitHub issue below), while the C++ example simply writes no overlay at all to the sample image instead of writing the match probability onto it. The only time inference seems to work while DVFS is enabled is when other system activity has already pushed the clocks/voltage up enough to get lucky, though I don't have any data to back this up and didn't investigate further once I discovered that disabling DVFS was a workaround.
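For anyone wanting to confirm the clock state at each step, something like the following works (a sketch; the devfreq path is my assumption for the Nano and may vary by L4T version):

# show the current clock configuration (DVFS is enabled by default)
sudo jetson_clocks --show

# read the current GPU frequency directly; under DVFS this moves around,
# after running jetson_clocks it stays pinned at the maximum
cat /sys/devices/gpu.0/devfreq/57000000.gpu/cur_freq

# save the default settings, max the clocks, and restore the defaults later
sudo jetson_clocks --store ~/default.clocks
sudo jetson_clocks
sudo jetson_clocks --restore ~/default.clocks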

Can anyone suggest troubleshooting steps or ideas that could point to something other than a hardware defect in my Jetson module or devkit? I already posted an issue about this problem here: DVFS causing consistent repeatable classification failures. Expected behavior? · Issue #1240 · dusty-nv/jetson-inference · GitHub

Thanks so much for any help or insight you can provide, it is most appreciated. I am happy to provide any additional details that may be helpful.

Hi,

Can this issue be reproduced with the default jetson-inference example?
If yes, would you mind sharing the detailed steps for the working and non-working cases with us?

We would like to check this further to see whether it is a real issue.

Thanks.

Sure. I have attached complete, detailed logs showing everything from the initial headless setup, through the default imagenet example being run with the default clock settings and failing, to using jetson_clocks to disable DVFS and run at max speed. Once jetson_clocks has been run, inference works until the default clock settings are restored, at which point it completely fails again. The logs are split into parts, numbered in order in the filenames, to hopefully make it easier to read and to locate anything you may be looking for.
I will also include a shortened version in a code box below containing only what I thought was the relevant output/commands, to save you from needing to read through everything. If you want more detail about anything in the shortened version, you can refer to the full logs.

The main thing to notice is that classification fails until jetson_clocks is run; once it has been run, inference succeeds until the default clocks are restored. In many places I used an echo command in the shell to insert a note into the log, either about the output or about a command I was running.
Also of note: the full detailed logs contain X11-related errors such as [OpenGL] failed to open X11 server connection. This is because I ran everything headless. I can assure you that even with a monitor, keyboard, and mouse connected, the exact same inference problem occurs without the X11 error; that error only comes from the program being unable to show the output image it would normally display in a popup window, since it is running headless. Connecting a monitor was actually one of the first things I tried while troubleshooting, before I figured out the real issue. I just wanted to mention the X11 errors so that no time is wasted chasing them down or wondering why they're there.
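As an aside, I believe the samples also accept a --headless flag that suppresses the display output (and with it the X11 warning) entirely; I didn't rely on it here, but it would look something like:

./imagenet.py --headless images/orange_0.jpg images/test/output_0.jpg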

all commands run immediately after initial headless setup from nvidia sd card image jetson-nano-jp46-sd-card-image.zip
SHA1 hash of zip file 522EC5C8064E9AC8A2E151D2BA806638596FD282
After format with SD Memory Card Formatter 5.0.1 image was written and verified with balena etcher
========================================================

jetson@jetson:~$ git clone --recursive https://github.com/dusty-nv/jetson-inference
....
jetson@jetson:~$ cd jetson-inference/
jetson@jetson:~/jetson-inference$ docker/run.sh
....
[jetson-inference]  Models selected for download:  3 5 14 24 28 29 32 33 35 37 39 41
[jetson-inference]  Downloading GoogleNet...
[jetson-inference]  Downloading ResNet-18...
[jetson-inference]  Downloading SSD-Mobilenet-v2...
[jetson-inference]  Downloading MonoDepth-FCN-Mobilenet...
[jetson-inference]  Downloading Pose-ResNet18-Body...
[jetson-inference]  Downloading Pose-ResNet18-Hand...
[jetson-inference]  Downloading FCN-ResNet18-Cityscapes-512x256...
[jetson-inference]  Downloading FCN-ResNet18-Cityscapes-1024x512...
[jetson-inference]  Downloading FCN-ResNet18-DeepScene-576x320...
[jetson-inference]  Downloading FCN-ResNet18-MHP-512x320...
[jetson-inference]  Downloading FCN-ResNet18-Pascal-VOC-320x320...
[jetson-inference]  Downloading FCN-ResNet18-SUN-RGBD-512x400...
Downloading pytorch-ssd base model...
2b498531d852: Pull complete 
Digest: sha256:119db75c0ad42a7380f3dfef8c05c8cd74afa86777e01efb24f0167dec133377
Status: Downloaded newer image for dustynv/jetson-inference:r32.6.1


root@jetson:/jetson-inference# cd build/aarch64/bin
root@jetson:/jetson-inference/build/aarch64/bin# ./imagenet.py images/orange_0.jpg images/test/output_0.jpg 
....
[image]  loaded 'images/orange_0.jpg'  (1024x683, 3 channels)
Traceback (most recent call last):
  File "./imagenet.py", line 68, in <module>
    class_id, confidence = net.Classify(img)
Exception: jetson.inference -- imageNet.Classify() encountered an error classifying the image

root@jetson:/jetson-inference/build/aarch64/bin# exit
exit

jetson@jetson:~$ echo storing default clocks settings for restoration later
storing default clocks settings for restoration later
jetson@jetson:~$ sudo jetson_clocks --store ~/default.clocks
jetson@jetson:~$ cd jetson-inference/
jetson@jetson:~/jetson-inference$ docker/run.sh 
....
root@jetson:/jetson-inference# echo with default settings and DVFS enabled classification will fail
with default settings and DVFS enabled classification will fail
root@jetson:/jetson-inference# cd build/aarch64/bin/
root@jetson:/jetson-inference/build/aarch64/bin# ./imagenet.py images/orange_0.jpg images/test/output_0.jpg
....
[image]  loaded 'images/orange_0.jpg'  (1024x683, 3 channels)
Traceback (most recent call last):
  File "./imagenet.py", line 68, in <module>
    class_id, confidence = net.Classify(img)
Exception: jetson.inference -- imageNet.Classify() encountered an error classifying the image

root@jetson:/jetson-inference/build/aarch64/bin# echo no class shown and classification error
no class shown and classification error

root@jetson:/jetson-inference/build/aarch64/bin# echo c program will not show any error but also will be missing classification on output text - output image will have no probability overlayed on top of it
c program will not show any error but also will be missing classification on output text - output image will have no probability overlayed on top of it
root@jetson:/jetson-inference/build/aarch64/bin# ./imagenet images/orange_0.jpg images/test/output_0.jpg
....
[TRT]    device GPU, networks/bvlc_googlenet.caffemodel initialized.
[TRT]    imageNet -- loaded 1000 class info entries
[TRT]    imageNet -- networks/bvlc_googlenet.caffemodel initialized.
[image]  loaded 'images/orange_0.jpg'  (1024x683, 3 channels)
[image]  saved 'images/test/output_0.jpg'  (1024x683, 3 channels)
....
root@jetson:/jetson-inference/build/aarch64/bin# echo notice no classification shown in output
notice no classification shown in output

root@jetson:/jetson-inference/build/aarch64/bin# echo now to exit docker and disable DVFS with jetson_clocks
now to exit docker and disable DVFS with jetson_clocks
root@jetson:/jetson-inference/build/aarch64/bin# exit
exit
jetson@jetson:~/jetson-inference$ sudo jetson_clocks
jetson@jetson:~/jetson-inference$ docker/run.sh 
....
root@jetson:/jetson-inference# cd build/aarch64/bin/
root@jetson:/jetson-inference/build/aarch64/bin# ./imagenet.py images/orange_0.jpg images/test/output_0.jpg
....
[image]  loaded 'images/orange_0.jpg'  (1024x683, 3 channels)
class 0950 - 0.966797  (orange)
[image]  saved 'images/test/output_0.jpg'  (1024x683, 3 channels)
....
root@jetson:/jetson-inference/build/aarch64/bin# echo now that DVFS disabled item is properly classified as class 0950 - 0.966797 orange
now that DVFS disabled item is properly classified as class 0950 - 0.966797 orange

root@jetson:/jetson-inference/build/aarch64/bin# echo c program will also properly function
c program will also properly function
root@jetson:/jetson-inference/build/aarch64/bin# ./imagenet images/orange_0.jpg images/test/output_0.jpg
....
[image]  loaded 'images/orange_0.jpg'  (1024x683, 3 channels)
class 0950 - 0.966797  (orange)
imagenet:  96.67969% class #950 (orange)
[image]  saved 'images/test/output_0.jpg'  (1024x683, 3 channels)
....
root@jetson:/jetson-inference/build/aarch64/bin# echo as shown output confirms classification also working in c program class 0950 - 0.966797 orange
as shown output confirms classification also working in c program class 0950 - 0.966797 orange

root@jetson:/jetson-inference/build/aarch64/bin# echo now to exit and restore default clocks settings to show it will fail once again
now to exit and restore default clocks settings to show it will fail once again
root@jetson:/jetson-inference/build/aarch64/bin# exit
exit
jetson@jetson:~/jetson-inference$ sudo jetson_clocks --restore ~/default.clocks 
jetson@jetson:~/jetson-inference$ docker/run.sh 
....
root@jetson:/jetson-inference/build/aarch64/bin# ./imagenet.py images/orange_0.jpg images/test/output_0.jpg
....
[image]  loaded 'images/orange_0.jpg'  (1024x683, 3 channels)
Traceback (most recent call last):
  File "./imagenet.py", line 68, in <module>
    class_id, confidence = net.Classify(img)
Exception: jetson.inference -- imageNet.Classify() encountered an error classifying the image

root@jetson:/jetson-inference/build/aarch64/bin# echo as soon as default clocks restored with DVFS enabled classification fails once again on this jetson module
as soon as default clocks restored with DVFS enabled classification fails once again on this jetson module

0.initial_setup_headless.log (13.0 KB)
1.clone_git_repo.log (76.6 KB)
2.download_default_DNN_models.log (163.4 KB)
3.first_run_with_initial_network_optimization.log (977.3 KB)
4.imagenet_example_with_and_without_DVFS.log (43.3 KB)

Hi,

Thanks for sharing the details.
We are going to reproduce this issue in our environment first.

Will share more information with you later.

Thanks.

Hi,

We tested this issue in our environment on a Nano + JetPack 4.6.
Both cases work correctly without error.

Did we miss any setting for this issue?
Logs are attached for your reference.

inference_default-clocks.txt (7.5 KB)
inference_jetson-clocks.txt (7.5 KB)

Thanks.

No, you didn't miss anything; I'm not surprised you weren't able to reproduce it. That's why I suspected it must be a hardware issue, though I was hoping it wasn't.

Hi,

I’m going to discuss this issue internally.
Will share more suggestions with you later.

Thanks.

Hi,

Would you mind using another SD card to see if this issue still occurs?

Thanks.

Sure. I’m away until November 4th, but will test that and post results as soon as I get back.

Just tried it with a different SD card. I used a different 64GB SanDisk SDXC card formatted with the SD Association's SD card formatter, then burned and verified it with balena etcher on Linux. I used a different card reader, different computer, and different OS to try to eliminate those as possible factors, but unfortunately the result is still exactly the same. Even with the different SD card, classification always fails until DVFS is disabled using jetson_clocks. It works while DVFS is disabled, and as soon as the default settings are restored, whether by a reboot or via the save/restore feature of jetson_clocks, it fails to classify once again using the default Docker image and examples from GitHub.
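For the burn-and-verify step on Linux, etcher handles verification automatically; a hypothetical manual equivalent would look something like this sketch (the device name and extracted image name are placeholders and may differ by release):

# extract the raw image from the zip (inner filename may differ by release)
unzip jetson-nano-jp46-sd-card-image.zip

# write it to the card (replace /dev/sdX with the actual card device)
sudo dd if=sd-blob-b01.img of=/dev/sdX bs=4M conv=fsync status=progress

# read back and compare only the bytes that were written
sudo cmp -n $(stat -c%s sd-blob-b01.img) sd-blob-b01.img /dev/sdX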

I don't know if anything else is broken, but I can't even successfully run the first example from the getting-started-with-imagenet guide on the dusty-nv/jetson-inference GitHub unless I run jetson_clocks first. Any other ideas on what it could be, besides the devkit I received being faulty? That's what I have been guessing since I originally posted the issue on GitHub a little over a month ago, but I would love for it to really be something else so I don't have to deal with a warranty return and exchange, which I imagine could take a long time and be a pain.

Any other troubleshooting things I should do, or should I start a warranty claim for the defective jetson devkit?

Hi,

It seems that there is an issue specific to your device.
Would you mind starting the RMA process below?

Thanks.

My RMA replacement arrived today. For anyone who finds this thread later with the same or a similar problem, I wanted to post an update confirming the issue and resolution: it was indeed a hardware defect in my Jetson Nano devkit board. While I can't say for sure what the underlying problem was, I can confirm that with the RMA replacement, using the exact same setup as before, everything works properly without needing to disable DVFS. Thanks for the assistance, and hopefully this will be helpful to someone else experiencing the same problem.

Thanks for sharing the update.
Good to know it works now!
