Hi,
I am new to the NVIDIA Jetson Xavier, and I want to run a performance test to see if it fits my project. I tried running the CUDA 10.0 samples with both nvprof and nv-nsight-cu-cli, but neither of them returned anything related to dram_utilization. I have checked their --query-metrics output and found nothing. If anyone knows how to check anything related to DRAM, please let me know.
Thanks for your quick response! I have tried using both tools. Unfortunately, tegrastats only gives me the overall system status rather than stats for the specific app that I launched. On the other hand, Nsight Systems gives me a lot of info, but I did not find anything related to device memory. I have also checked nvprof, nvvp, and Nsight Compute; the device memory section is left as n/a. It does not make sense, because the model I load from device memory is over 50MB, while system_read_bytes in nvprof only shows 1.2MB. Did I miss anything?
PS: I hope you could also explain where this system_read_bytes value comes from.
Do you use any TensorRT or cuDNN related APIs?
A common cause is that loading the libraries takes some memory, which won’t be shown in the profiling tool.
Here is a related topic for checking the memory usage from the libraries:
Thanks for the update and I will take a look at the related topic right away!
Yes, I used the TensorRT APIs to run inference. I parsed a model, created a TensorRT engine, and then created an execution context. I did another test using the command-line version of Nsight Systems and imported the report on my desktop, which gave me more information than the GUI version; here is what I’ve got.
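For reference, my setup is roughly like the following simplified sketch (error handling omitted; the file names, sizes, and gLogger are placeholders, and my real code differs in the details):

#include <cstdio>
#include <NvInfer.h>
#include <NvCaffeParser.h>

using namespace nvinfer1;
using namespace nvcaffeparser1;

// Minimal logger the TensorRT builder requires (placeholder implementation).
class Logger : public ILogger {
    void log(Severity severity, const char* msg) override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
} gLogger;

int main() {
    // Parse the Caffe model and build the engine (paths and sizes are just examples).
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    ICaffeParser* parser = createCaffeParser();
    const IBlobNameToTensor* blobs =
        parser->parse("resnet50.prototxt", "resnet50.caffemodel", *network, DataType::kFLOAT);
    network->markOutput(*blobs->find("fc1000"));
    builder->setMaxBatchSize(8);
    builder->setMaxWorkspaceSize(1 << 28);
    ICudaEngine* engine = builder->buildCudaEngine(*network);

    // Create the execution context used later for enqueue()/inference.
    IExecutionContext* context = engine->createExecutionContext();

    // ... allocate input/output device buffers, enqueue(), synchronize, clean up ...
    return 0;
}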
Thanks for the quick response!
Sure, I have prepared a simple test using a ResNet-50 model and trtexec from the TensorRT samples folder. The difference between my code and this simple test is in how the inference is synchronized: my code uses cudaStreamSynchronize, while this one uses cudaEventSynchronize. But they both have a similar issue: cudaEventSynchronize also takes a lot of time, and in the profiling tools (Nsight Systems, nvprof) I could not see any data transfers during that time period.
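To show what I mean, here is a simplified sketch of the two synchronization patterns (the context, bindings, stream, and event are assumed to be set up already; names are placeholders):

#include <NvInfer.h>
#include <cuda_runtime_api.h>

// My code: synchronize on the stream after the asynchronous enqueue.
void inferWithStreamSync(nvinfer1::IExecutionContext* context, void** bindings,
                         int batchSize, cudaStream_t stream) {
    context->enqueue(batchSize, bindings, stream, nullptr);
    cudaStreamSynchronize(stream);          // blocks until all work on the stream finishes
}

// trtexec: record an event after the enqueue and wait on that event instead.
void inferWithEventSync(nvinfer1::IExecutionContext* context, void** bindings,
                        int batchSize, cudaStream_t stream, cudaEvent_t endEvent) {
    context->enqueue(batchSize, bindings, stream, nullptr);
    cudaEventRecord(endEvent, stream);
    cudaEventSynchronize(endEvent);         // blocks until the recorded event completes
}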
Here is the command line I used to execute:
./trtexec --avgRuns=100 --deploy=resnet50.prototxt --int8 --batch=8 --iterations=1000 --output=fc1000 --useDLACore=0 --useSpinWait --allowGPUFallback
I have uploaded the model file that I used to this link: https://drive.google.com/open?id=1ohTe5C2JZIR7tJbqwf5uSxJm2flm1Ecp
It seems like there is something wrong with my Google Drive. I uploaded it to my GitHub instead, and you can download it from the link below. Please let me know if you need more info.
We found it is possible for cudaStreamSynchronize to take a long time, as you reported.
For example:
if the CPU calls cudaStreamSynchronize right after launching a TensorRT job with enqueue(…), it has to wait for the output from TensorRT.
So when a model takes longer to run on TensorRT, the cudaStreamSynchronize time also becomes larger.
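In other words, the wall-clock time spent in the synchronize call mostly reflects the inference itself. A rough sketch of where the time goes (timing code for illustration only; context, bindings, and stream assumed set up):

#include <chrono>
#include <cstdio>
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// enqueue() only launches the work; the synchronize call is where the CPU waits,
// so its duration tracks the model's execution time on the device.
void timedInference(nvinfer1::IExecutionContext* context, void** bindings,
                    int batchSize, cudaStream_t stream) {
    auto t0 = std::chrono::steady_clock::now();
    context->enqueue(batchSize, bindings, stream, nullptr);   // returns almost immediately
    auto t1 = std::chrono::steady_clock::now();

    cudaStreamSynchronize(stream);                            // blocks until the output is ready
    auto t2 = std::chrono::steady_clock::now();

    double launchMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
    double syncMs   = std::chrono::duration<double, std::milli>(t2 - t1).count();
    std::printf("enqueue: %.3f ms, synchronize: %.3f ms\n", launchMs, syncMs);
}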
We tested your sample with trtexec, and the average execution time with batch size = 8 is 20 ms.
You can also try it with trtexec to see the end-to-end latency first.
Thanks for your reply! I have tested trtexec with batch size = 1 and I agree with what you said; the average execution time is lower. Now I just want to ask one more question: is it possible to see the CUDA kernels executed by TensorRT, for example, using nvprof?
It seems like nvprof can detect more CUDA kernels when the program is running on the GPU. For the DLA case, there is just a gap and I don’t see any related kernels. Could you explain how to see those CUDA kernels for the DLA case?
The test command line for DLA is:
./trtexec --avgRuns=1 --deploy=resnet50.prototxt --int8 --batch=1 --iterations=1 --output=fc1000 --useDLACore=1 --useSpinWait --allowGPUFallback
and without DLA is:
./trtexec --avgRuns=1 --deploy=resnet50.prototxt --int8 --batch=1 --iterations=1 --output=fc1000 --useSpinWait --allowGPUFallback
In the DLA pipeline, you can still capture some kernel activity, since DLA uses the GPU for some data conversions.
However, the inference job performed on the DLA won’t show on the timeline.
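Roughly, enabling DLA with GPU fallback through the builder looks like the sketch below (a sketch only; the layers that fall back to the GPU, plus the data reformatting, are what nvprof can still see as CUDA kernels):

#include <NvInfer.h>

// Ask the builder to place layers on the DLA, falling back to the GPU for
// unsupported layers; those fallback layers are what show up as CUDA kernels.
void enableDLA(nvinfer1::IBuilder* builder, int dlaCore) {
    builder->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    builder->setDLACore(dlaCore);
    builder->allowGPUFallback(true);   // same effect as --allowGPUFallback in trtexec
    builder->setInt8Mode(true);        // DLA needs FP16 or INT8; INT8 additionally needs
                                       // calibration scales (trtexec --int8 uses dummy scales)
}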