Jetson AGX Xavier DDR Test

Hi,
I am new to the NVIDIA Jetson Xavier, and I want to run a performance test to see if it fits my project. I tried running the CUDA 10.0 samples with both nvprof and nv-nsight-cu-cli, but neither of them returned anything related to dram_utilization. I have also checked their --query-metrics output and found nothing. If anyone knows how to check anything related to DRAM, please let me know.

Thanks
Sam

Hi,

Here are two recommended tools for memory profiling.

1. You can check the real-time system status, including memory usage, with tegrastats.

$ sudo tegrastats

2. Nsight Systems can give you a memory bandwidth report.

Thanks.

Hi AastaLLL,

Thanks for your quick response! I have tried using both tools. Unfortunately, tegrastats only gives me the overall status rather than the specific app that I launched. On the other hand, Nsight Systems gives me a lot of info, but I did not find anything related to device memory. I have also checked nvprof, nvvp, and Nsight Compute; the device memory section is left as n/a. This does not make sense, because the model I load from device memory is over 50 MB, while system_read_bytes in nvprof only shows 1.2 MB. Did I miss anything?

PS: I hope you could also explain where this system_read_bytes value comes from.

Thanks again for your response!
Sam

Hi,

Sorry for the late update.

Do you use TensorRT or cuDNN related APIs?
A common cause is that loading the libraries takes some memory, which won't be shown in the profiling tools.
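If you want to check this yourself, one simple way is to read the free device memory before and after the library is initialized, for example with cudaMemGetInfo(). Below is a minimal sketch using only the CUDA runtime; the cudaMalloc is just a stand-in so the difference between the two readings is visible, and in a real test you would create the TensorRT builder/runtime or call cudnnCreate() at that point. On Xavier the GPU shares system memory, so the numbers reflect the shared pool.

#include <cuda_runtime.h>
#include <cstdio>

// Print the free/total device memory reported by the CUDA runtime.
static void printMemInfo(const char* tag)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("%s: free %.1f MB / total %.1f MB\n",
           tag, freeBytes / 1048576.0, totalBytes / 1048576.0);
}

int main()
{
    printMemInfo("before");

    // Stand-in for initializing TensorRT/cuDNN: allocate 64 MB so the
    // difference between the two readings is visible.
    void* p = nullptr;
    cudaMalloc(&p, 64u << 20);

    printMemInfo("after");

    cudaFree(p);
    return 0;
}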

Here is a related topic for checking the memory usage from the libraries:

Thanks.

Hi AastaLLL,

Thanks for the update, and I will take a look at the related topic right away!
Yes, I used the TensorRT APIs for inference. I parsed a model and created a TensorRT engine, and after that I created an execution context. I did another test using the command-line version of Nsight Systems and imported the report on my desktop, which gave me more information than the GUI version. Here is what I've got:


I was wondering whether any data input and output happens during the cudaStreamSynchronize part. I hope you could explain this a bit more for me.
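For reference, the rough shape of my inference path is sketched below (simplified, assuming the TensorRT 5.x Caffe-parser API that ships with CUDA 10.0; the blob names, batch size, and buffer sizes are placeholders rather than my exact code):

#include "NvInfer.h"
#include "NvCaffeParser.h"
#include <cuda_runtime.h>
#include <iostream>

using namespace nvinfer1;

// Minimal logger required by the TensorRT builder.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // Parse the Caffe model and build an engine (TensorRT 5 style API).
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    auto parser = nvcaffeparser1::createCaffeParser();
    auto blobs = parser->parse("resnet50.prototxt", "resnet50.caffemodel",
                               *network, DataType::kFLOAT);
    network->markOutput(*blobs->find("fc1000"));

    builder->setMaxBatchSize(8);
    builder->setMaxWorkspaceSize(1 << 28);
    ICudaEngine* engine = builder->buildCudaEngine(*network);
    IExecutionContext* context = engine->createExecutionContext();

    // Device buffers for the input/output bindings ("data"/"fc1000" and the
    // sizes are placeholders for ResNet-50 with batch size 8).
    void* buffers[2] = {nullptr, nullptr};
    cudaMalloc(&buffers[engine->getBindingIndex("data")],   8 * 3 * 224 * 224 * sizeof(float));
    cudaMalloc(&buffers[engine->getBindingIndex("fc1000")], 8 * 1000 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // enqueue() only queues the work and returns right away;
    // cudaStreamSynchronize() is where the CPU waits for the result.
    context->enqueue(8, buffers, stream, nullptr);
    cudaStreamSynchronize(stream);

    // Cleanup (TensorRT 5 objects are released with destroy()).
    cudaStreamDestroy(stream);
    cudaFree(buffers[0]);
    cudaFree(buffers[1]);
    context->destroy();
    engine->destroy();
    parser->destroy();
    network->destroy();
    builder->destroy();
    return 0;
}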

Thanks!

Hi,

It looks like cudaStreamSynchronize takes a really long time.

To give further suggestions, would you mind sharing a simple reproducible source and the model with us?
Thanks.

Hi AastaLLL,

Thanks for the quick response!
Sure, I have prepared a simple test using the ResNet-50 model and trtexec under the TensorRT samples folder. The difference between my code and this simple test is in the synchronization: I execute the inference and then wait with cudaStreamSynchronize, while this one uses cudaEventSynchronize. But they both show a similar issue: cudaEventSynchronize also takes a lot of time, and through the profiling tools (Nsight Systems, nvprof) I could not see any data transfers during that time period.
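To spell out the difference between the two synchronization patterns, here is a generic CUDA sketch (a dummy kernel stands in for the work TensorRT enqueues on the stream; this is not my actual code, nor trtexec's):

#include <cuda_runtime.h>

// Dummy kernel standing in for the work that TensorRT enqueues on the stream.
__global__ void busyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float v = data[i];
        for (int k = 0; k < 10000; ++k)
            v = v * 1.000001f + 0.5f;
        data[i] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pattern 1 (my code): wait on the stream itself.
    busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaStreamSynchronize(stream);   // blocks until everything queued on the stream has finished

    // Pattern 2 (trtexec): record an event after the work and wait on the event.
    cudaEvent_t done;
    cudaEventCreate(&done);
    busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaEventRecord(done, stream);
    cudaEventSynchronize(done);      // blocks until the recorded event has completed
    cudaEventDestroy(done);

    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}

In both cases the CPU just waits for the queued work to finish; the wait itself does not imply any extra data transfer.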

Here is the command line I used to execute:
./trtexec --avgRuns=100 --deploy=resnet50.prototxt --int8 --batch=8 --iterations=1000 --output=fc1000 --useDLACore=0 --useSpinWait --allowGPUFallback
I have uploaded the model file that I used to this link: https://drive.google.com/open?id=1ohTe5C2JZIR7tJbqwf5uSxJm2flm1Ecp

Thanks for your time!

Hi,

Just want to confirm the environment setting first.
Have you maximized the device performance with the following commands (the order matters):

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

Yes, I have maximized the device performance before profiling.

Hi,

I tried to access your model but don't have access permission.
Could you help to enable it?

Thanks.

Hello,

It seems like there is something wrong with my Google Drive. I have uploaded it to my GitHub, and you can download it from the link below. Please let me know if you need more info.

Thanks.

Hi,

We found that it is possible for cudaStreamSynchronize to take a long time, as you reported.

For example, if the CPU calls cudaStreamSynchronize right after launching a TensorRT job with enqueue(…), it must wait for the output from TensorRT.
So when a model takes longer on TensorRT, the cudaStreamSynchronize time also becomes larger.

We tested your sample with trtexec, and the average execution time with batch size 8 is 20 ms.
You can also try it with trtexec to see the end-to-end latency first.
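As a rough illustration of why the synchronize call shows up as the long part, the sketch below times the asynchronous launch calls separately from cudaStreamSynchronize (generic CUDA only; cudaMemsetAsync stands in for enqueue()). The launch loop returns quickly, while the synchronize call absorbs the time of the queued GPU work:

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    const size_t bytes = 256u << 20;            // 256 MB working buffer
    void* d = nullptr;
    cudaMalloc(&d, bytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    using clock = std::chrono::steady_clock;

    // Queue a chunk of asynchronous GPU work (stand-in for enqueue()).
    auto t0 = clock::now();
    for (int i = 0; i < 100; ++i)
        cudaMemsetAsync(d, 0, bytes, stream);
    auto t1 = clock::now();

    // Wait until all of it has finished.
    cudaStreamSynchronize(stream);
    auto t2 = clock::now();

    std::chrono::duration<double, std::milli> launchMs = t1 - t0;
    std::chrono::duration<double, std::milli> syncMs = t2 - t1;
    printf("launch calls: %.3f ms, cudaStreamSynchronize: %.3f ms\n",
           launchMs.count(), syncMs.count());

    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}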

Thanks.

Hi,

Thanks for your reply! I have tested trtexec with batch size 1, and I agree with what you said: the average execution time is lower. Now I just want to ask one more question: is it possible to see the CUDA kernels executed during TensorRT inference, for example using nvprof?

Thanks.

Hi,

Yes.
You can use nvvp, which is located at /usr/local/cuda-10.0/bin/nvvp, and profile the app remotely.
nvvp profiles the app with nvprof.

Thanks.

Hi,

I have managed to use nvvp and profile trtexec remotely. I tested both with DLA and without DLA (on the GPU). The first figure shows the result with DLA:


The next figure shows the result without DLA (on GPU):

It seems like nvprof can detect more CUDA kernels when the program is running on the GPU. In the DLA case, there is just a gap and I don't see any related kernels. Could you explain how to see those kernels for the DLA case?
The test command line for DLA is:
./trtexec --avgRuns=1 --deploy=resnet50.prototxt --int8 --batch=1 --iterations=1 --output=fc1000 --useDLACore=1 --useSpinWait --allowGPUFallback
and without DLA is:
./trtexec --avgRuns=1 --deploy=resnet50.prototxt --int8 --batch=1 --iterations=1 --output=fc1000 --useSpinWait --allowGPUFallback

Thanks for your time.

Hi,

Sorry that our profiler doesn’t support DLA yet.

In the DLA pipeline, you can still capture some kernel usage, since DLA uses the GPU for some data conversion.
However, the inference job performed on the DLA won't show up on the timeline.

Thanks.