[TX1 L4T24.2.1 -> L4T28.1 Porting] Huge performance degradation in CUDA

Hi All,

I am seeing a huge performance degradation in CUDA (especially in data transfers between host and device) after switching from L4T24.2.1 to L4T28.1.

To confirm this, I ran the bandwidthTest sample provided with the CUDA samples.

The bandwidth differences are shown below.

24.2.1

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA Tegra X1
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			9935.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			9878.4

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			19315.8

Result = PASS

28.1

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA Tegra X1
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			909.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			921.9

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			1758.3

Result = PASS

These values were taken after running jetson_clocks.sh (the difference is similar without running it).
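
To put a number on the difference, here is a minimal Python sketch that parses the two quick-mode outputs above and computes the slowdown per direction. The function and variable names are my own for illustration and are not part of the CUDA samples; the embedded text is trimmed from the runs above.

```python
import re

def parse_bandwidth(text):
    """Return {direction: MB/s} from bandwidthTest quick-mode output."""
    results = {}
    section = None
    for line in text.splitlines():
        # Section header, e.g. " Host to Device Bandwidth, 1 Device(s)"
        header = re.match(r"\s*(.+?) Bandwidth, \d+ Device\(s\)", line)
        if header:
            section = header.group(1)
            continue
        # Data row: transfer size in bytes, then bandwidth in MB/s
        row = re.match(r"\s*(\d+)\s+([\d.]+)\s*$", line)
        if row and section:
            results[section] = float(row.group(2))
    return results

# Trimmed copies of the two runs posted above.
R24_OUTPUT = """
 Host to Device Bandwidth, 1 Device(s)
   33554432      9935.0
 Device to Host Bandwidth, 1 Device(s)
   33554432      9878.4
 Device to Device Bandwidth, 1 Device(s)
   33554432      19315.8
"""

R28_OUTPUT = """
 Host to Device Bandwidth, 1 Device(s)
   33554432      909.4
 Device to Host Bandwidth, 1 Device(s)
   33554432      921.9
 Device to Device Bandwidth, 1 Device(s)
   33554432      1758.3
"""

r24 = parse_bandwidth(R24_OUTPUT)
r28 = parse_bandwidth(R28_OUTPUT)

# Slowdown factor = old bandwidth / new bandwidth
ratios = {name: r24[name] / r28[name] for name in r24}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x slower on 28.1")
```

With the figures above, all three directions come out roughly 11x slower on 28.1.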

Any thoughts on what could be the root cause? Could it be related to L4T28.1 or CUDA 8.0?

Regards,
Rejeesh

Hi,

Thanks for your question.
We will check this issue and get back to you with an update.

Hi,

Here are our results for the bandwidthTest:

R24.2.1 Bandwidth(MB/s): 9561.1 ~ 19579.6
R28.1 Bandwidth(MB/s): 9889.9 ~ 19233.6

We did not find any degradation. Could you recheck on your side?

Thanks.

Hi,

Thank you for the tests.

So, could it have something to do with the test environment? I run my tests from a class 10 16GB SD card. The Jetson is mounted on an Auvidea J100 board.

Or is there anything to take care of during the 28.1 setup, such as the flashing process, kernel build configuration, rootfs setup, or dtb configuration?

Regards,
Rejeesh

Hi,

We flashed the TX1 with JetPack 3.1 and ran the following command to boost the device:

sudo ~/jetson_clocks.sh

Thanks.

Hi,

Are you running the CUDA benchmark sample from an SD card?

My TX1 is also flashed with JetPack 3.1, and jetson_clocks.sh has been executed.

Regards,
Rejeesh

Hi,

Issue resolved. The root cause was the FDT entry I had added to the extlinux.conf file on the SD card.

Flashing the dtb to eMMC instead of using the FDT entry fixed the issue.
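
In case it helps anyone checking their own setup, here is a minimal Python sketch that lists the FDT lines in an extlinux.conf. The sample file contents and the dtb path are hypothetical placeholders, not my actual configuration, and the function name is only for illustration.

```python
def find_fdt_entries(conf_text):
    """Return (label, dtb_path) pairs for every FDT line in an extlinux.conf."""
    entries = []
    label = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        keyword = parts[0].upper()
        value = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "LABEL":
            label = value  # start of a new boot entry
        elif keyword == "FDT":
            entries.append((label, value))
    return entries

# Hypothetical example file; the dtb path is a placeholder.
SAMPLE = """\
TIMEOUT 30
DEFAULT primary

MENU TITLE SD card boot options

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      FDT /boot/example.dtb
      APPEND ${cbootargs} root=/dev/mmcblk1p1 rw rootwait
"""

for entry_label, dtb in find_fdt_entries(SAMPLE):
    print(f"boot entry '{entry_label}' overrides the device tree with {dtb}")
```

To check a real board, read the file from the boot medium (on L4T it is normally /boot/extlinux/extlinux.conf, but verify the path on your system) and pass its text to find_fdt_entries. An empty result means no entry sets a device tree through FDT.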

Regards,
Rejeesh