CUDA 7.0 Jetson TX1 performance and benchmarks

Has anyone run benchmarks on the TX1? I got a glmark2 score of 818 on my Shield TV.

simpleMultiCopy produced poorer performance than on the TK1:

[simpleMultiCopy] - Starting...

Using CUDA device [0]: GM20B
[GM20B] has 2 MP(s) x 128 (Cores/MP) = 256 (Cores)
Device name: GM20B
CUDA Capability 5.3 hardware with 2 multi-processors
scale_factor = 1.00
array_size = 4194304

Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000))

Measured timings (throughput):
Memcpy host to device : 15.620518 ms (1.074050 GB/s)
Memcpy device to host : 3.952524 ms (4.244684 GB/s)
Kernel : 5.953629 ms (28.179814 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 25.526670 ms
Compute can overlap with one transfer: 19.573042 ms
Compute can overlap with both data transfers: 15.620518 ms

Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 9.440632 ms
Avg. time when overlapped using 4 streams : 5.101471 ms
Avg. speedup gained (serialized - overlapped) : 4.339161 ms

Measured throughput:
Fully serialized execution : 3.554257 GB/s
Overlapped using 4 streams : 6.577403 GB/s
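
As a sanity check on those numbers: simpleMultiCopy transfers array_size = 4194304 four-byte ints per copy (16 MB), so each bandwidth figure follows directly from its measured time. A quick awk sketch using the host-to-device row above:

```shell
# simpleMultiCopy moves array_size = 4194304 ints * 4 bytes = 16 MB per copy.
# Bandwidth in GB/s = bytes / (time_ms / 1000) / 1e9.
awk 'BEGIN {
    bytes = 4194304 * 4          # 16777216 bytes per transfer
    ms    = 15.620518            # measured host-to-device time above
    printf "H2D: %.6f GB/s\n", bytes / (ms / 1000) / 1e9
}'
```

That prints 1.074050 GB/s, matching the figure the sample reports.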

The following results were from Tegra K1 (Chromebook CB5):

[simpleMultiCopy] - Starting...
modprobe: FATAL: Module nvidia not found.

Using CUDA device [0]: GK20A
[GK20A] has 1 MP(s) x 192 (Cores/MP) = 192 (Cores)
Device name: GK20A
CUDA Capability 3.2 hardware with 1 multi-processors
scale_factor = 1.00
array_size = 4194304

Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
( ) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000))

Measured timings (throughput):
Memcpy host to device : 1.233408 ms (13.602325 GB/s)
Memcpy device to host : 1.231520 ms (13.623177 GB/s)
Kernel : 2.142368 ms (78.311548 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 4.607296 ms
Compute can overlap with one transfer: 2.464928 ms
Compute can overlap with both data transfers: 2.142368 ms

Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 5.033206 ms
Avg. time when overlapped using 4 streams : 4.325859 ms
Avg. speedup gained (serialized - overlapped) : 0.707348 ms

Measured throughput:
Fully serialized execution : 6.666611 GB/s
Overlapped using 4 streams : 7.756709 GB/s

Can you try running the "max perf script" listed below on the TX1 before benchmarking?

#!/bin/sh

# turn on fan for safety
echo "Enabling fan for safety..."
if [ ! -w /sys/kernel/debug/tegra_fan/target_pwm ] ; then
	echo "Cannot set fan -- exiting..."
	exit 1
fi
echo 255 > /sys/kernel/debug/tegra_fan/target_pwm

echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable
echo 1 > /sys/kernel/cluster/immediate
echo 1 > /sys/kernel/cluster/force
echo G > /sys/kernel/cluster/active
echo "Cluster: `cat /sys/kernel/cluster/active`"

# online all CPUs - ignore errors for already-online units
echo "onlining CPUs: ignore errors..."
for i in 0 1 2 3 ; do
	echo 1 > /sys/devices/system/cpu/cpu${i}/online
done
echo "Online CPUs: `cat /sys/devices/system/cpu/online`"

# set CPUs to max freq (perf governor not enabled on L4T yet)
echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cpumax=`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies | awk '{print $NF}'`
echo "${cpumax}" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
for i in 0 1 2 3 ; do
	echo "CPU${i}: `cat /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_cur_freq`"
done

# max GPU clock (should read from debugfs)
cat /sys/kernel/debug/clock/gbus/max > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state
echo "GPU: `cat /sys/kernel/debug/clock/gbus/rate`"

# max EMC clock (should read from debugfs)
cat /sys/kernel/debug/clock/emc/max > /sys/kernel/debug/clock/override.emc/rate
echo 1 > /sys/kernel/debug/clock/override.emc/state
echo "EMC: `cat /sys/kernel/debug/clock/emc/rate`"

Also please see these posts related to the script:

https://devtalk.nvidia.com/default/topic/894945/jetson-embedded-systems/jetson-tx1/post/4740508/#4740508
https://devtalk.nvidia.com/default/topic/894945/jetson-embedded-systems/jetson-tx1/post/4737266/#4737266

Thanks!

The fan did turn on and I got a significant performance improvement:

Measured timings (throughput):
Memcpy host to device : 1.670526 ms (10.043074 GB/s)
Memcpy device to host : 1.709841 ms (9.812150 GB/s)
Kernel : 2.699841 ms (62.141496 GB/s)

versus before running the script:

Measured timings (throughput):
Memcpy host to device : 15.620518 ms (1.074050 GB/s)
Memcpy device to host : 3.952524 ms (4.244684 GB/s)
Kernel : 5.953629 ms (28.179814 GB/s)

but I got some errors when running the script:

ubuntu@tegra-ubuntu:~/x1$ sudo ./maxPerf.sh
Enabling fan for safety...
./maxPerf.sh: 11: ./maxPerf.sh: cannot create /sys/kernel/cluster/immediate: Directory nonexistent
./maxPerf.sh: 12: ./maxPerf.sh: cannot create /sys/kernel/cluster/force: Directory nonexistent
./maxPerf.sh: 13: ./maxPerf.sh: cannot create /sys/kernel/cluster/active: Directory nonexistent
cat: /sys/kernel/cluster/active: No such file or directory
Cluster:
onlining CPUs: ignore errors...
./maxPerf.sh: 19: ./maxPerf.sh: cannot create /sys/devices/system/cpu/cpu0/online: Directory nonexistent
sh: echo: I/O error
sh: echo: I/O error
sh: echo: I/O error
Online CPUs: 0-3
CPU0: 2014500
CPU1: 2014500
CPU2: 2014500
CPU3: 2014500
GPU: 998400000
EMC: 1600000000

Those errors from the script are spurious (all the CPU cores should already be online, etc.), but you can keep an eye on tegrastats to verify:

ubuntu@tegra-ubuntu:$ ~/tegrastats
RAM 129/3854MB (lfb 781x4MB) SWAP 0/0MB (cached 0MB) cpu [2%,0%,0%,0%]@102 EMC 5%@40 AVP 3%@80 VDE 0 GR3D 0%@38 EDP limit 1912
RAM 129/3854MB (lfb 781x4MB) SWAP 0/0MB (cached 0MB) cpu [2%,0%,0%,0%]@102 EMC 5%@40 AVP 3%@80 VDE 0 GR3D 0%@38 EDP limit 1912
RAM 129/3854MB (lfb 781x4MB) SWAP 0/0MB (cached 0MB) cpu [4%,0%,0%,0%]@102 EMC 5%@40 AVP 3%@80 VDE 0 GR3D 0%@38 EDP limit 1912
RAM 129/3854MB (lfb 781x4MB) SWAP 0/0MB (cached 0MB) cpu [2%,0%,0%,0%]@102 EMC 5%@40 AVP 3%@80 VDE 0 GR3D 0%@38 EDP limit 1912

Here is what the acronyms that tegrastats reports stand for:

EMC – memory controller
AVP – audio/video processor
VDE – video decoder engine
GR3D – GPU
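
If you only care about one of those fields while a benchmark runs, you can grep it out of the tegrastats lines. A sketch for the GR3D field, with the line format assumed from the output pasted above:

```shell
# Extract the GR3D (GPU) "utilization%@clock" field from a tegrastats line.
# The field layout is assumed from the sample output above.
line='RAM 129/3854MB (lfb 781x4MB) SWAP 0/0MB (cached 0MB) cpu [2%,0%,0%,0%]@102 EMC 5%@40 AVP 3%@80 VDE 0 GR3D 0%@38 EDP limit 1912'
echo "$line" | grep -o 'GR3D [0-9]*%@[0-9]*'
```

Piping live tegrastats output through the same grep (with --line-buffered) shows GPU load in real time while a benchmark runs.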

Where can I find "tegrastats"?

I did get 309 GFLOPS for the "nbody" sample on the Tegra X1, i.e., twice the GFLOPS of the Tegra K1.

Huh? I get 259 GFLOPS from the K1 with "nbody -benchmark" (also on a CB5).

Good catch!

I followed the link above from dusty_nv:

https://devtalk.nvidia.com/default/topic/894945/jetson-embedded-systems/jetson-tx1/post/4740508/#4740508

to this link:

http://www.slothparadise.com/how-to-install-cuda-on-nvidia-jetson-tx1/

which refers to this link reporting 157 GFLOPS for the TK1:

https://www.pugetsystems.com/labs/articles/NVIDIA-Jetson-TK1-CUDA-performance-569/

I just tried it and did get 259 GFLOPS from the TK1-based Chromebook CB5.

It seems the TX1 still needs optimization.

I re-ran "nbody -benchmark -numbodies=65536" on both the Shield TV (X1) and the CB5 (K1) and got:

X1: 315 GFLOPS

K1: 311 GFLOPS

What is missing for X1?
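
For reference on how nbody arrives at those figures: the sample reports single-precision GFLOP/s at 20 flops per body-pair interaction, so the GFLOPS number is just interactions-per-second times 20. A quick awk check of what 315 GFLOPS implies:

```shell
# nbody reports single-precision GFLOP/s at 20 flops per body-pair
# interaction, so interactions/s = GFLOPS / 20 (in billions).
awk 'BEGIN { printf "%.2f billion interactions/s\n", 315 / 20 }'
```

So 315 GFLOPS corresponds to 15.75 billion interactions per second, which is the other number the benchmark prints.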

From what I've read, the Chromebook has different memory with much higher bandwidth than the regular Tegra K1; this may well account for the performance difference, as even the nbody problem can be memory bound.

I couldn't find a source regarding the memory with some quick googling, though.

Anyhow, ~157 GFLOPS for nbody is pretty standard on the Jetson TK1.

313 GFLOPS on my machine.

Is your "machine" a Jetson TX1? I do not have a TX1; I used a Shield TV and was worried that the Shield TV's SDRAM might be too small (3 GB) and/or too slow.

For both the TK1 and the TX1, the CPU/GPU clocks must be maximized before running the benchmark tests, as shown in the links in multiple places.

It seems those clock-maximizing scripts need to be re-run after every power-up.
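
To avoid re-running it by hand, one option is to hook it into boot. A sketch of a config fragment, assuming the script was saved as /home/ubuntu/x1/maxPerf.sh (that path is a guess from the shell prompts above) and that your L4T image still processes /etc/rc.local at startup:

```shell
# Hypothetical /etc/rc.local fragment: run the max-perf script at boot.
# The path below is an assumption; adjust to wherever maxPerf.sh lives.
# Keep this line above the final "exit 0" of /etc/rc.local.
/home/ubuntu/x1/maxPerf.sh || true
```

The `|| true` keeps boot from failing if some of the sysfs entries are missing, as seen in the error output above.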

Yes, Jetson TX1.

I'm glad the $199 Shield TV with only 3 GB of RAM gets the same performance as the TX1 in this test.

I did notice the "boxFilter" sample ran much faster on the X1 than the K1.

Oh well, nbody might not be the best generic benchmark then - let’s try something else:

~/6.5_Samples/0_Simple/matrixMulCUBLAS$ ./matrixMulCUBLAS

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GK20A" with compute capability 3.2

MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Computing result using CUBLAS...done.
Performance= 223.12 GFlop/s, Time= 0.587 msec, Size= 131072000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

(That's still on a CB5, apparently with a wider memory bus than the Jetson.)
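
The GFlop/s figure there is easy to cross-check: matrixMulCUBLAS counts 2·m·n·k flops for the multiply, which (reading the printed sizes as a 320x640x320 product) reproduces the 131072000 Ops exactly, and dividing by the 0.587 ms runtime lands on ~223 GFlop/s; the last decimal differs from 223.12 only because the printed time is rounded:

```shell
# matrixMulCUBLAS counts 2*m*n*k flops for C = A*B.
# 2 * 320 * 640 * 320 reproduces the printed Size = 131072000 Ops;
# dividing by the (rounded) 0.587 ms runtime gives ~223 GFlop/s.
awk 'BEGIN {
    ops = 2 * 320 * 640 * 320
    printf "Ops = %d\n", ops
    printf "GFlop/s = %.1f\n", ops / 0.587e-3 / 1e9
}'
```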

For "./matrixMulCUBLAS":

My Shield TV X1 showed only 153 GFLOPS.
My CB5 K1 showed 207 GFLOPS.

On the other hand, for "./boxFilter -benchmark":

My Shield TV X1 showed 410 M RGBA Pixels/s.
My CB5 K1 showed only 37 M RGBA Pixels/s.

I compiled Blender and tested the BMW scene in Cycles.

It can do this in 9:48 (BVH building alone on the CPU takes 55 seconds; post-processing takes 20-ish seconds).

A high-end desktop card can do this in under 30 seconds; there, both building and post-processing take 1-2 seconds.

http://www.pasteall.org/pic/show.php?id=96308

I ran convolutionFFT2D; the results:

K1: 114 MPixel/s

X1: 250 MPixel/s