CUDA Kernel runs much slower on TX1 than on discrete GPU

I am new to CUDA programming, but I have run into an issue that does not make sense to me: my CUDA kernel runs 50x-100x slower than I think it should on both a Jetson TK1 and a TX1.

I have run the same code on the following devices (native compilation in each case):

  • A GT 540M (compute capability 2.1, 96 cores, 1 GB of memory): around 1.4 seconds.
  • A Quadro K6000 (compute capability 3.5, more cores than my program could even fill, 12 GB of memory): around 0.15 seconds.
  • A TK1 (compute capability 3.2, 192 cores): 104 seconds, 74x slower than the GT 540M.
  • A TX1 (compute capability 5.3, 256 cores): 88 seconds, 62x slower than the GT 540M. Also, only on the TX1, dmesg says:
nvmap_pgprot: PID 6467: a.out: TAG: 0x0800 WARNING: NVMAP_HANDLE_WRITE_COMBINE should be used in place of NVMAP_HANDLE_UNCACHEABLE on ARM64

I don’t understand what would cause such a slowdown. If anything, I would have expected both the TK1 and the TX1 to outperform the GT 540M laptop GPU. The fact that the TK1 and the TX1 suffer a similar performance penalty, even though the Quadro K6000’s compute capability falls between theirs, suggests the problem comes from naively porting CUDA code from a discrete GPU to an integrated one, but I don’t know what I need to do differently.

Could the dmesg warning have anything to do with this severe performance gap? Or is there something else really big that I am missing?

Have you tried running the jetson_max_l4t.sh script before? Available here: https://github.com/dusty-nv/jetson-scripts/blob/master/jetson_max_l4t.sh

It increases the clock speeds that the performance governor is allowed to use to their maximum rates. In particular for GPU perf, the relevant commands are:

# max GPU clock (should read from debugfs)
cat /sys/kernel/debug/clock/gbus/max > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state
echo "GPU: `cat /sys/kernel/debug/clock/gbus/rate`"

Thank you Dusty. That script brought down the run times to 6.8 seconds on the TX1 and 8.8 seconds on the TK1, about 12x faster.

In a general sense, am I wrong to expect the TX1 to outperform an older laptop GPU with fewer than half as many cores? I guess I will try profiling the code on each platform to see what is taking so long, but it still seems a bit strange to me.

One of the apples-and-oranges issues in this comparison is that Jetson has no dedicated GPU memory; the integrated GPU shares system RAM with the CPU, whereas laptop and desktop GPUs usually have their own dedicated memory.

If your application is not compute bound, its performance will scale with memory bandwidth rather than with NumberCores * FREQ.
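
As a concrete illustration (a generic sketch, not your code): a SAXPY-style kernel performs 2 FLOPs per element but moves 12 bytes, so its runtime is dictated almost entirely by memory bandwidth, and extra cores or clock barely help once the bus is saturated.

#include <cuda_runtime.h>

// Bandwidth-bound example: 2 FLOPs per element vs. 12 bytes of traffic
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));   // contents irrelevant for a timing sketch
    cudaMalloc(&y, n * sizeof(float));
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}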

Does this restriction apply to shared memory as well or just global memory? Also, is there a document or something I can read to learn more about targeting a Tegra device in my code?

That makes sense. Is that true for shared memory as well? When I profiled my code running on the laptop, it was limited by a huge number of shared memory accesses. It’s relatively light on global memory.

Thanks for the responses; I have a lot left to learn about all of this.

This article gives a fairly good idea of how the memory setup changes performance on Jetson versus a PCIe card with dedicated memory:
http://arrayfire.com/zero-copy-on-tegra-k1/
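
For reference, the zero-copy pattern that article describes looks roughly like this (a minimal sketch, assuming the default device): since the CPU and GPU on Tegra share the same physical DRAM, mapped pinned memory lets a kernel operate on host memory directly instead of copying across a bus that is not there.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= s;
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must be set before the context is created

    const int n = 1 << 20;
    float *h_ptr, *d_ptr;
    cudaHostAlloc((void**)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);

    for (int i = 0; i < n; ++i)
        h_ptr[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(d_ptr, n, 2.0f);
    cudaDeviceSynchronize();

    printf("h_ptr[0] = %f\n", h_ptr[0]);     // the GPU wrote straight into host memory
    cudaFreeHost(h_ptr);
    return 0;
}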

The SMs of the GPU in the TX1 should have at least as many shared memory load/store units as those in the aging 540M, so shared memory should not be a bigger bottleneck on the TX1; it does not explain the performance difference. Your bottleneck is very likely somewhere else.

Are you spilling any registers? Fermi, Kepler, and Maxwell all handle register spilling a bit differently.
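
An easy way to check is to ask ptxas for its resource usage report at compile time (the file name and arch here are just placeholders):

nvcc -arch=sm_53 -Xptxas -v my_kernel.cu

For each kernel, ptxas prints the register count plus lines like “0 bytes spill stores, 0 bytes spill loads”; nonzero spill numbers mean you are spilling.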

In my experience, code written and tuned for the Tegra K1 scales rather linearly with theoretical bandwidth or compute performance (GFLOP/s, NbCores * FREQ), depending on whether the kernel is compute or bandwidth bound.
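
Back-of-envelope numbers from the public spec sheets (rough figures; exact clocks vary by board and power mode):

GT 540M: 96 cores x 2 FLOPs x ~1.34 GHz ≈ 258 GFLOP/s peak FP32, ~28.8 GB/s dedicated DDR3
TX1:    256 cores x 2 FLOPs x ~1.0 GHz  ≈ 512 GFLOP/s peak FP32, ~25.6 GB/s LPDDR4, shared with the CPU

So the TX1 should be roughly 2x faster on a compute-bound kernel and roughly tie (or lose slightly, since the CPU shares the bus) on a bandwidth-bound one; neither case predicts a 60x gap.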

Hence your results look rather strange next to the GT 540M numbers.

Profile your kernels and work out their utilization relative to theoretical bandwidth and compute.
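
nvprof can report those figures directly; for example (metric names from the nvprof of that era, binary name is a placeholder):

nvprof --metrics achieved_occupancy,gld_throughput,gst_throughput,shared_load_throughput ./your_app

Then compare the reported throughputs against the device’s theoretical peaks to see which resource you are actually saturating.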

Ok, thanks a bunch guys. I’ll look into those suggestions.