CUDA Kernel runs much slower on TX1 than on discrete GPU

I am new to CUDA programming, but I have run into an issue that does not make sense to me: my CUDA kernel runs 50x-100x slower than I think it should on both a Jetson TK1 and a TX1.

I have run the same code on the following devices (native compilation in each case):

  • A GT 540M (compute capability 2.1, 96 cores, 1 GB of memory): around 1.4 seconds.
  • A Quadro K6000 (compute capability 3.5, more cores than my program could even fill, 12 GB of memory): around 0.15 seconds.
  • A TK1 (compute capability 3.2, 192 cores): 104 seconds, 74x slower than the GT 540M.
  • A TX1 (compute capability 5.3, 256 cores): 88 seconds, 62x slower than the GT 540M. Also, only on the TX1, dmesg says:
nvmap_pgprot: PID 6467: a.out: TAG: 0x0800 WARNING: NVMAP_HANDLE_WRITE_COMBINE should be used in place of NVMAP_HANDLE_UNCACHEABLE on ARM64

I don’t understand what would cause such a slowdown. If anything, I would have expected both the TK1 and the TX1 to outperform the GT 540M laptop GPU. The fact that the TK1 and the TX1 suffer a similar performance penalty, even though the Quadro K6000’s compute capability falls between theirs, suggests the problem comes from naively porting CUDA code from a discrete GPU to an integrated one, but I don’t know what I need to do differently.

Could the dmesg warning have anything to do with this severe performance gap? Or is there something else really big that I am missing?

Have you tried running the jetson_max_l4t.sh script before? Available here: https://github.com/dusty-nv/jetson-scripts/blob/master/jetson_max_l4t.sh

It increases the clock speeds that the performance governor is allowed to use to their maximum rates. In particular for GPU perf, the relevant commands are:

# max GPU clock (should read from debugfs)
cat /sys/kernel/debug/clock/gbus/max > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state
echo "GPU: `cat /sys/kernel/debug/clock/gbus/rate`"

Thank you Dusty. That script brought down the run times to 6.8 seconds on the TX1 and 8.8 seconds on the TK1, about 12x faster.

In a general sense, am I wrong to expect the TX1 to outperform an older laptop GPU with fewer than half as many cores? I guess I will try profiling the code on each platform to see what is taking so long, but it still seems a bit strange to me.

One of the apples-and-oranges issues in this comparison is that Jetson has no dedicated GPU memory; the integrated GPU shares system RAM with the CPU, whereas laptop and desktop GPUs usually have their own dedicated memory.

If your application is not compute bound, its performance will scale with memory bandwidth rather than with NumberCores * FREQ.
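
As a concrete illustration (a generic sketch, not your code): a SAXPY-style kernel performs 2 FLOPs per element but moves 12 bytes, so its runtime is dictated almost entirely by memory bandwidth, and extra cores or clock barely help once the bus is saturated.

#include <cuda_runtime.h>

// Bandwidth-bound example: 2 FLOPs per element vs. 12 bytes of traffic
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));   // contents irrelevant for a timing sketch
    cudaMalloc(&y, n * sizeof(float));
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}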

Does this restriction apply to shared memory as well or just global memory? Also, is there a document or something I can read to learn more about targeting a Tegra device in my code?

That makes sense. Is that true for shared memory as well? When I profiled my code running on the laptop, it was limited by a huge number of shared memory accesses. It’s relatively light on global memory.

Thanks for the responses; I have a lot left to learn about all of this.

This article gives a fairly good idea of how the memory setup changes performance on Jetson versus a PCIe card with dedicated memory:
http://arrayfire.com/zero-copy-on-tegra-k1/
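
For reference, the zero-copy pattern that article describes looks roughly like this (a minimal sketch, assuming the default device): since the CPU and GPU on Tegra share the same physical DRAM, mapped pinned memory lets a kernel operate on host memory directly instead of copying across a bus that is not there.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= s;
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must be set before the context is created

    const int n = 1 << 20;
    float *h_ptr, *d_ptr;
    cudaHostAlloc((void**)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);

    for (int i = 0; i < n; ++i)
        h_ptr[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(d_ptr, n, 2.0f);
    cudaDeviceSynchronize();

    printf("h_ptr[0] = %f\n", h_ptr[0]);     // the GPU wrote straight into host memory
    cudaFreeHost(h_ptr);
    return 0;
}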

The SMs of the GPU in the TX1 should have at least as many shared memory load/store units as those in the aging 540M, so shared memory should not be a bigger bottleneck on the TX1; it does not explain the performance difference. Your bottleneck is very likely somewhere else.

Are you spilling any registers? Fermi, Kepler, and Maxwell all handle register spilling a bit differently.
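
An easy way to check is to ask ptxas for its resource usage report at compile time (the file name and arch here are just placeholders):

nvcc -arch=sm_53 -Xptxas -v my_kernel.cu

For each kernel, ptxas prints the register count plus lines like “0 bytes spill stores, 0 bytes spill loads”; nonzero spill numbers mean you are spilling.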

In my experience, code written and tuned for the Tegra K1 scales rather linearly with theoretical bandwidth or compute performance (GFLOP/s, NbCores * FREQ), depending on whether the kernel is compute or bandwidth bound.
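
Back-of-envelope numbers from the public spec sheets (rough figures; exact clocks vary by board and power mode):

GT 540M: 96 cores x 2 FLOPs x ~1.34 GHz ≈ 258 GFLOP/s peak FP32, ~28.8 GB/s dedicated DDR3
TX1:    256 cores x 2 FLOPs x ~1.0 GHz  ≈ 512 GFLOP/s peak FP32, ~25.6 GB/s LPDDR4, shared with the CPU

So the TX1 should be roughly 2x faster on a compute-bound kernel and roughly tie (or lose slightly, since the CPU shares the bus) on a bandwidth-bound one; neither case predicts a 60x gap.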

Hence your results look rather strange next to the GT 540M numbers.

Profile your kernels and work out their utilization relative to theoretical bandwidth and compute.
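
nvprof can report those figures directly; for example (metric names from the nvprof of that era, binary name is a placeholder):

nvprof --metrics achieved_occupancy,gld_throughput,gst_throughput,shared_load_throughput ./your_app

Then compare the reported throughputs against the device’s theoretical peaks to see which resource you are actually saturating.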

Ok, thanks a bunch guys. I’ll look into those suggestions.