I am new to CUDA programming, and I have run into an issue that does not make sense to me: my CUDA kernel runs roughly 50x-100x slower than I would expect on both a Jetson TK1 and a Jetson TX1.
I have run the same code on the following devices (native compilation in each case; a sketch of my timing code follows the list):
- A GT 540M (compute capability 2.1, 96 cores, 1 GB memory): around 1.4 seconds.
- A Quadro K6000 (compute capability 3.5, more cores than my program could even fill, 12 GB memory): around 0.15 seconds.
- A TK1 (compute capability 3.2, 192 cores): 104 seconds, about 74x slower than the GT 540M.
- A TX1 (compute capability 5.3, 256 cores): 88 seconds, about 62x slower than the GT 540M. Also, on the TX1 only, dmesg reports:
```
nvmap_pgprot: PID 6467: a.out: TAG: 0x0800 WARNING: NVMAP_HANDLE_WRITE_COMBINE should be used in place of NVMAP_HANDLE_UNCACHEABLE on ARM64
```
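For reference, I am timing the kernel with CUDA events, roughly like this. This is a minimal sketch rather than my actual code; `myKernel`, the launch dimensions, and the buffer sizes are placeholders:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel -- my real kernel is more involved.
__global__ void myKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main()
{
    const int n = 1 << 20;  // placeholder problem size
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_out, d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Since I synchronize on the stop event before reading the elapsed time, I don't believe I am measuring an unfinished launch.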
I don't understand what would cause such a slowdown. If anything, I would have expected both the TK1 and the TX1 to outperform the GT 540M laptop GPU. The TK1 and the TX1 show a similar performance penalty even though the Quadro K6000, whose compute capability (3.5) sits between theirs, runs fast, so compute capability alone cannot explain it. That tells me the issue is related to naively porting CUDA code from a discrete GPU to an integrated GPU, but I don't know what I need to do differently.
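In case it matters, my host/device memory handling follows the standard discrete-GPU pattern. Here is a stripped-down sketch of the kind of thing I am doing (again, names and sizes are placeholders, and it reuses the placeholder `myKernel` from the timing sketch above):

```cuda
#include <vector>
#include <cuda_runtime.h>

void runOnce(int n)
{
    // Host staging buffers.
    std::vector<float> h_in(n, 1.0f), h_out(n);

    // Explicit device buffers, as on a discrete GPU...
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // ...with explicit copies in both directions.
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    myKernel<<<(n + 255) / 256, 256>>>(d_out, d_in, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}
```

Since the CPU and GPU on the Tegra boards share physical memory, I wonder whether this copy-in/copy-out pattern (or the memory caching attributes that the nvmap warning hints at) is part of what I should be doing differently.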
Could the dmesg warning have anything to do with this severe performance gap? Or is there something else really big that I am missing?