Jetson AGX slower than TX2

Hi,

I recently got a Jetson AGX and have been running the same application that I run on the TX2. The application is slower on the AGX.

I used nvprof on both devices to find the bottleneck, and to my surprise there are simple API calls that take 2x or 3x longer on the AGX. The first two gemv() kernels are faster on the AGX, but most kernels after that are slower. Even cuDriverGetVersion and cuInit are slower on the AGX.

Is there anything I can do about this?
I am also having this problem on the Jetson Nano.

nvprof dump from TX2:

Type Time(%) Time Calls Avg Min Max Name
GPU activities: 75.12% 231.93ms 708 327.58us 131.68us 1.2774ms void gemv2N_kernel_val<float, float, float, int=128, int=4, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
13.28% 41.012ms 354 115.85us 85.409us 354.02us void gemv2N_kernel_val<float, float, float, int=128, int=8, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
8.12% 25.081ms 23 1.0905ms 1.7920us 4.7079ms [CUDA memcpy HtoD]
1.87% 5.7641ms 5 1.1528ms 892.65us 1.4019ms sgemm_32x32x32_NN
0.63% 1.9367ms 354 5.4700us 2.4000us 26.048us kernel(float*, float*, float*, float*, float*, int)
0.47% 1.4529ms 354 4.1040us 1.7600us 17.569us sum3_kernel(float*, float*, float*, int)
0.27% 842.86us 30 28.095us 12.320us 100.93us transpose_kernel(float*, float*, int, int, int, int)
0.11% 332.23us 5 66.445us 59.713us 76.801us void ger_kernel<float, float, int=256, int=5, bool=0>(cublasGerParams<float, float>)
0.10% 319.65us 5 63.930us 58.433us 75.649us [CUDA memcpy DtoH]
0.02% 49.504us 15 3.3000us 1.7600us 11.936us [CUDA memcpy DtoD]
0.01% 26.400us 12 2.2000us 288ns 7.9360us [CUDA memset]
API calls: 75.26% 1.05386s 57 18.489ms 2.1760us 1.04343s cudaFree
15.04% 210.62ms 41 5.1371ms 9.0560us 41.427ms cudaStreamSynchronize
2.58% 36.147ms 1072 33.719us 20.703us 106.18us cudaLaunch
2.30% 32.172ms 33 974.90us 37.024us 5.5391ms cudaMemcpy
1.78% 24.938ms 738 33.791us 19.872us 94.208us cudaLaunch
0.99% 13.891ms 1068 13.006us 7.8400us 57.376us cudaEventRecord
0.72% 10.056ms 74 135.89us 11.232us 2.2524ms cudaMalloc
0.61% 8.5104ms 10 851.04us 402.24us 1.3941ms cudaMemcpy2D
0.21% 2.9699ms 6976 425ns 288ns 8.8320us cudaSetupArgument
0.10% 1.3752ms 360 3.8190us 2.4960us 22.720us cudaStreamWaitEvent
0.10% 1.3631ms 1810 753ns 416ns 28.703us cudaConfigureCall
0.05% 700.60us 201 3.4850us 2.0160us 17.344us cudaEventCreate
0.05% 632.25us 12 52.687us 11.104us 238.66us cudaMemsetAsync
0.04% 615.55us 233 2.6410us 1.5680us 6.5280us cudaEventDestroy
0.04% 614.46us 1077 570ns 320ns 2.0800us cudaGetLastError
0.04% 511.61us 47 10.885us 4.3840us 32.192us cudaStreamDestroy
0.03% 489.50us 47 10.414us 2.9120us 73.951us cudaStreamCreateWithFlags
0.03% 431.07us 67 6.4330us 2.1120us 16.128us cudaMemset
0.01% 176.38us 185 953ns 288ns 22.112us cuDeviceGetAttribute
0.00% 64.832us 32 2.0260us 1.4400us 5.1840us cudaEventCreateWithFlags
0.00% 59.104us 2 29.552us 19.744us 39.360us cudaGetDeviceProperties
0.00% 45.663us 4 11.415us 8.4160us 14.175us cudaThreadSynchronize
0.00% 26.912us 22 1.2230us 768ns 3.6480us cudaDeviceGetAttribute
0.00% 23.104us 2 11.552us 9.2480us 13.856us cuDeviceTotalMem
0.00% 12.160us 4 3.0400us 2.5280us 3.5840us cudaGetDevice
0.00% 9.0880us 1 9.0880us 9.0880us 9.0880us cudaSetDeviceFlags
0.00% 6.6560us 4 1.6640us 832ns 3.0400us cuDeviceGetCount
0.00% 3.3280us 3 1.1090us 512ns 1.5360us cuDeviceGet
0.00% 3.1360us 2 1.5680us 992ns 2.1440us cuDeviceGetName
0.00% 1.6960us 1 1.6960us 1.6960us 1.6960us cuInit
0.00% 1.3440us 1 1.3440us 1.3440us 1.3440us cuDriverGetVersion

nvprof dump from AGX:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 72.21% 189.05ms 708 267.02us 143.60us 413.76us void gemv2N_kernel_val<float, float, float, int=128, int=4, int=4, int=4, int=1, cublasGemvParams<cublasGemvTensor, cublasGemvTensor, float>>(float, float, float const )
12.33% 32.293ms 354 91.221us 89.448us 100.11us void gemv2N_kernel_val<float, float, float, int=128, int=8, int=4, int=4, int=1, cublasGemvParams<cublasGemvTensor, cublasGemvTensor, float>>(float, float, float const )
6.13% 16.037ms 24 668.19us 1.3440us 3.0363ms [CUDA memcpy HtoD]
4.89% 12.799ms 5 2.5598ms 2.5516ms 2.5715ms volta_sgemm_128x64_nn
2.29% 5.9849ms 354 16.906us 14.946us 21.986us kernel(float*, float*, float*, float*, float*, int)
1.36% 3.5677ms 354 10.078us 9.2170us 13.697us sum3_kernel(float*, float*, float*, int)
0.33% 855.43us 30 28.514us 17.601us 35.875us transpose_kernel(float*, float*, int, int, int, int)
0.23% 604.92us 5 120.98us 109.26us 129.00us [CUDA memcpy DtoH]
0.13% 343.74us 5 68.748us 59.525us 75.911us void ger_kernel<float, float, int=256, int=5, bool=0>(cublasGerParams<float, float>)
0.06% 156.85us 79 1.9850us 1.0880us 4.7360us [CUDA memset]
0.04% 110.86us 15 7.3900us 6.6560us 9.5040us [CUDA memcpy DtoD]
API calls: 84.52% 1.96575s 78 25.202ms 7.0080us 1.94681s cudaFree
4.26% 99.001ms 41 2.4147ms 9.0880us 9.4584ms cudaStreamSynchronize
3.29% 76.580ms 1072 71.436us 43.138us 331.54us cudaLaunchKernel
2.30% 53.500ms 738 72.492us 41.602us 284.02us cudaLaunchKernel
1.57% 36.452ms 74 492.59us 21.793us 6.7288ms cudaMalloc
1.30% 30.159ms 33 913.91us 135.46us 4.0889ms cudaMemcpy
1.02% 23.807ms 1068 22.291us 15.680us 124.36us cudaEventRecord
0.51% 11.880ms 67 177.31us 25.154us 524.99us cudaMemset
0.45% 10.511ms 10 1.0511ms 609.25us 2.4132ms cudaMemcpy2D
0.18% 4.2435ms 1 4.2435ms 4.2435ms 4.2435ms cudaMemcpyToSymbol
0.13% 3.0851ms 360 8.5690us 5.8250us 71.236us cudaStreamWaitEvent
0.09% 2.0943ms 201 10.419us 4.6400us 99.301us cudaEventCreate
0.08% 1.7886ms 233 7.6760us 3.2640us 56.515us cudaEventDestroy
0.05% 1.1193ms 47 23.814us 7.0090us 106.41us cudaStreamCreateWithFlags
0.05% 1.1189ms 47 23.807us 5.8890us 79.620us cudaStreamDestroy
0.04% 990.70us 1169 847ns 448ns 54.787us cudaGetLastError
0.04% 894.80us 12 74.566us 30.050us 144.71us cudaMemsetAsync
0.04% 882.76us 152 5.8070us 1.6000us 77.028us cudaFuncSetAttribute
0.03% 693.57us 189 3.6690us 2.1440us 65.188us cuDeviceGetAttribute
0.01% 338.07us 32 10.564us 4.7040us 82.757us cudaEventCreateWithFlags
0.01% 310.83us 2 155.42us 134.34us 176.49us cudaGetDeviceProperties
0.01% 163.11us 4 40.778us 8.0650us 89.029us cudaGetDevice
0.01% 149.19us 4 37.298us 2.8800us 129.64us cuDeviceGetCount
0.00% 111.88us 4 27.969us 17.057us 42.946us cudaDeviceSynchronize
0.00% 93.636us 22 4.2560us 2.8800us 11.552us cudaDeviceGetAttribute
0.00% 50.339us 1 50.339us 50.339us 50.339us cudaSetDeviceFlags
0.00% 49.540us 2 24.770us 20.322us 29.218us cuDeviceTotalMem
0.00% 27.777us 2 13.888us 3.3600us 24.417us cuDeviceGetName
0.00% 11.104us 3 3.7010us 2.4960us 4.6720us cuDeviceGet
0.00% 7.8400us 2 3.9200us 3.8080us 4.0320us cuDeviceGetUuid
0.00% 5.2490us 1 5.2490us 5.2490us 5.2490us cuInit
0.00% 3.2320us 1 3.2320us 3.2320us 3.2320us cuDriverGetVersion
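
For what it's worth, a minimal standalone check I could run on both boards would look roughly like the sketch below: timing a single cublasSgemv call (which, as far as I can tell, is what launches the gemv2N_kernel_val kernels above) with CUDA events. The 2048x2048 size is just a placeholder, not my model's real dimensions.

// Hedged sketch: time one cublasSgemv call with CUDA events.
// Built with something like: nvcc -lcublas sgemv_test.cu
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    // Placeholder problem size; not the real model dimensions.
    const int m = 2048, n = 2048;
    std::vector<float> hA(m * n, 1.0f), hx(n, 1.0f);

    float *dA, *dx, *dy;
    cudaMalloc(&dA, m * n * sizeof(float));
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, m * sizeof(float));
    cudaMemcpy(dA, hA.data(), m * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so cuBLAS initialization is not included in the timing.
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg SGEMV time: %.1f us\n", 1000.0f * ms / iters);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}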

Hi,

Have you maximized the device performance first?

sudo nvpmodel -m 0
sudo jetson_clocks.sh

Thanks.

Hi,

Maximizing the AGX performance did make my application run ~10% faster.

I did not have to do that on the Jetson TX2, yet my application still runs faster there than on the AGX, even without maximizing performance.
Can you tell me why that is the case?

I also don’t have a jetson_clocks.sh on the Jetson Nano. Is it OK to copy the one from the AGX and use it to maximize performance?

Thanks!

Actually, I take that back. It is still slow.

To give you a little background, we run a speech recognition system on the Jetson platforms.

On the Jetson TX2, my benchmark gives a processing speed of 0.7 ms per 1 ms of audio (without maximizing performance).

On the Jetson AGX, after running the same benchmark (with performance maximized), I am still at roughly 0.85 ms per 1 ms of audio.

So the AGX still seems to be slower than the Jetson TX2, taking roughly 20% more processing time per unit of audio.
Can you tell me why that is the case?

Hi,

For benchmarking, it’s recommended to maximize the performance of both devices first.
The default power mode is chosen by a different strategy on each platform, so it may not be suitable for comparison.

This result is unexpected.
It looks like you are using gemv2N_kernel_val. We will check whether we can reproduce this first.

Thanks.

It looks like a different SGEMM algorithm is being used on each device:

TX2: sgemm_32x32x32_NN
Xavier: volta_sgemm_128x64_nn

Hi,

We are not able to reproduce this issue on our side.

With the batchCUBLAS sample, Xavier achieves better performance in all of the test cases.
For example, for a single sgemm kernel:

[Xavier]
^^^^ elapsed = 0.00015593 sec  GFLOPS=26.8994
[TX2]
^^^^ elapsed = 0.01104093 sec  GFLOPS=0.379887

Is there anything missing on our side? Or could you check whether there are other bottlenecks in your implementation?

Thanks.

I will see if I can write a standalone SGEMM sample to reproduce this and get back to you; roughly what I have in mind is sketched below.
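
A minimal sketch (the 1024x1024 sizes are placeholders, not my application's real shapes); running nvprof on it should show which SGEMM kernel cuBLAS picks on each board.

// Hedged sketch of the standalone SGEMM test I plan to write.
// Built with something like: nvcc -lcublas sgemm_test.cu
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;                               // placeholder size
    std::vector<float> h(n * n, 1.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, h.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, h.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up so that library and kernel initialization is not timed.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int iters = 50;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * n * n * n;                   // FLOPs per SGEMM
    printf("avg SGEMM time: %.3f ms, ~%.1f GFLOPS\n",
           ms / iters, flops / (ms / iters) / 1e6);
    // Profiling this binary with nvprof should show which kernel cuBLAS
    // selects (e.g. sgemm_32x32x32_NN on TX2 vs volta_sgemm_128x64_nn on Xavier).

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}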

Thanks.