Is Jetson TX2 slower than a compute-capability 5.0 device?

I ran the matrixMulCUBLAS sample from the CUDA samples on a Jetson TX2 and got the following result in the terminal:
Performance= 77.83GFlop/s, Time= 2.526 msec, Size= 196608000 Ops

I then ran the same sample on a laptop with a GeForce 940M and got the following result:
Performance= 443.65 GFlop/s, Time= 0.443 msec, Size= 196608000 Ops

It seems that the 940M performs better than the Jetson TX2. But the 940M is a compute-capability 5.0 GPU, and the TX2 is cc 6.2.
Is a device with a higher cc supposed to be faster?


Please run the command below to maximize the governors:

sudo ./

Running samples on JetPack3.1/TX2:

nvidia@tegra-ubuntu:~/NVIDIA_CUDA-8.0_Samples/0_Simple/matrixMulCUBLAS$ ./matrixMulCUBLAS 
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "NVIDIA Tegra X2" with compute capability 6.2

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 404.90 GFlop/s, Time= 0.486 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS
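As a rough check on how to read these logs, the Performance figure is consistent with the operation count divided by the elapsed time, and the Size value matches 2 x 640 x 480 x 320 (one multiply and one add per term of the matrix product). The sketch below recomputes the three results quoted in this thread; the function names are illustrative and are not taken from the sample's source.

```python
def gemm_ops(m: int, k: int, n: int) -> int:
    """Floating-point operations for an (m x k) by (k x n) matrix product,
    counting one multiply and one add per term."""
    return 2 * m * k * n


def gflops(ops: int, msec: float) -> float:
    """Convert an operation count and a time in milliseconds to GFlop/s."""
    return ops / (msec * 1e-3) / 1e9


# Dimensions from the log: MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
ops = gemm_ops(640, 480, 320)
print(f"Size = {ops} Ops")

# Times reported in this thread (msec)
runs = [
    ("TX2, default clocks", 2.526),
    ("GeForce 940M", 0.443),
    ("TX2, maximized clocks", 0.486),
]
for label, msec in runs:
    print(f"{label}: {gflops(ops, msec):.1f} GFlop/s")
```

The recomputed values differ slightly from the logged ones because the logged times are rounded to three decimals.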

@carolyuu, I maximized the governors as you said and got the same result as you. So on this sample the performance of the TX2 is similar to that of the 940M.
My first question: isn’t the TX2 supposed to be faster than a compute-capability 5.0 GPU?

I have also read up on this and run sudo ./ What I read says that the CPUs on the TX2 have several working modes and that the user can switch between them manually.

Does the GPU on the TX2 have different modes?
Is there any way to enable the maximum GPU clock rate?


The TX2 is based on the Pascal GPU architecture, which should be faster than the Maxwell (5.x) architecture.
But it is not appropriate to compare a desktop GPU with an embedded GPU. The number of processors is entirely different.
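To illustrate why compute capability alone does not determine speed, a common rule of thumb puts theoretical FP32 peak at 2 x CUDA cores x clock, since each core can issue one fused multiply-add per cycle. The sketch below applies it to the TX2 results from this thread. The core count (256) and maximum GPU clock (about 1.3 GHz) are assumptions taken from public spec sheets, not from this thread; check them with the deviceQuery sample and substitute your own GPU's values to compare.

```python
def peak_fp32_gflops(cuda_cores: int, clock_ghz: float) -> float:
    """Theoretical FP32 peak, assuming one fused multiply-add
    (2 floating-point operations) per core per cycle."""
    return 2 * cuda_cores * clock_ghz


# ASSUMED TX2 figures (not stated in this thread); verify with deviceQuery.
TX2_CUDA_CORES = 256
TX2_MAX_CLOCK_GHZ = 1.3

peak = peak_fp32_gflops(TX2_CUDA_CORES, TX2_MAX_CLOCK_GHZ)

# Measured matrixMulCUBLAS results quoted in this thread (GFlop/s)
measured = [
    ("TX2, default clocks", 77.83),
    ("TX2, maximized clocks", 404.90),
]
for label, value in measured:
    print(f"{label}: {value:.2f} of {peak:.1f} GFlop/s peak ({value / peak:.0%})")
```

Under these assumptions the architecture generation does not appear in the formula at all: core count and clock rate set the ceiling, which is why a newer embedded part can land near an older discrete one.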

It’s recommended to compare the TX2 with the TX1, which is also an embedded-class GPU.
Here is a performance report for your reference:

For working modes, you can check this page for details:

For the maximum clock rate, please run jetson_clocks.

sudo ./