Jetson TX2 Performance

Can anyone tell me the approximate number of GFLOPS the Jetson TX2 is capable of for 32 bit and 64 bit floats, respectively? I am considering purchasing one to experiment with GPU programming, and am having trouble finding these figures on the web.

Is this a trick question? The very first Google hit leads to:

http://www.aetina.com.tw/products-detail.php?i=210
Single-Precision Floating Point (GFLOPS) 1.5 TFLOPS
Memory Bandwidth (GB/sec) 58.3

This is a Pascal-family device, so double-precision throughput should be 1/32 of single-precision throughput, i.e., about 46.9 GFLOPS. A few results further down the list of Google hits I find this:

http://www.electronicdesign.com/industrial/dnn-popularity-drives-nvidia-s-jetson-tx2
The increased interest in deep neural nets (DNNs) and deep learning are driving the popularity of platforms like NVidia’s Jetson TX2 (Fig. 1). The 50-mm by 87-mm Jetson TX2 module has a 256-core NVIDIA Pascal GPU, a pair of 64-bit NVIDIA Denver 2 ARM-compatible cores, and four 64-bit ARM A57 cores. It delivers 2 TFLOPs of single precision performance.

No idea why there is a discrepancy in the TFLOPS numbers. Maybe the latter number sums the FLOPS of the GPU and the CPU cores?

Not a trick question. Like you, I was confused about the discrepancy between the numbers that I read on various pages. As you cited, the first Google hit says “1.5 TFLOPS.” Another I found was less specific and said, “more than a TFLOP.” And then there’s the second one you reference that says “2 TFLOP.”

“This is a Pascal-family device, so double-precision throughput should be 1/32 of single-precision throughput, i.e., about 46.9 GFLOPS.”

Actually, it looks like there are several models in the Pascal line, each of which varies considerably in its performance characteristics (this was the source of my confusion as well). For example, on the GP100, double-precision throughput is 1/2 of single precision, while on the GP102 that’s on the TX2, it is 1/32.

https://en.wikipedia.org/wiki/Pascal_(microarchitecture)

The specs shown in the “very first Google hit” do not explicitly mention that the TX2 uses the GP102 (some more digging just now yields that information).

TX2 doesn’t use the GP102

What does the TX2 use, then? I really can’t seem to find a reference anywhere that makes this explicit.

GP102 is a standalone PCIE GPU, with a compute capability of cc 6.1

TX2 is a SOC, with an embedded pascal-family GPU.

The TX2 GPU is a compute-capability 6.2 GPU with 256 cores, arranged into 2 cc 6.x SMs. (Note that the Pascal-family SM design is bifurcated between the cc 6.0 SM design and the cc “6.x” (i.e., 6.1 and 6.2) SM design, with further differences between 6.1 and 6.2.)

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability

Computing performance for the GPU component of the TX2 SOC follows similar arithmetic to any CUDA GPU performance computation. Referring to this table:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

we can use arithmetic described generally here:

https://stackoverflow.com/questions/43478827/how-to-determine-if-my-gpu-does-16-32-64-bit-arithmetic-operations/43479468#43479468

to calculate peak theoretical arithmetic throughput of any particular CUDA GPU.

The arithmetic is:

# of SMs * # of FP32 units per SM * 2 * clock rate

(for single precision)

For double precision, we just use the # of FP64 units per SM in place of the FP32 units in the above formula. The * 2 in the formula accounts for the fact that peak throughput is obtained by scheduling FFMA instructions, which count as 2 floating-point ops each.

For the TX2, then, the FP32 throughput is 2 * 128 * 2 = 512 * clock rate

For FP64 throughput, it is 1/32 of this number, so 16 * clock rate

This doesn’t account for throughput contributions from any other source on the SOC, such as the ARM cores. Note that in practice, you would only be able to get close to these numbers for certain carefully chosen types of arithmetic operations, such as dense matrix-matrix multiply.

The above is an attempt to answer the original question you posed in this thread (“for 32 bit and 64 bit floats”).
However, when looking at numbers on the web, be advised that numbers without careful qualification usually refer to FP16 throughput, for the GPU-family processors that support FP16 as a full-rate option. cc 6.0 and cc 6.2 GPUs are in this category, so numbers like the 1.5 TFLOPS figure are almost certainly referring to peak FP16 throughput, which will be double again the FP32 number computed above (for this particular cc 6.2 GPU). Refer to the table previously given for instruction throughputs per clock per SM.

NVIDIA’s blog contains relevant useful specs, including clock rates:

https://devblogs.nvidia.com/parallelforall/jetson-tx2-delivers-twice-intelligence-edge/

As already mentioned, numbers on the web may also be aggregating peak throughputs from several different sources on the SOC, such as GPU and ARM cores, to name just two.

@dialtr: You asked for an approximate number, so I assume it shouldn’t matter whether it’s 1.5 TFLOPS or 2 TFLOPS. Realistically, the most common bottleneck of the part is likely not the computational throughput, but the memory bandwidth it provides.

@txbob: OK, you got me confused now. If the GPU in the TX2 is not equivalent to the GP102, surely it’s equivalent to some other variant of Pascal where DP throughput is 1/32 of SP? [Later:] Two ships crossing in the night :-) While I was typing, you posted a detailed explanation.

Thanks for the detailed answer!

// Take the output of the last layer; the result is the returned probability for each class
    Blob<float>* output_layer = net_->output_blobs()[0];
    const float* data = output_layer->cpu_data();
    int length = output_layer->height()*output_layer->width();
    int lengthx2 = output_layer->height()*output_layer->width()*2;
    int channel = 0;
    const float *max = NULL;

    int64_t et6=cv::getTickCount();
    for(int i = length/3;i<length;i++){

            channel = 0;
            max = (data+i);

            if(*(data+i) < *(data+i+length))
            {
                max = (data+i+length);
                channel = 255;
            }

            if(*max < *(data+i+lengthx2))
            {
                channel = 0;
            }

            uchar * row = show.ptr<uchar>(i/640);  // renamed to avoid shadowing the outer 'data' pointer
            row[i%640] = channel;
    }
    int64_t et5=cv::getTickCount();
    std::cout<<"for test time =  " <<(et5-et6)*1000.0/cv::getTickFrequency()<<std::endl;

I run this code on my computer in only 1.5 ms, but on the TX2 it takes 40 ms. Why is the computational efficiency so different? Can anyone give me some advice?

Because the ARM core in your TX2 is slower than the core in your computer processor.
If you look carefully at published TX2 (Jetson) info, you will find instructions for increasing the throughput of the ARM cores.

https://elinux.org/Jetson/Performance

Hi txbob

(same code as posted above)

I run this code on the TX2 with JetPack 3.1 in 40 ms, but on the TX2 with JetPack 3.0 it takes 180 ms. Why is the computational efficiency different? Can you give me some advice?

A speedup with a newer software version could be due to compiler improvements, but the magnitude of the difference suggests the code was not compiled with the same compiler settings. The people in the TX2 subforum are likely able to provide better suggestions (maybe JetPack 3.1 also came with a new BSP that enabled higher clock frequencies than the previous version, who knows?):

https://devtalk.nvidia.com/default/board/188/jetson-tx2/

hi:

    Blob<float>* output_layer = net_->output_blobs()[0];
    const float* data = output_layer->cpu_data();

    memcpy(class_data_,data,640*480*3*sizeof(float));
    int length = 640*480;
    int lengthx2 = 640*480*2;
    int channel = 0;
    const float *max = NULL;

    int64_t et6=cv::getTickCount();
    for(int i = length/3;i<length;i++){

            channel = 0;
            max = (class_data_+i);

            if(*(class_data_+i) < *(class_data_+i+length))
            {
                max = (class_data_+i+length);
                channel = 255;
            }

            if(*max < *(class_data_+i+lengthx2))
            {
                channel = 0;
            }

            uchar * show_data = show.ptr<uchar>(i/640);
            show_data[i%640] = channel;
    }
    int64_t et5=cv::getTickCount();

I changed my code like this (memcpy(class_data_, data, 640*480*3*sizeof(float))), and I got the data from class_data_ for the calculation. Then I found that the time is approximately 10 ms. So does that mean the GPU and CPU don't share common memory?