Jetson AGX Xavier FP64 performance

Hi,

I bought a Jetson AGX Xavier, installed Ubuntu 18.04 LTS, and measured FP32/FP64 performance using 1 giga (1e9) FP32/FP64 adds. My kernel function is below; there are 1000 kernel calls, and each kernel call does about 1 million FP32 or FP64 adds. There are two runs, one with float as "vtype" and another with double; I checked the PTX code and the only difference is add.f32 vs. add.f64. For performance, the run with 1e9 FP32 adds takes about 3.16 seconds, and the run with 1e9 FP64 adds takes about 31.55 seconds. So the FP64 add is about 10x slower than the FP32 add, but according to the Jetson AGX Xavier spec the FP32:FP64 GFLOPS ratio is 2:1.

So are there any options to enable fast FP64 performance?

Thanks
Henry

--------- kernel function
__global__ void
vectorAdd(vtype* const A) // --> vtype is float or double
{
    const auto i = blockDim.x * blockIdx.x + threadIdx.x;
    auto a = A[i];
    // 10 adds per loop iteration, numAddsPerThread adds in total
    for (auto j = numAddsPerThread / 10; j > 0; --j) {
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
    }
    A[i] = a;
}

Hi,
This looks like a Jetson issue; moving it to the Jetson AGX Xavier forum.

Hi,

We want to reproduce this in our environment first.
Could you share the complete source with us?

Thanks.

Thanks for the quick response, here is the source. Line 47 is for FP64 (double); please switch to line 48, "typedef float vtype;", to test FP32 (float). Thanks, ~Henry

#include <stdio.h>
// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <assert.h>

class jetson
{
public:
    class memManager
    {
    public:
        //1.652u 1.396s
        class zeroCopy
        {
        public:
            static void* malloc(const unsigned size) {assert(size); void* p = 0; checkCudaErrors(cudaMallocHost(&p, size)); assert(p); return p;}
            static void* getDevPtr(void* const p) {assert(p); void* p2 = 0; checkCudaErrors(cudaHostGetDevicePointer(&p2, p, 0)); assert(p2); return p2;}
            static void free(void* const p) {assert(p); checkCudaErrors(cudaFreeHost(p));}
        };

        //2.324u 4.496s
        class unified
        {
        public:
            static void* malloc(const unsigned size) {assert(size); void* p = 0; checkCudaErrors(cudaMallocManaged(&p, size)); assert(p); return p;}
            static void* getDevPtr(void* const p) {assert(p); return p;}
            static void free(void* const p) {assert(p); checkCudaErrors(cudaFree(p));}
        };
    };

    static const unsigned numWarps = 16;
    static const unsigned numThreadsPerWarp = 32;
    static const unsigned numThreads = numWarps * numThreadsPerWarp;

    static void sync() {cudaDeviceSynchronize();}
};

// to dump ptx, /usr/local/cuda/bin/cuobjdump gflops.o -ptx

//double: 0.044u 0.040s 0:31.55 0.2% 0+0k 0+0io 0pf+0w
//float: 0.032u 0.040s 0:03.16 2.2% 0+0k 0+0io 0pf+0w
//typedef double vtype;
typedef float vtype;

static const auto numAddsPerThread = 1000000;

__global__ void
vectorAdd(vtype* const A)
{
    const auto i = blockDim.x * blockIdx.x + threadIdx.x;
    auto a = A[i];
    for (auto j = numAddsPerThread / 10; j > 0; --j) {
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
    }
    A[i] = a;
}

/**
 * Host main routine
 */
int
main(void)
{
typedef jetson::memManager::unified mm;

// Launch the Vector Add CUDA Kernel
const unsigned numLoops = 1000;
const auto numBlocks = 1;
const auto size = numBlocks * jetson::numThreadsPerWarp;

const auto numAdds = numAddsPerThread * numLoops;

printf("num calc per thread %1.2f G\n", numAdds / 1.e9);

//cudaSetDeviceFlags(cudaDeviceMapHost);

const auto h_A = (vtype*)mm::malloc(size * sizeof(vtype));
const auto d_A = (vtype*)mm::getDevPtr(h_A); 
for (int i = 0; i < size; ++i) h_A[i] = 0.; // initialize only the allocated elements

for (auto i = 0; i < numLoops; ++i) {
    //1000x  launch: 0.060u 0.020s
    //10000x launch: 2.180u 3.372s
    vectorAdd<<<numBlocks, 1>>>(d_A);
}
jetson::sync();

const auto err = cudaGetLastError();
if (err != cudaSuccess)
{
    fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

printf("CUDA kernel done\n");


// Verify that the result vector is correct
for (int i = 0; i < size; ++i) {
    if (h_A[i] != numAdds) {
        fprintf(stderr, "Result verification failed at element %d!\n", i);
        fprintf(stderr, "%1f != %f \n", h_A[i] / 1.e9, numAdds / 1.e9);
        exit(EXIT_FAILURE);
    }
}

printf("Test PASSED\n");

mm::free(h_A);

printf("Done\n");
return 0;

}

Are there any unknown factors or difficulties in reproducing these FP32/FP64 performance numbers?
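
For reference, here is a minimal sketch of timing only the launch loop with CUDA events instead of the shell "time" command. It assumes the vectorAdd, numLoops, numBlocks, and d_A definitions from the listing above and is only an illustration, not part of the original source:

// Sketch: measure GPU time for the launch loop with CUDA events.
cudaEvent_t start, stop;
checkCudaErrors(cudaEventCreate(&start));
checkCudaErrors(cudaEventCreate(&stop));
checkCudaErrors(cudaEventRecord(start));
for (unsigned i = 0; i < numLoops; ++i)
    vectorAdd<<<numBlocks, 1>>>(d_A);
checkCudaErrors(cudaEventRecord(stop));
checkCudaErrors(cudaEventSynchronize(stop));
float ms = 0.f;
checkCudaErrors(cudaEventElapsedTime(&ms, start, stop));
printf("kernel-only time: %1.3f s\n", ms / 1.e3);

This excludes process startup, allocation, and the CPU-side launch overhead, so the FP32/FP64 comparison reflects device time only.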

Hi AastaLLL, did anyone look into this? Could you reproduce it? Did you find any issue, or can you recommend any options to resolve this? Thanks, ~Henry

Hi,

Sorry for keeping you waiting.
We tested your source on a Xavier board.

But there seems to be some issue in the verification.
Did we miss anything?

num calc per thread 1.00 G
CUDA kernel done
Result verification failed at element 0!
0.016777 != 1.000000

Thanks.

Thanks AastaLLL, please ignore the verification message. (The FP32 accumulator cannot go past 16777216 = 2^24, because adding 1 to a float of that size no longer changes it; that is why the check fails for float, but it does not affect the timing.)

You can run with "time …" and compare the two run times for FP64 (double) and FP32 (float). I got roughly 3.16 seconds and 31.55 seconds, a 10x difference, versus the 2x difference in the spec.

As you can see, the vectorAdd function is mostly "a += 1."; the run time comes from 1e9 executions of "a += 1.".

Thanks, ~Henry

Hi,

Thanks for your update.

Confirmed that we can reproduce this issue internally.
With maximum performance (nvpmodel=0 + jetson_clocks), we got about 3 s for float and 31 s for double.

We are checking this with our internal team.
Will get back to you later.

Thanks.

Also, could you share the document that mentions the 2:1 FP32:FP64 performance ratio?

  1. NVIDIA Jetson AGX Xavier GPU Specs | TechPowerUp GPU Database
  2. https://www.quora.com/What-is-NVIDIA-Jetson-AGX-Xavier-What-are-the-possible-applications-of-it

The Jetson AGX Xavier has a Volta GPU, so the FP32:FP64 GFLOPS ratio is supposed to be 2:1.

Also, I found something that may be helpful: the GTX Titan has a setting that is disabled by default (FP32:FP64 = 24:1); turning it on changes the FP32:FP64 speed ratio to 3:1.

https://codeyarns.com/tech/2013-12-05-how-to-enable-full-speed-fp64-in-nvidia-gpu.html

Hi,

Thanks.

We are checking this with our internal team.
Will get back to you soon.

Hi,

We got some feedback from our internal team.

Xavier's FP64:FP32 throughput is 4:128, which means FP64 is 32x slower in terms of peak throughput.
On the integrated GPUs we expect users to work with low-precision use cases (e.g., FP16/INT8 DL inference), so FP64 is slow by design.

Thanks.

I can see that the Jetson Nano FP32:FP64 GFLOPS ratio is 32:1 (NVIDIA Jetson Nano GPU Specs | TechPowerUp GPU Database), but for the Jetson Xavier NX and Jetson AGX Xavier the listed FP32:FP64 GFLOPS ratio is 2:1 (NVIDIA Jetson Xavier NX GPU Specs | TechPowerUp GPU Database) (NVIDIA Jetson AGX Xavier GPU Specs | TechPowerUp GPU Database).

So, seriously, does your "4:128" apply to all of these products (Nano, NX, and AGX)? Is anything wrong with those specs?

Hi,

The throughput data is from our internal team and is for the GV11B chip, which is used in both Xavier and Xavier NX.

The page you shared is a third-party link, not from NVIDIA.
Have you seen a similar figure in our official specs/docs?

Since FP64 is important for HPC workloads, some Volta Tesla-class GPUs have much higher FP64 throughput.
We are not sure whether that page applied the same data to all Volta-generation GPUs.

Thanks.
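
As a cross-check, here is a minimal sketch (assuming CUDA 8.0 or later, where cudaDeviceProp exposes singleToDoublePrecisionPerfRatio) that asks the CUDA runtime what FP32:FP64 performance ratio it reports for the device:

// Sketch: print the FP32:FP64 performance ratio reported by the CUDA runtime.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s (sm_%d%d): singleToDoublePrecisionPerfRatio = %d\n",
           prop.name, prop.major, prop.minor,
           prop.singleToDoublePrecisionPerfRatio);
    return 0;
}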

Hi AastaLLL, since the internal team has reproduced my original performance result: why is my measured ratio 1:10, which is quite different from 1:32? From cuobjdump and the PTX code, I can see that most of the kernel runtime comes from add.f32 or add.f64.

Hi AastaLLL, any explanation for my 1:10 result, which is quite different from 1:32?

Hi,

This is most likely because the test is not designed to saturate GPU throughput.
We can reproduce the 1:32 ratio with our CUDA nbody sample:

FP32

~1042 GFLOP/s

nvidia@nvidia-desktop:/usr/local/cuda-10.2/samples/5_Simulations/nbody$ ./nbody -benchmark
...
GPU Device 0: "Xavier" with compute capability 7.2

> Compute 7.2 CUDA device: [Xavier]
8192 bodies, total time for 10 iterations: 12.876 ms
= 52.121 billion interactions per second
= 1042.410 single-precision GFLOP/s at 20 flops per interaction

FP64

~31 GFLOP/s

nvidia@nvidia-desktop:/usr/local/cuda-10.2/samples/5_Simulations/nbody$ ./nbody -benchmark -fp64
...
GPU Device 0: "Xavier" with compute capability 7.2

> Compute 7.2 CUDA device: [Xavier]
8192 bodies, total time for 10 iterations: 648.288 ms
= 1.035 billion interactions per second
= 31.055 double-precision GFLOP/s at 30 flops per interaction

Thanks.
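
To connect the two numbers: the posted microbenchmark launches a single thread (<<<1, 1>>> with numBlocks = 1), so 1e9 adds in 3.16 s is about 0.32 GFLOP/s for FP32 and 1e9 adds in 31.55 s is about 0.032 GFLOP/s for FP64, both far below the peaks shown above. With one dependent add chain on one thread, the measurement reflects per-instruction latency rather than peak throughput, which is why the ratio comes out near 10:1 instead of 32:1. Below is a hedged sketch of a variant that launches enough threads to keep the SMs busy; the block/thread counts are illustrative, not from this thread:

// Sketch (illustrative, not from the thread): the same add chain, launched with
// many threads so warp scheduling can hide the add latency and the measurement
// approaches throughput rather than latency.
#include <stdio.h>
#include <cuda_runtime.h>

typedef float vtype;                        // switch to double for the FP64 run
static const int numAddsPerThread = 100000;

__global__ void vectorAdd(vtype* const A)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    vtype a = A[i];
    for (int j = numAddsPerThread / 10; j > 0; --j) {
        a += vtype(1); a += vtype(1); a += vtype(1); a += vtype(1); a += vtype(1);
        a += vtype(1); a += vtype(1); a += vtype(1); a += vtype(1); a += vtype(1);
    }
    A[i] = a;
}

int main(void)
{
    const int numBlocks = 512, threadsPerBlock = 256;
    const int n = numBlocks * threadsPerBlock;

    vtype* d_A = 0;
    cudaMalloc(&d_A, n * sizeof(vtype));
    cudaMemset(d_A, 0, n * sizeof(vtype));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorAdd<<<numBlocks, threadsPerBlock>>>(d_A);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    const double gflops = double(n) * numAddsPerThread / (ms * 1.e6);
    printf("%d threads x %d adds: %.3f ms, %.1f GFLOP/s (add only)\n",
           n, numAddsPerThread, ms, gflops);

    cudaFree(d_A);
    return 0;
}

Run once with float and once with double; with enough resident warps the measured FP32:FP64 ratio should move toward the 32:1 peak figure rather than the 10:1 latency ratio.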
