Jetson AGX Xavier FP64 performance

Hi,

I bought a Jetson AGX Xavier, installed Ubuntu 18.04 LTS, and measured FP32/FP64 performance using 1 giga (1e9) FP32/FP64 adds. My kernel function is below; there are 1000 kernel calls, and each kernel call does about 1 million FP32 or FP64 adds. There are two runs, one with float as "vtype" and another with double; I checked the PTX code and the only difference is add.f32 vs. add.f64. For performance, the run with 1e9 FP32 adds takes about 3.16 seconds, and the run with 1e9 FP64 adds takes about 31.55 seconds. So the FP64 add is about 10x slower than the FP32 add, but according to the Jetson AGX Xavier spec the FP32:FP64 GFLOPS ratio is 2:1.

So are there any options to enable fast FP64 performance?

Thanks
Henry

--------- kernel function
__global__ void
vectorAdd(vtype* const A) // --> vtype is float or double
{
    const auto i = blockDim.x * blockIdx.x + threadIdx.x;
    auto a = A[i];
    // 10 adds per loop iteration, numAddsPerThread adds in total
    for (auto j = numAddsPerThread / 10; j > 0; --j) {
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
    }
    A[i] = a;
}

Hi,
This looks like a Jetson issue; moving it to the Jetson AGX Xavier forum.

Hi,

We want to reproduce this in our environment first.
Could you share the complete source with us?

Thanks.

Thanks for the quick response, here is the source. Line 47 is for FP64 (double); please switch to line 48, "typedef float vtype;", to test FP32 (float). Thanks, ~Henry

#include <stdio.h>
// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <assert.h>

class jetson
{
public:
    class memManager
    {
    public:
        //1.652u 1.396s
        class zeroCopy
        {
        public:
            static void* malloc(const unsigned size) {assert(size); void* p = 0; checkCudaErrors(cudaMallocHost(&p, size)); assert(p); return p;}
            static void* getDevPtr(void* const p) {assert(p); void* p2 = 0; checkCudaErrors(cudaHostGetDevicePointer(&p2, p, 0)); assert(p2); return p2;}
            static void free(void* const p) {assert(p); checkCudaErrors(cudaFreeHost(p));}
        };

        //2.324u 4.496s
        class unified
        {
        public:
            static void* malloc(const unsigned size) {assert(size); void* p = 0; checkCudaErrors(cudaMallocManaged(&p, size)); assert(p); return p;}
            static void* getDevPtr(void* const p) {assert(p); return p;}
            static void free(void* const p) {assert(p); checkCudaErrors(cudaFree(p));}
        };
    };

    static const unsigned numWarps = 16;
    static const unsigned numThreadsPerWarp = 32;
    static const unsigned numThreads = numWarps * numThreadsPerWarp;

    static void sync() {cudaDeviceSynchronize();}
};

// to dump ptx, /usr/local/cuda/bin/cuobjdump gflops.o -ptx

//double: 0.044u 0.040s 0:31.55 0.2% 0+0k 0+0io 0pf+0w
//float: 0.032u 0.040s 0:03.16 2.2% 0+0k 0+0io 0pf+0w
//typedef double vtype;
typedef float vtype;

static const auto numAddsPerThread = 1000000;

__global__ void
vectorAdd(vtype* const A)
{
    const auto i = blockDim.x * blockIdx.x + threadIdx.x;
    auto a = A[i];
    for (auto j = numAddsPerThread / 10; j > 0; --j) {
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
        a += 1.;
    }
    A[i] = a;
}

/**
 * Host main routine
 */
int
main(void)
{
typedef jetson::memManager::unified mm;

// Launch the Vector Add CUDA Kernel
const unsigned numLoops = 1000;
const auto numBlocks = 1;
const auto size = numBlocks * jetson::numThreadsPerWarp;

const auto numAdds = numAddsPerThread * numLoops;

printf("num calc per thread %1.2f G\n", numAdds / 1.e9);

//cudaSetDeviceFlags(cudaDeviceMapHost);

const auto h_A = (vtype*)mm::malloc(size * sizeof(vtype));
const auto d_A = (vtype*)mm::getDevPtr(h_A); 
for (int i = 0; i < size; ++i) h_A[i] = 0.; // initialize only the allocated elements

for (auto i = 0; i < numLoops; ++i) {
    //1000x  launch: 0.060u 0.020s
    //10000x launch: 2.180u 3.372s
    vectorAdd<<<numBlocks, 1>>>(d_A);
}
jetson::sync();

const auto err = cudaGetLastError();
if (err != cudaSuccess)
{
    fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

printf("CUDA kernel done\n");


// Verify that the result vector is correct
for (int i = 0; i < size; ++i) {
    if (h_A[i] != numAdds) {
        fprintf(stderr, "Result verification failed at element %d!\n", i);
        fprintf(stderr, "%1f != %f \n", h_A[i] / 1.e9, numAdds / 1.e9);
        exit(EXIT_FAILURE);
    }
}

printf("Test PASSED\n");

mm::free(h_A);

printf("Done\n");
return 0;

}

Are there any unknown factors or difficulties in reproducing these FP32/FP64 performance numbers?
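
For reference, here is a minimal sketch of timing only the launch loop with CUDA events instead of the shell "time" command. It assumes the vectorAdd, numLoops, numBlocks, and d_A definitions from the listing above and is only an illustration, not part of the original source:

// Sketch: measure GPU time for the launch loop with CUDA events.
cudaEvent_t start, stop;
checkCudaErrors(cudaEventCreate(&start));
checkCudaErrors(cudaEventCreate(&stop));
checkCudaErrors(cudaEventRecord(start));
for (unsigned i = 0; i < numLoops; ++i)
    vectorAdd<<<numBlocks, 1>>>(d_A);
checkCudaErrors(cudaEventRecord(stop));
checkCudaErrors(cudaEventSynchronize(stop));
float ms = 0.f;
checkCudaErrors(cudaEventElapsedTime(&ms, start, stop));
printf("kernel-only time: %1.3f s\n", ms / 1.e3);

This excludes process startup, allocation, and the CPU-side launch overhead, so the FP32/FP64 comparison reflects device time only.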

Hi AastaLLL, did anyone look into this? Could you reproduce it? Did you find any issue, or can you recommend any options to resolve this? Thanks, ~Henry

Hi,

Sorry for keeping you waiting.
We tested your source on a Xavier board.

But there seems to be some issue in the verification.
Did we miss anything?

num calc per thread 1.00 G
CUDA kernel done
Result verification failed at element 0!
0.016777 != 1.000000

Thanks.

Thanks AastaLLL, please ignore the verification message. (The FP32 accumulator cannot go past 16777216 = 2^24, because adding 1 to a float of that size no longer changes it; that is why the check fails for float, but it does not affect the timing.)

You can run with "time …" and compare the two run times for FP64 (double) and FP32 (float). I got roughly 3.16 seconds and 31.55 seconds, a 10x difference, versus the 2x difference in the spec.

As you can see, the vectorAdd function is mostly "a += 1."; the run time comes from 1e9 executions of "a += 1.".

Thanks, ~Henry

Hi,

Thanks for your update.

Confirmed that we can reproduce this issue internally.
With maximum performance (nvpmodel=0 + jetson_clocks), we got about 3 s for float and 31 s for double.

We are checking this with our internal team.
Will get back to you later.

Thanks.

Also, could you share the document that mentions the 2:1 FP32:FP64 performance ratio?

  1. NVIDIA Jetson AGX Xavier GPU Specs | TechPowerUp GPU Database
  2. https://www.quora.com/What-is-NVIDIA-Jetson-AGX-Xavier-What-are-the-possible-applications-of-it

The Jetson AGX Xavier has a Volta GPU, so the FP32:FP64 GFLOPS ratio is supposed to be 2:1.

Also, I found something that may be helpful: the GTX Titan has a setting that is disabled by default (FP32:FP64 = 24:1); turning it on changes the FP32:FP64 speed ratio to 3:1.

https://codeyarns.com/tech/2013-12-05-how-to-enable-full-speed-fp64-in-nvidia-gpu.html

Hi,

Thanks.

We are checking this with our internal team.
Will get back to you soon.

Hi,

We got some feedback from our internal team.

Xavier's FP64:FP32 throughput is 4:128, which means FP64 is 32x slower in terms of peak throughput.
On the integrated GPUs we expect users to work with low-precision use cases (e.g., FP16/INT8 DL inference), so FP64 is slow by design.

Thanks.

I can see that the Jetson Nano FP32:FP64 GFLOPS ratio is 32:1 (NVIDIA Jetson Nano GPU Specs | TechPowerUp GPU Database), but for the Jetson Xavier NX and Jetson AGX Xavier the listed FP32:FP64 GFLOPS ratio is 2:1 (NVIDIA Jetson Xavier NX GPU Specs | TechPowerUp GPU Database) (NVIDIA Jetson AGX Xavier GPU Specs | TechPowerUp GPU Database).

So, seriously, does your "4:128" apply to all of these products (Nano, NX, and AGX)? Is anything wrong with those specs?

Hi,

The throughput data is from our internal team and is for the GV11B chip, which is used in both Xavier and Xavier NX.

The page you shared is a third-party link, not from NVIDIA.
Have you seen a similar figure in our official specs/docs?

Since FP64 is important for HPC workloads, some Volta Tesla-class GPUs have much higher FP64 throughput.
We are not sure whether that page applied the same data to all Volta-generation GPUs.

Thanks.
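
As a cross-check, here is a minimal sketch (assuming CUDA 8.0 or later, where cudaDeviceProp exposes singleToDoublePrecisionPerfRatio) that asks the CUDA runtime what FP32:FP64 performance ratio it reports for the device:

// Sketch: print the FP32:FP64 performance ratio reported by the CUDA runtime.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s (sm_%d%d): singleToDoublePrecisionPerfRatio = %d\n",
           prop.name, prop.major, prop.minor,
           prop.singleToDoublePrecisionPerfRatio);
    return 0;
}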

Hi AastaLLL, since the internal team has reproduced my original performance result: why is my measured ratio 1:10, which is quite different from 1:32? From cuobjdump and the PTX code, I can see that most of the kernel runtime comes from add.f32 or add.f64.

Hi AastaLLL, any explanation for my 1:10 result, which is quite different from 1:32?

Hi,

This is most likely because the test is not designed to saturate GPU throughput.
We can reproduce the 1:32 ratio with our CUDA nbody sample:

FP32

~1042 GFLOP/s

nvidia@nvidia-desktop:/usr/local/cuda-10.2/samples/5_Simulations/nbody$ ./nbody -benchmark
...
GPU Device 0: "Xavier" with compute capability 7.2

> Compute 7.2 CUDA device: [Xavier]
8192 bodies, total time for 10 iterations: 12.876 ms
= 52.121 billion interactions per second
= 1042.410 single-precision GFLOP/s at 20 flops per interaction

FP64

~31 GFLOP/s

nvidia@nvidia-desktop:/usr/local/cuda-10.2/samples/5_Simulations/nbody$ ./nbody -benchmark -fp64
...
GPU Device 0: "Xavier" with compute capability 7.2

> Compute 7.2 CUDA device: [Xavier]
8192 bodies, total time for 10 iterations: 648.288 ms
= 1.035 billion interactions per second
= 31.055 double-precision GFLOP/s at 30 flops per interaction

Thanks.
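
To connect the two numbers: the posted microbenchmark launches a single thread (<<<1, 1>>> with numBlocks = 1), so 1e9 adds in 3.16 s is about 0.32 GFLOP/s for FP32 and 1e9 adds in 31.55 s is about 0.032 GFLOP/s for FP64, both far below the peaks shown above. With one dependent add chain on one thread, the measurement reflects per-instruction latency rather than peak throughput, which is why the ratio comes out near 10:1 instead of 32:1. Below is a hedged sketch of a variant that launches enough threads to keep the SMs busy; the block/thread counts are illustrative, not from this thread:

// Sketch (illustrative, not from the thread): the same add chain, launched with
// many threads so warp scheduling can hide the add latency and the measurement
// approaches throughput rather than latency.
#include <stdio.h>
#include <cuda_runtime.h>

typedef float vtype;                        // switch to double for the FP64 run
static const int numAddsPerThread = 100000;

__global__ void vectorAdd(vtype* const A)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    vtype a = A[i];
    for (int j = numAddsPerThread / 10; j > 0; --j) {
        a += vtype(1); a += vtype(1); a += vtype(1); a += vtype(1); a += vtype(1);
        a += vtype(1); a += vtype(1); a += vtype(1); a += vtype(1); a += vtype(1);
    }
    A[i] = a;
}

int main(void)
{
    const int numBlocks = 512, threadsPerBlock = 256;
    const int n = numBlocks * threadsPerBlock;

    vtype* d_A = 0;
    cudaMalloc(&d_A, n * sizeof(vtype));
    cudaMemset(d_A, 0, n * sizeof(vtype));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorAdd<<<numBlocks, threadsPerBlock>>>(d_A);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    const double gflops = double(n) * numAddsPerThread / (ms * 1.e6);
    printf("%d threads x %d adds: %.3f ms, %.1f GFLOP/s (add only)\n",
           n, numAddsPerThread, ms, gflops);

    cudaFree(d_A);
    return 0;
}

Run once with float and once with double; with enough resident warps the measured FP32:FP64 ratio should move toward the 32:1 peak figure rather than the 10:1 latency ratio.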
