Code slower with CUDA 11.4 than CUDA 10.2

I tried porting a program from CUDA 10.2 to CUDA 11.4, and it runs roughly twice as slow with the newer release.

Here’s a minimal example showing the issue:

#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <complex>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <cufftw.h>
#include <iostream>
#include <numeric>
#include <openacc.h>
#include <pthread.h>
#include <sstream>
#include <sys/time.h>
#include <vector>

using namespace std;

static const uint32_t SIGNAL_SIZE = 2048;
static const uint32_t SCAN_SIZE = 1000;
static const uint32_t FRAME_SIZE = SCAN_SIZE * SIGNAL_SIZE;

static const uint32_t N_FRAMES = 100;

void processFrame(const uint16_t *spectra, float *R_A,
                  const float *win, std::complex<float> *C_A, cufftHandle hand)
{

#pragma acc data present(spectra[ : FRAME_SIZE], R_A[ : FRAME_SIZE], win[ : SIGNAL_SIZE], C_A[ : FRAME_SIZE])
  {
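    // Window each raw spectrum sample and widen it to complex for the FFT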
#pragma acc parallel loop
    for (uint32_t i = 0; i < FRAME_SIZE; i++)
    {
      C_A[i] = spectra[i] * win[i % SIGNAL_SIZE];
    }
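    // Batched in-place C2C FFT: SCAN_SIZE transforms of SIGNAL_SIZE points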
    cufftResult status =
        cufftExecC2C(hand, reinterpret_cast<cufftComplex *>(C_A),
                     reinterpret_cast<cufftComplex *>(C_A), CUFFT_FORWARD);
    assert(status == CUFFT_SUCCESS);

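    // Magnitude of each transformed bin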
#pragma acc parallel loop
    for (uint32_t i = 0; i < FRAME_SIZE; i++)
    {
      R_A[i] = std::abs(C_A[i]);
    }
  }
}

int main(int argc, char const *argv[])
{
  auto spectra = std::vector<uint16_t>(FRAME_SIZE * N_FRAMES);
  auto R_A = std::vector<float>(FRAME_SIZE);
  auto C_A = std::vector<std::complex<float>>(FRAME_SIZE);
  auto win = std::vector<float>(SIGNAL_SIZE);

  cufftHandle hand;
  cufftResult status;

  status = cufftCreate(&hand);
  assert(status == CUFFT_SUCCESS);

  // Batched 1-D C2C plan: SCAN_SIZE transforms of SIGNAL_SIZE points each.
  // cufftMakePlan1d reuses the handle from cufftCreate; calling cufftPlan1d
  // here instead would allocate a second plan and leak the first.
  size_t workSize;
  status = cufftMakePlan1d(hand, SIGNAL_SIZE, CUFFT_C2C, SCAN_SIZE, &workSize);
  assert(status == CUFFT_SUCCESS);

  // Build the window in floating point. The original integer expression
  // 1 / (i + 1) * 1 / (i + 3) truncates to zero for every i.
  for (uint32_t i = 0; i < SIGNAL_SIZE; i++)
  {
    win[i] = 1.0f / (i + 1) * 1.0f / (i + 3);
  }

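  // Synthetic input covering the 12-bit sample range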
  for (uint32_t i = 0; i < FRAME_SIZE; i++)
  {
    spectra[i] = (i * 7) % 4096;
  }

  auto start = std::chrono::system_clock::now();

  for (uint32_t n = 0; n < N_FRAMES; n++)
  {
    processFrame(&spectra[n * FRAME_SIZE], &R_A[0], &win[0], &C_A[0], hand);
  }

  auto end = std::chrono::system_clock::now();
  std::chrono::duration<double> elapsed_seconds = end - start;

  std::cout << "Processing time per frame "
            << (elapsed_seconds.count() / N_FRAMES) * 1000 << " ms" << std::endl;

  cufftDestroy(hand);
}

I compile the program with the following options:

nvc++ -acc -fast -gpu=cc72,managed,cuda11.4 -Minfo=accel -Mcudalib=cufft -std=c++17 -o timing main.cpp 

I’m running the code on a Jetson AGX Xavier DevKit, using HPC SDK v20.7 for CUDA 10.2 and v21.9 for CUDA 11.4 (adjusting the cuda version in the -gpu= option to match each release).

With CUDA 10.2 I get an average of 4 ms per frame. With CUDA 11.4 I get an average of 8 ms per frame.

Am I doing something wrong?

Hi andrea,

Unfortunately, I’m not able to recreate this issue, so I don’t know what’s wrong.

On a Xavier system here, I can get your 4 ms time using 20.7 or 21.9 with the CUDA 10.2 and 11.0 cuFFT. However, cuFFT fails for me on this system with CUDA 11.4. Removing your asserts, just to see the timing with 11.4, shows no difference.

Hence I moved to a V100 system, where there’s no change in performance (~1.9 ms) from 20.7 up to 23.5 using various CUDA versions.

Maybe your performance issue is related to my cuFFT failures, but I’m not sure.

Are you able to profile the code using Nsight-Systems to determine where the difference in performance is occurring?
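
Something along these lines should capture both the CUDA and OpenACC activity (the report name is just a placeholder):

nsys profile -t cuda,openacc -o timing_report ./timing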

-Mat

@MatColgrove:

Thanks for your support. Weird that cuFFT fails on your Xavier.

I would love to profile the code, but my understanding is that profiling with unified memory is not supported on the Jetson platform. Am I wrong?

Andrea

Am I wrong?

My team focuses on data centers and big HPC systems, so I know very little about embedded systems.

Though according to the Nsight-Systems documentation, it looks like the Jetson AGX Xavier and Jetson Xavier NX are supported. I don’t see anything specific about support for codes that use UM, but you may be correct.

I just tried on the Xavier that I have access to, and it wasn’t able to produce any profile, but that’s due to a “CUPTI_ERROR_INSUFFICIENT_PRIVILEGES” error.
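
If you run into the same thing, the usual workaround for that CUPTI error is to run the profiler as root, though I haven’t verified this on the Jetson:

sudo nsys profile -t cuda,openacc -o timing_report ./timing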

You can try falling back to the OpenACC runtime profiler by setting the environment variable “NV_ACC_TIME=1”. It will only show you OpenACC kernel times, not cuFFT or UM activity, but it would still be useful to know whether the regression is coming from the OpenACC loops or elsewhere.
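
For example, using the binary name from your compile line:

NV_ACC_TIME=1 ./timing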