OpenACC Interoperability - cufft: Runtime error and wrong results when calling cufft from an OpenACC data region

Hi everyone,

I’m trying to use #cufft with #openacc for the first time. The following simple example reproduces my problem. The input is an array (a) of 10 real-valued elements, each initialized to 1, and the output (b) should be its Fourier transform (b should be all zeros except for b[0] = 10).
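
For reference, the expected result can be checked on the CPU without cuFFT. The following is only a minimal sketch: `naive_dft` is a hypothetical helper written for illustration, and it assumes the same unnormalized forward convention as CUFFT_FORWARD, X[k] = sum_j x[j] * exp(-2*pi*i*j*k/n).

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(n^2) forward DFT (hypothetical reference helper, not part of cuFFT).
// Computes X[k] = sum_j x[j] * exp(-2*pi*i*j*k/n) with no normalization.
std::vector<std::complex<double>> naive_dft(const std::vector<std::complex<double>>& x)
{
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<std::complex<double>> X(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // Twiddle factor for input index j and output index k.
            const double angle = -two_pi * static_cast<double>(j * k) / static_cast<double>(n);
            sum += x[j] * std::polar(1.0, angle);
        }
        X[k] = sum;
    }
    return X;
}
```

Calling `naive_dft` on ten elements equal to 1 should give 10 in the first bin and values at round-off level in the others, which is what b is expected to hold after the GPU transform.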

However, I encounter a runtime error at the line where cufftPlan1d is called (please see the code below). When the CHECK_CUFFT macro is not used, the program runs to completion, but with the wrong values b[i] = 0 for i = 0, ..., 9, which are simply the initial values of b.
I have two questions:

1- What is the mistake I made that prevents me from getting the correct answer?
2- How to print out the runtime error, or make some additional diagnostic that allows me to figure out this type of error?

I compiled the code as follows:

#!/bin/bash
compileNV=/opt/nvidia/hpc_sdk/Linux_x86_64/2023/compilers/bin/nvc++
#FLAGS="-O2 -g -gopt -Kieee -Minfo=accel -acc=noautoptr,sync,gpu -gpu=cc75,cuda10.2,lineinfo,ptxinfo -cudalib=curand"
FLAGS="-O2 -lm -lcufft -march=native -g -gopt -Kieee -Minfo=accel -acc=noautopar,sync,gpu -gpu=cc75,cuda12.0,lineinfo,ptxinfo -cudalib=curand"

#export NVCOMPILER_ACC_NOTIFY=3
$compileNV -o 1DFFT_GPU example1DFFT_GPU.cpp $FLAGS

Here is the source code:

/* Example of the problem : example1DFFT_GPU.cpp */
#include <iostream>
#include <vector>
#include <cmath>
#include <complex>

#include "openacc.h"

#include <cuda_runtime.h>
#include <cufftXt.h>

#include "../common.h"

int main()
{   

    const int m = 10;
    std::complex<float> a[m];
    std::complex<float> b[m];

    cufftHandle plan1D; 
    
    for(unsigned int i = 0; i < m; i++)
    {
        a[i] = 1.0;
    }    
    for(unsigned int i = 0; i < m; i++)
    {
        std::cout << a[i].real() << " , " << a[i].imag() << "\n";
    }
    #pragma acc data copyin(a[0 : m]) copyout(b[0 : m])
    {
        CHECK_CUFFT(cufftPlan1d(&plan1D, m, CUFFT_C2C, 1));
        CHECK_CUFFT(cufftSetStream(plan1D, (cudaStream_t) acc_get_cuda_stream(acc_async_sync)));

        #pragma acc host_data use_device(a, b)
        {
            CHECK_CUFFT(cufftExecC2C(   plan1D, 
                                        (cufftComplex *) a, 
                                        (cufftComplex *) b, 
                                        CUFFT_FORWARD));
        }
    }

    for(unsigned int i = 0; i < m; i++)
    {
        std::cout << b[i].real() << " , " << b[i].imag() << "\n";
    }

    CHECK_CUFFT(cufftDestroy(plan1D));
    return 0;
}

Here is the common.h file:

/* common.h file */
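#include <stdio.h>    /* fprintf, stderr (used by the macros below) */
#include <stdlib.h>   /* exit (used by CHECK_CUFFT) */
#include <sys/time.h>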
/**
 * @brief This piece of code is copied from 
 * 
 *      https://github.com/deeperlearning/professional-cuda-c-programming/blob/master/examples/common/common.h
 * 
*/

#ifndef _COMMON_H
#define _COMMON_H

#define CHECK(call)                                                            \
{                                                                              \
    const cudaError_t error = call;                                            \
    if (error != cudaSuccess)                                                  \
    {                                                                          \
        fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__);                 \
        fprintf(stderr, "code: %d, reason: %s\n", error,                       \
                cudaGetErrorString(error));                                    \
    }                                                                          \
}

#define CHECK_CUFFT(call)                                                      \
{                                                                              \
    cufftResult err;                                                           \
    if ( (err = (call)) != CUFFT_SUCCESS)                                      \
    {                                                                          \
        fprintf(stderr, "Got CUFFT error %d at %s:%d\n", err, __FILE__,        \
                __LINE__);                                                     \
        exit(1);                                                               \
    }                                                                          \
}
#endif // _COMMON_H

Thanks for your help.

1- What is the mistake I made that prevents me from getting the correct answer?

I’m not sure what’s wrong, since your code seems to give correct results when I run it. The only changes I made were to add a label to the print statements and to initialize “b” to 2.0, just to ensure the values are getting updated.

Note that I’ve tried multiple GPUs (V100 and A100), compiler versions, and flag sets, and all give the same correct answer.

% nvc++ -O2 -acc -cudalib=curand,cufft example1DFFT_GPU.cpp -V23.3 ; a.out
A 0: 1 , 0
A 1: 1 , 0
A 2: 1 , 0
A 3: 1 , 0
A 4: 1 , 0
A 5: 1 , 0
A 6: 1 , 0
A 7: 1 , 0
A 8: 1 , 0
A 9: 1 , 0
B 0: 10 , 0
B 1: 0 , 0
B 2: 0 , 0
B 3: 0 , 0
B 4: 0 , 0
B 5: 0 , 0
B 6: 0 , 0
B 7: 0 , 0
B 8: 0 , 0
B 9: 0 , 0

2- How to print out the runtime error, or make some additional diagnostic that allows me to figure out this type of error?

My guess is that you have the error checking correct and if you print the error code it will be “0”, i.e. success.

-Mat

Thank you for your quick reply and explanation.

I don’t know exactly what the problem is. However, your reply motivated me to change the CUDA version I’m using: I replaced cuda12.0 with cuda11.0, and now it works correctly. Here is the new script used to compile the code:

#!/bin/bash
compileNV=/opt/nvidia/hpc_sdk/Linux_x86_64/2023/compilers/bin/nvc++
#FLAGS="-O2 -g -gopt -Kieee -Minfo=accel -acc=noautoptr,sync,gpu -gpu=cc75,cuda10.2,lineinfo,ptxinfo -cudalib=curand"
FLAGS="-O2 -lm -lcufft -march=native -g -gopt -Kieee -Minfo=accel -acc=noautopar,sync,gpu -gpu=cc75,cuda11.0,lineinfo,ptxinfo -cudalib=curand"
#export NVCOMPILER_ACC_NOTIFY=3
$compileNV -o 1DFFT_GPU example1DFFT_GPU.cpp $FLAGS

What do you think?

It seems slow compared to fftw3. Any advice on how to accelerate it?

Homam

This leads me to believe that you have an 11.x CUDA driver installed rather than a 12.0 driver, which may have been the problem.
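
One way to confirm this from inside a program is to compare the versions reported by cudaDriverGetVersion and cudaRuntimeGetVersion, which both return an integer encoded as 1000 * major + 10 * minor. The sketch below only shows the decoding and a conservative comparison; `CudaVersion`, `decode_cuda_version`, and `driver_covers_runtime` are hypothetical names, and the two integers are assumed to have been obtained from those runtime calls.

```cpp
// Hypothetical holder for a decoded CUDA version.
struct CudaVersion {
    int major_part;
    int minor_part;
};

// CUDA encodes versions as 1000 * major + 10 * minor,
// e.g. 12000 for 12.0 and 11010 for 11.1.
CudaVersion decode_cuda_version(int encoded)
{
    return { encoded / 1000, (encoded % 1000) / 10 };
}

// Conservative check (an assumption of this sketch): treat the driver as
// sufficient only if its supported CUDA version is at least the runtime's.
// Minor-version compatibility within a major release may relax this.
bool driver_covers_runtime(int driver_version, int runtime_version)
{
    return driver_version >= runtime_version;
}
```

With a driver reporting CUDA 11.1 (11010), this check would reject a 12.0 runtime (12000) and accept an 11.0 runtime (11000).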

It seems slow compared to fftw3. Any advice on how to accelerate it?

A size of 10 is tiny, so all the time will be spent in overhead and data transfer. For performance testing, you’ll want much bigger sizes.

You are right. nvidia-smi shows driver 455.45.01, which supports CUDA 11.1 at most:

nvidia-smi
Tue Jun  6 00:14:20 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:04:00.0  On |                  N/A |
| 29%   42C    P8    24W / 250W |    415MiB / 11016MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2404      G   /usr/lib/xorg/Xorg                255MiB |
|    0   N/A  N/A      7675      G   /usr/bin/gnome-shell               39MiB |
|    0   N/A  N/A      8099      G   ...AAAAAAAA== --shared-files       12MiB |
|    0   N/A  N/A     10281      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A     11025      G   ...RendererForSitePerProcess       99MiB |
+-----------------------------------------------------------------------------+
