cuFFT callbacks not working with large amount of host-pinned memory?

Originally the question title was: “cuFFT callbacks not working for 2D cuFFT plan”, changed later on

Hello,

I’m trying to register a custom kernel that I earlier used as a pre-processing step for a cuFFT execution call as a load callback to that cuFFT execution call. I’ve read the cuFFT related parts of the CUDA Toolkit Documentation and I’ve looked at the simpleCUFFT_callback NVIDIA sample and the example in the following link:
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-use-cufft-callbacks-custom-data-processing/

I’ve been able to successfully run the example from the URL above and tried changing my own project source accordingly to incorporate my custom kernel as the load callback. The only difference from the example code is that I’m using a 2D cuFFT plan rather than the 1D in the example (I’m using cufftPlan2d(…)). I’ve done some error checking, cufftXtSetCallback() returns CUFFT_SUCCESS (0x0) but the cufftExecC2C() returns CUFFT_EXEC_FAILED (0x6) with the 2D plan I’m using. When I arrange the code so that a 1D plan is used, the execution works out fine and the load callback works.

Is the callback feature not working for 2D plans?

I’m on a Jetson TX1, CUDA 8.0

Thanks,
Burak

The docs specifically indicate that the callback feature works with 2D plans:

http://docs.nvidia.com/cuda/cufft/index.html#callback-routines

That particular doc section does indicate some limitations on using callbacks, however, such as the legal dimensions.

Hi txbob,

Yeah I saw that one and I’ve basically done the exact same procedure as in the URL I’ve included in my original post, just changing the cuFFT plan to a 2D one. This is why I’m confused.

As far as I can see the only phrase on the doc about dimension limitations is “Callback functions are not supported on transforms with a dimension size that does not factor into primes smaller than 127”. Currently I’m working on a 768x768 size so it’s 2 and 3 as prime factors, should be “legal”.

One wild guess I have is: the arguments for the callback function prototype include an offset (second argument), which sort of implies this callback would be looking for a contiguous 1D array. Maybe there is a 2D callback prototype which would take in 2 arguments for index (horizontal & vertical)?

Do you know a working example for a 2D cuFFT callback implementation that I can find online?

Thanks,
Burak

I wasn’t able to locate one on the web. But this is a fairly simple modification of the simpleCUFFT_callback sample code. It does size 50x50 2D forward transform, followed by a 2D inverse transform with a load callback that does the necessary scaling to make the output match the input.

It seems to work correctly for me:

$ cat t354.cu

/*
 * Example showing the use of CUFFT for  2D-FFT with callback.
 * This sample is derived from simpleCUFFT_callback, and uses a callback
 * function to perform the pointwise scale, on input to the
 * inverse transform.
 *
*/

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>
#include <helper_functions.h>
#include <helper_cuda.h>

// Complex data type
typedef float2 Complex;
static __device__ __host__ inline Complex ComplexScale(Complex, float);

// This is the callback routine prototype
static __device__ cufftComplex ComplexPointwiseScale(void * a, size_t index, void * cb_info, void *sharedmem);

typedef struct _cb_params{
                float scale;
                } cb_params;

// This is the callback routine. It does complex pointwise scaling.
static __device__ cufftComplex ComplexPointwiseScale(void *a, size_t index, void *cb_info, void *sharedmem)
{
        cb_params * my_params = (cb_params *)cb_info;
        return (cufftComplex)ComplexScale(((Complex *)a)[index], my_params->scale);
}

// Define the device pointer to the callback routine. The host code will fetch this and pass it to CUFFT
 __device__ cufftCallbackLoadC myOwnCallbackPtr = ComplexPointwiseScale;

#define SIGNAL_SIZE 50

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUFFT callbacks
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    printf("[CUFFT_2D_callback] is starting...\n");

// Allocate host memory for the signal
    int nD_size = SIGNAL_SIZE * SIGNAL_SIZE;
    int mem_size = sizeof(Complex) * nD_size;
    Complex *h_signal = (Complex *)malloc(mem_size);

// Initialize the memory for the signal
    for (unsigned int i = 0; i < nD_size; ++i)
    {
        h_signal[i].x = rand() / (float)RAND_MAX;
        h_signal[i].y = 0;
    }

// Allocate device memory for signal
    Complex *d_signal;
    checkCudaErrors(cudaMalloc((void **)&d_signal, mem_size));
    // Copy host memory to device
    checkCudaErrors(cudaMemcpy(d_signal, h_signal, mem_size,
                               cudaMemcpyHostToDevice));

// Create one CUFFT plan for the forward transforms, and one for the reverse transform
    // with load callback.
    cufftHandle plan, cb_plan;
    size_t work_size;
    size_t cb_work_size;

    checkCudaErrors(cufftCreate(&plan));
    checkCudaErrors(cufftCreate(&cb_plan));

    checkCudaErrors(cufftMakePlan2d(plan, SIGNAL_SIZE, SIGNAL_SIZE, CUFFT_C2C, &work_size));
    checkCudaErrors(cufftMakePlan2d(cb_plan, SIGNAL_SIZE, SIGNAL_SIZE, CUFFT_C2C, &cb_work_size));

    // Define a structure used to pass in the scale factor
    cb_params h_params;
    h_params.scale = 1.0f / nD_size;

    // Allocate device memory for parameters
    cb_params *d_params;
    checkCudaErrors(cudaMalloc((void **)&d_params, sizeof(cb_params)));

    // Copy host memory to device
    checkCudaErrors(cudaMemcpy(d_params, &h_params, sizeof(cb_params),
                               cudaMemcpyHostToDevice));

    // The host needs to get a copy of the device pointer to the callback
    cufftCallbackLoadC hostCopyOfCallbackPtr;

    checkCudaErrors(cudaMemcpyFromSymbol(&hostCopyOfCallbackPtr,
                                          myOwnCallbackPtr,
                                          sizeof(hostCopyOfCallbackPtr)));

    // Now associate the load callback with the plan.
    cufftResult status = cufftXtSetCallback(cb_plan,
                                            (void **)&hostCopyOfCallbackPtr,
                                            CUFFT_CB_LD_COMPLEX,
                                            (void **)&d_params);
    if (status == CUFFT_LICENSE_ERROR)
    {
        printf("This sample requires a valid license file.\n");
        printf("The file was either not found, out of date, or otherwise invalid.\n");
        return EXIT_WAIVED;
    }

    checkCudaErrors(cufftXtSetCallback(cb_plan,
                                       (void **)&hostCopyOfCallbackPtr,
                                       CUFFT_CB_LD_COMPLEX,
                                       (void **)&d_params));

    // Transform signal
    printf("Transforming signal cufftExecC2C\n");
    checkCudaErrors(cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD));

printf("Transforming signal back cufftExecC2C\n");
    checkCudaErrors(cufftExecC2C(cb_plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_INVERSE));

    // Copy device memory to host
    Complex *h_signal_res = (Complex *)malloc(mem_size);
    checkCudaErrors(cudaMemcpy(h_signal_res, d_signal, mem_size,
                               cudaMemcpyDeviceToHost));

    bool bTestResult = sdkCompareL2fe((float *)h_signal_res, (float *)h_signal, nD_size, 1e-5f);
    if (bTestResult) printf("results match!\n");
    else printf("mismatch!\n");
    checkCudaErrors(cufftDestroy(plan));
    checkCudaErrors(cufftDestroy(cb_plan));

    // cleanup memory
    free(h_signal);
    free(h_signal_res);
    checkCudaErrors(cudaFree(d_signal));
    checkCudaErrors(cudaFree(d_params));

    return bTestResult ? EXIT_SUCCESS : EXIT_FAILURE;
}

////////////////////////////////////////////////////////////////////////////////
// Complex operations
////////////////////////////////////////////////////////////////////////////////

// Complex scale
static __device__ __host__ inline Complex ComplexScale(Complex a, float s)
{
    Complex c;
    c.x = s * a.x;
    c.y = s * a.y;
    return c;
}

$ nvcc -I/usr/local/cuda/samples/common/inc -arch=sm_61 -o t354 t354.cu -lcufft_static -lculibos -rdc=true
$ ./t354
[CUFFT_2D_callback] is starting...
Transforming signal cufftExecC2C
Transforming signal back cufftExecC2C
results match!
$

CUDA 8.0.61, Ubuntu 14.04, Pascal Titan X

Changed question title to: cuFFT callbacks not working with large amount of host-pinned memory?

Thanks a lot for the reference snippet txbob.

I don’t have an exhaustive explanation for the problem but I have found the reason why and solved it for my case and will leave recommendations here.

It had nothing to do with the cuFFT plan being 2D rather than 1D. I had too much host-pinned memory (cudaMallocHost()) used in my code and the cuFFT callbacks somehow did not like this.

How did I determine this? For the same large amount of host-pinned memory:

  • Without the load callback assigned, the cuFFT execution call works OK.
  • With the load callback assigned, it returns a 0x6 (CUFFT_EXEC_FAILED).

I then assigned the callback to the plan and started reducing the host-pinned memory amount gradually. It just stopped returning 0x6 after some point and worked perfectly. Increasing it back again produced the same error.

I also succeeding in “breaking” txbobs working code by adding large host-pinned memory allocations (did some dummy operations on those memory locations so that they don’t get optimized/deleted away). In light of these I’ve concluded there’s some relation between cuFFT callbacks and host-pinned memory allocations. Wouldn’t dare to call it a bug, but it’s definitely some undocumented feature.

An explanation would be great if somebody knows more about this

About previous claims:

This was not the case. There were a lot of memory allocations in my code which were all host-pinned (cudaMallocHost()). I missed them earlier (remnants from copy-pastes)

I was accidentally also commenting out the above-mentioned mallocs while changing from the 2D to 1D implementation

There isn’t. The same callback format is used for 2D as well as 1D.

Burak

You’re on a Jetson. A large amount of host-pinned memory takes away from device memory available, and therefore any memory to be used by CUFFT (i.e. device memory).

If you reserve enough host-pinned memory (on Jetson), then you are reducing the amount of memory that is available for device code operations, including CUFFT. A CUFFT exec call will make a device memory allocation (temporary “workspace” allocation), in order to perform its activities.

If you reduce the amount of available memory sufficiently, the CUFFT exec call will fail.

It’s entirely possible that CUFFT callbacks increase the amount of device memory that CUFFT needs to temporarily allocate in order to perform a CUFFT exec call.

I’m reasonably sure this would not be witnessed (ie. the interaction with host-pinned memory and available device memory for CUFFT calls) on anything but a Jetson platform.

CUFFT device memory utilization is covered in some detail in the CUFFT manual, and you can take over the temporary allocation process yourself (e.g. to reuse allocations, etc.) I suspect if you did this correctly, you’d observe your own temporary allocation to fail as well, once the amount of reserved memory gets large enough. I also think you could test the theory of the CUFFT callback increasing the memory demand, by taking over the temporary “workspace” allocation process from CUFFT.