cufftComplex Data Issues

Hello,

I am using the cuFFT library to perform a real-to-complex 2D FFT on an image. I am having a hard time understanding how data of the cufftComplex type is stored once the FFT is complete, and why I am having difficulty accessing values from it. I have read the NVIDIA cuFFT documentation and looked at previous forum posts, but no luck.

  1. I am aware the real and imaginary parts are stored in transform.x and transform.y.

  2. Is the transform data stored in a 1D array even though my input data is a 2D array? When I try to index a value by transform[i][j].x I receive the error “operand types are: cufftComplex [ int ]” , which tells me that the transform data is possibly stored in a 1D array regardless of the input dimensions.

  3. When I try to reference a value using transform[i].x I receive a “Bus error (core dumped)” when I run my program.

  4. I am using cudaMallocManaged for my input image and my transform data.

If you’d like to see the source code to understand further, let me know, thank you.

First of all, proper 2D R2C or C2R transforms are not trivial to set up, IMO. The combination of 2-dimensional data with the particular “symmetry” patterns imposed by R2C and C2R takes some careful thought to sort out. This post might be worth some study:

Barring that, from an understanding point, it is often far easier if you just use C2C transforms for both the forward and reverse transform cases, since it is quite a bit easier to understand the data setup. There is some ambiguity in your posting, so maybe you are doing this already.

A doubly-subscripted array access on dynamically allocated data in C or C++ normally requires a double pointer, i.e. your transform variable should start with a definition like

cufftComplex **transform;

I don’t recommend this approach at all, but if you started with that, and did the appropriate sequence of allocation steps, you could then double-dereference the pointer e.g.

transform[i][j]

The compiler here is effectively telling you this. Your confusion here is a lack of understanding of C programming, not anything specific to CUDA or CUFFT. Having said that, CUFFT expects all forms of input and output data pointers to be only single pointers (study the cufft documentation for the relevant exec function prototypes), e.g.

cufftComplex *transform;

which is probably what you have somewhere in your code, and the compiler response confirms that. This is the right way to go, but you won’t be able to easily do

transform[i][j]

with a single pointer.
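
With a single pointer, the usual workaround is to compute the flat offset by hand. Here is a minimal host-side C++ sketch of that idea, assuming row-major storage (the last dimension changes fastest, which is the layout cuFFT uses for multi-dimensional data). Complex is a hypothetical stand-in for cufftComplex so the example compiles without the CUDA headers, and flat_index and at are illustrative helper names, not part of cuFFT.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cufftComplex (which has float members x and y).
struct Complex {
    float x;  // real part
    float y;  // imaginary part
};

// Map a 2D (row, col) coordinate onto a flat, row-major buffer.
inline std::size_t flat_index(std::size_t row, std::size_t col,
                              std::size_t rows, std::size_t cols) {
    assert(row < rows && col < cols);  // catch out-of-range indices early
    return row * cols + col;
}

// Access element (row, col) through a single pointer.
inline Complex& at(Complex* data, std::size_t row, std::size_t col,
                   std::size_t rows, std::size_t cols) {
    return data[flat_index(row, col, rows, cols)];
}
```

With these helpers, at(transform, i, j, rows, cols).x plays the role of transform[i][j].x. For an R2C output, cols would be the reduced output width rather than the input width.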

Indexing with transform[i].x is almost certainly the correct access method (see above), so you likely have a bug elsewhere in your code. One possibility is that the computed i value is simply out of range for the array. For another possibility, see below.

I’m imagining that the transform variable is your output data, and you are getting the bus error after you run the CUFFT calls, when trying to inspect the output. In a pre-Pascal Unified Memory (UM) regime, you cannot access a managed pointer from host code after any kernel call (including those launched by CUFFT calls) until you explicitly (in your own code) perform:

cudaDeviceSynchronize();

If you fail to do that prior to the host access to the data, you will get that bus error fault. This is of course just a guess/speculation. Alternatively, if you think it may be the source of a bug in your code, you could just get things working without UM/managed memory first.

Also note that there are various cufft sample codes available, including some that demonstrate 2D transforms:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-cufft

Thanks for the help, Bob.

You’ve made some interesting points regarding what type of transform to use. Using C2C does seem like the way to go in terms of simplicity, although implementing this in a “real-time” fashion seems difficult, since I would have to create a new 2D array of float2 data from a regular float array each time a C2C transform is performed.

I created a CUDA kernel that does this data type conversion, although it proved fairly slow at ~50ms. This time included the time it took to allocate memory on the GPU, which may make up the bulk of it, but I am not sure; I will test this tomorrow. Allocating memory only has to take place once, correct? Therefore the conversion may be feasible to implement in real time with a camera.

Also, do you have any thoughts on performing a 2D R2C with an odd number of rows? If I decide to stick with the R2C FFT, I will compute the magnitude of the complex array after the first FFT, and then perform another R2C FFT for the project I am working on. The issue is that after the first FFT the dimensions of the array are NX and NY/2+1. Do you foresee any solution to this?

Lastly, with your experience do you have any thoughts on Unified Memory compared to allocating memory specifically for a project that performs FFT on incoming images from a camera? Would like to hear your opinion if one offers any clear advantages over the other for this type of work.

My suggestion to use C2C was based on the idea of simplifying the problem. When something is “not working” it may help to rule things out programmatically. Once you have things “working” then you can switch back to using R2C if you wish, and then if things break you have only one thing to debug and you know where to focus.

You don’t want to be allocating memory on a GPU during a time-critical processing sequence. Allocate once, what you need, then reuse the allocations. This should certainly be possible with a sequential video processing loop.
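
As a rough illustration of the allocate-once pattern, here is a host-only C++ sketch of the float to complex conversion with a reused buffer. It is an analogy rather than GPU code: on the device the same shape would be one allocation before the loop plus a conversion kernel per frame. Complex is a hypothetical stand-in for float2/cufftComplex, and FrameConverter is a made-up name for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for float2 / cufftComplex.
struct Complex {
    float x;  // real part
    float y;  // imaginary part
};

// Owns a complex buffer that is allocated once and reused for every frame.
class FrameConverter {
public:
    explicit FrameConverter(std::size_t pixel_count) : buffer_(pixel_count) {}

    // Copy real pixels into the preallocated buffer with zero imaginary part.
    // No allocation happens here, so it can be called once per frame.
    const std::vector<Complex>& convert(const std::vector<float>& frame) {
        assert(frame.size() == buffer_.size());  // frame size must not change
        for (std::size_t i = 0; i < frame.size(); ++i) {
            buffer_[i].x = frame[i];
            buffer_[i].y = 0.0f;
        }
        return buffer_;
    }

private:
    std::vector<Complex> buffer_;  // allocated in the constructor only
};
```

The converter would be constructed once before the video loop, with convert called per frame. Timing only the per-frame call separates the conversion cost from the one-time allocation cost.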

Performing 2D R2C with an odd number of rows should be fine. The R2C symmetry issue is a per-row situation in the 2D case, if you read the other forum post I linked. I don’t know of any restrictions on the number of rows in a 2D CUFFT transform.
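
The size bookkeeping can be sketched as follows, assuming a real input of nx rows by ny columns stored row-major with ny as the fastest-changing dimension, which matches the NX by NY/2+1 output shape mentioned earlier in the thread. The function names are made up for illustration.

```cpp
#include <cstddef>

// Width of each output row after an R2C transform of ny real samples:
// only the non-redundant half of the spectrum is kept.
constexpr std::size_t r2c_row_width(std::size_t ny) {
    return ny / 2 + 1;  // integer division, valid for even and odd ny
}

// Total number of complex elements in the R2C output of an nx-by-ny image.
constexpr std::size_t r2c_output_count(std::size_t nx, std::size_t ny) {
    return nx * r2c_row_width(ny);
}
```

The row count nx passes through unchanged, so an odd number of rows needs no special handling. Only the row width is reduced, and the integer division covers odd widths too.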

Unified memory should work ok. Whether or not it is the fastest possible approach would depend a lot on the details of your actual case. Unified Memory is not normally something that makes code run faster, but is a productivity tool to allow the programmer to get from point A to point B with less work. The “best case” for UM would normally be what the programmer can achieve using intelligently placed explicit data copy functions - not better than that normally.