Questions about cuFFT for 3D matrix, arrayFire

Hi everyone, I’ve tried everything I could to find an answer for these few questions myself (from searching online, reading documentations to implementing and test it), but none have fully satisfied me so far. Hopefully, someone here can help me out with this. Because I’m quite new to to CUDA programming, therefore if possible, could you share any good materials relating to this topic with me?

Problem: I’m porting a matlab code (image processing) to CUDA evironment. The main task is working with 3D-matrix which involves a few popular functions like fftn(), fftshift(), ifftshift() and a loop of Conjugate Gradient. So here are my questions:

  1. How do I arrange/vectorize a 3D-matrix from matlab to use it with cuFFT? From previous discussions, I know that with 2D array, I have to use column major but none mention about 3D. Besides, I notice that the CuFFT_Lib documentation said nothing about the column major order, but the CUBLAS did.
  2. I tried to rearrange dimension of a 3D matrix in matlab. Then do the fftn() with it on matlab and cuFFT_Z2Z on CUDA. Here is the result I observed: a) fftn() always produce the same set of elements (the only difference is their position in output matrix) while cuFFT gives a different set b) The 1st element of the output matrix is the same with both cuda and matlab, regardless of dimension rearrangement. I guess not properly prepare the input array for cufft or an error in my code is the culprit for above "symptom", right?! Here is my full code for FFT3D, could you spot any error? It runs fine, but the result is not fine compared to matlab:
    // h_a is array reading from exported text of matlab
    cudaError_t PerfCuFFT(int argc, char **argv, cufftDoubleComplex *h_a){
    	cudaError_t cudaStatus;
    	cufftHandle plan;
    	cufftDoubleComplex *data;
    	cufftDoubleComplex *result;
    	cufftDoubleComplex *d_a;
    	int n[NRANK] = { NX, NY, NZ };
    
    	// Choosing CUDA device with newer architect
    	int dev = findCudaDevice(argc, (const char **)argv);
    
    	// Array to read output data from device memory
    	result = (cufftDoubleComplex*)malloc(sizeof(cufftDoubleComplex)*(NX*NY*NZ)*BATCH);
    
    	cudaMalloc((void**)&data, sizeof(cufftDoubleComplex)*(NX*NY*NZ)*BATCH);
    	if (cudaGetLastError() != cudaSuccess){
    		fprintf(stderr, "Cuda error: Failed to allocate\n");
    		goto Error;
    	}
    
    	cudaMalloc((void**)&d_a, sizeof(cufftDoubleComplex)*NX*NY*NZ);
    	if (cudaGetLastError() != cudaSuccess){
    		fprintf(stderr, "Cuda error: Failed to allocate\n");
    		goto Error;
    	}
    
    	// Copy input vectors from host memory to GPU buffers.
    	cudaStatus = cudaMemcpy(d_a, h_a, sizeof(cufftDoubleComplex)*NX*NY*NZ, cudaMemcpyHostToDevice);
    	if (cudaStatus != cudaSuccess) {
    		fprintf(stderr, "cudaMemcpy failed!");
    		goto Error;
    	}
    
    	/* Create a 3D FFT plan. */
    	if (cufftPlanMany(&plan, NRANK, n,
    		NULL, 1, NX*NY*NZ, // *inembed, istride, idist 
    		NULL, 1, NX*NY*NZ, // *onembed, ostride, odist
    		CUFFT_Z2Z, BATCH) != CUFFT_SUCCESS){
    		fprintf(stderr, "CUFFT error: Plan creation failed");
    		goto Error;
    	}
    
    	/* Use the CUFFT plan to transform the signal in place. */
    	if (cufftExecZ2Z(plan, d_a, data, CUFFT_FORWARD) != CUFFT_SUCCESS){
    		fprintf(stderr, "CUFFT error: ExecZ2Z Forward failed");
    		goto Error;
    	}
    
    	// Copy output vector from GPU buffer to host memory.
    	cudaStatus = cudaMemcpy(result, data, sizeof(cufftDoubleComplex)*(NX*NY*NZ)*BATCH, cudaMemcpyDeviceToHost);
    	if (cudaStatus != cudaSuccess) {
    		fprintf(stderr, "cudaMemcpy failed!");
    		goto Error;
    	}
    
    	cudaStatus = cudaDeviceSynchronize();
    	if (cudaStatus != cudaSuccess) {
    		fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
    		goto Error;
    	}
    
    	cudaStatus = cudaGetLastError();
    
    	// I will break and read the result from watch windows here
    Error:
    	cufftDestroy(plan);
    	cudaFree(data);
    	cudaFree(d_a);
    
    	return cudaStatus;
    }
    
  3. In case of CUFFT_C2C/CUFFTZ2Z, why do I have to allocate more space for the output of FFT (the *BATCH part in the code? Here is the example code I found from CUFFT_Lib document, section 4.4 (page 65):
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*NY*NZ*BATCH);
    
  4. Since there is no fftshift()/ifftshift() in cuFFT lib, I found a suggestion here: https://devtalk.nvidia.com/default/topic/515723/does-cuda-provide-fftshift-function-like-matlab-/ to use arrayFire. But after messing around it for few hours, I realized in order to run its samples, I have to copy a ton of old cuda dll file into its folder. It's quite inconvenient. Have someone tried this API before, can you provide your experience with it?
  5. About the solution from Abdellah for fftshift(), looks like it covers only performing fft on matrix with even number of elements of each dim, not ifft() or matrix with odd number of elements of each dim. I will try to have deeper look into his lib and see if I can expand it further. Have anyone here tried to port fftshift() and ifftshift() function from matlab to cuda c++?
  6. The cuda sample of parallelizing conjugate gradient method looks very promising, if applicable to my case, it would significantly reduce the time to proceed each frame. I haven't taken look at it yet, therefore, this is just a pure dumb question from me (just ignore it if you find it to be "dumbness overflow"! :)) ): Does conjugate gradient method have variety forms and if yes, can this example be applicable to all of them?

Thank you all so much in advanced for spending your time to support me!

To verify my problem above (and for those who would like to replicate my works), here are my sample matlab code and c++:
Matlab: matrix 5x3x4

a=[2 5 6; 9 4 6; 4 6 8; 5 0 3; 3 7 9];
a=repmat(a, [1,1,4]);
a(:,:,2) = [1 8 6; 5 8 9; 5 2 3; 4 8 0; 5 7 9];
a(:,:,3) = [5 9 8; 7 1 3; 6 8 1; 6 5 2; 7 6 9];
a(:,:,4) = [1 2 0; 6 9 5; 4 7 3; 5 7 8; 2 9 3];
dlmwrite('MyFile.txt', a, 'delimiter', '\t', 'newline', 'pc');
// To rearrange dimension, use a = permute(a,[2,1,3]); before exporting to txt file
fftn(a);

The rest of my c++ code

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>
#include <helper_functions.h>
#include <helper_cuda.h>

cudaError_t PerfCuFFT(int argc, char **argv, cufftDoubleComplex *h_a);

void ReadFile(char* MyFile, cufftDoubleComplex *array){
	FILE* fi = fopen(MyFile, "rt");
	int count = 0;
	int temp;
	while (!feof(fi)){ // && (fi!=NULL)){
		fscanf(fi, "%d", &temp);
		array[count].x = temp;
		array[count].y = 0;
		++count;
	}
	fclose(fi);
}

#define NX 5
#define NY 3
#define NZ 4
#define BATCH 10
#define NRANK 3

int main(int argc, char **argv){
	cufftDoubleComplex *Full = new cufftDoubleComplex[60];
	ReadFile("myFile.txt", Full);

	// Running FFT
	cudaError_t cudaStatus = PerfCuFFT(argc, argv, Full);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "CuFFT failed!");
		return 1;
	}

	cudaStatus = cudaDeviceReset();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaDeviceReset failed!");
		return 1;
	}

	return 0;
}
  1. CUFFT doesn’t expect column-major. CUBLAS does.
  2. data rearrangement shouldn’t be necessary, however you may need to “reverse” the sense of the X and Z dimensions of the transform (see below)
  3. the batch parameter is used (i.e. is greater than 1) when you are doing a batched transform (i.e. multiple independent transforms of the same dimensions, datatypes, and direction, launched at the same time).
    http://docs.nvidia.com/cuda/cufft/index.html#cufft-setup What you have shown here shouldn’t require any batching. You are just doing a single transform.

Note that a matrix of 5x3x4 is 5 rows, 3 columns, 4 z-planes. Therefore one might think that NX=3, NY=5, and NZ=4. If that is your thought process, you should reverse NX and NZ. (see below) (However, your choice of NX=5, NY=3, and NZ=4 are not something I can explain or rationalize).

Here’s a full test case using your matlab script run through octave, compared to the data presented in the same order in C++, with a “reversal” of NX and NZ, to demonstrate the same output:

$ cat c.cu
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>
#include <helper_functions.h>
#include <helper_cuda.h>
#include <assert.h>

#define NX 4
#define NY 5
#define NZ 3
#define BATCH 1
#define NRANK 3

cudaError_t PerfCuFFT(int argc, char **argv, cufftDoubleComplex *h_a){
        cudaError_t cudaStatus;
        cufftHandle plan;
        cufftDoubleComplex *data;
        cufftDoubleComplex *result;
        cufftDoubleComplex *d_a;
        int n[NRANK] = { NX, NY, NZ };

        // Choosing CUDA device with newer architect
        int dev = findCudaDevice(argc, (const char **)argv);

        // Array to read output data from device memory
        result = (cufftDoubleComplex*)malloc(sizeof(cufftDoubleComplex)*(NX*NY*NZ)*BATCH);

        cudaMalloc((void**)&data, sizeof(cufftDoubleComplex)*(NX*NY*NZ)*BATCH);
        if (cudaGetLastError() != cudaSuccess){
                fprintf(stderr, "Cuda error: Failed to allocate\n");
                goto Error;
        }

        cudaMalloc((void**)&d_a, sizeof(cufftDoubleComplex)*NX*NY*NZ);
        if (cudaGetLastError() != cudaSuccess){
                fprintf(stderr, "Cuda error: Failed to allocate\n");
                goto Error;
        }

        // Copy input vectors from host memory to GPU buffers.
        cudaStatus = cudaMemcpy(d_a, h_a, sizeof(cufftDoubleComplex)*NX*NY*NZ, cudaMemcpyHostToDevice);
        if (cudaStatus != cudaSuccess) {
                fprintf(stderr, "cudaMemcpy failed!");
                goto Error;
        }

        /* Create a 3D FFT plan. */
        if (cufftPlanMany(&plan, NRANK, n,
                NULL, 1, NX*NY*NZ, // *inembed, istride, idist
                NULL, 1, NX*NY*NZ, // *onembed, ostride, odist
                CUFFT_Z2Z, BATCH) != CUFFT_SUCCESS){
                fprintf(stderr, "CUFFT error: Plan creation failed");
                goto Error;
        }

        /* Use the CUFFT plan to transform the signal */
        if (cufftExecZ2Z(plan, d_a, data, CUFFT_FORWARD) != CUFFT_SUCCESS){
                fprintf(stderr, "CUFFT error: ExecZ2Z Forward failed");
                goto Error;
        }

        // Copy output vector from GPU buffer to host memory.
        cudaStatus = cudaMemcpy(result, data, sizeof(cufftDoubleComplex)*(NX*NY*NZ)*BATCH, cudaMemcpyDeviceToHost);
        if (cudaStatus != cudaSuccess) {
                fprintf(stderr, "cudaMemcpy failed!");
                goto Error;
        }

        cudaStatus = cudaDeviceSynchronize();
        if (cudaStatus != cudaSuccess) {
                fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
                goto Error;
        }

        cudaStatus = cudaGetLastError();
        if (cudaStatus != cudaSuccess) {printf("cuda error: %s\n", cudaGetErrorString(cudaStatus)); }
        for (int i = 0; i < NX; i++){
          for (int j = 0; j < NY; j++) {
            for (int k = 0; k < NZ; k++) printf("%f, %f    ", result[((NZ*NY)*i)+(NZ*j)+k].x, result[((NY*NZ)*i)+(NZ*j)+k].y);
            printf("\n");}
          printf("\n"); }
Error:
        cufftDestroy(plan);
        cudaFree(data);
        cudaFree(d_a);

        return cudaStatus;
}

int main(int argc, char **argv){
        // (should) match the order of data "presented" in the matlab script:
        float a[] = {2, 5, 6, 9, 4, 6, 4, 6, 8, 5, 0, 3, 3, 7, 9, 1, 8, 6, 5, 8, 9, 5, 2, 3, 4, 8, 0, 5, 7, 9, 5, 9, 8, 7, 1, 3, 6, 8, 1, 6, 5, 2, 7, 6, 9, 1, 2, 0, 6, 9, 5, 4, 7, 3, 5, 7, 8, 2, 9, 3};
        assert((sizeof(a)/sizeof(a[0])) == NX*NY*NZ);
        cufftDoubleComplex *Full = new cufftDoubleComplex[NX*NY*NZ] ;
        for (int i = 0; i < NX*NY*NZ; i++) {Full[i].x = a[i]; Full[i].y = 0;}
        // Running FFT
        cudaError_t cudaStatus = PerfCuFFT(argc, argv, Full);
        if (cudaStatus != cudaSuccess) {
                fprintf(stderr, "CuFFT failed!");
                return 1;
        }

        cudaStatus = cudaDeviceReset();
        if (cudaStatus != cudaSuccess) {
                fprintf(stderr, "cudaDeviceReset failed!");
                return 1;
        }

        return 0;
}
$ nvcc -arch=sm_60 -o c c.cu -lcufft -I/usr/local/cuda/samples/common/inc
$ cat my_fft.m
a=[2 5 6; 9 4 6; 4 6 8; 5 0 3; 3 7 9];
a=repmat(a, [1,1,4]);
a(:,:,2) = [1 8 6; 5 8 9; 5 2 3; 4 8 0; 5 7 9];
a(:,:,3) = [5 9 8; 7 1 3; 6 8 1; 6 5 2; 7 6 9];
a(:,:,4) = [1 2 0; 6 9 5; 4 7 3; 5 7 8; 2 9 3];
fftn(a)
$ octave -q my_fft.m
ans =

ans(:,:,1) =

   311.0000 +   0.0000i   -17.5000 -  14.7224i   -17.5000 +  14.7224i
     9.7426 +   1.4531i   -18.8127 -   6.5301i   -17.7947 -  21.6913i
   -32.7426 +   6.1554i    -4.6227 -  22.2007i    -6.2699 -   4.4414i
   -32.7426 -   6.1554i    -6.2699 +   4.4414i    -4.6227 +  22.2007i
     9.7426 -   1.4531i   -17.7947 +  21.6913i   -18.8127 +   6.5301i

ans(:,:,2) =

   -6.0000 -  9.0000i   -1.2058 + 15.3564i  -16.7942 - 12.3564i
   -3.4327 - 39.6533i   12.5052 -  2.4789i   -4.7594 + 24.9224i
   -6.9200 +  0.1322i   -5.7642 + 15.1493i   18.5169 - 23.8607i
  -20.4063 -  2.1535i   -7.8570 +  3.6093i    8.5765 + 16.8316i
   -8.2410 -  9.3255i    2.3218 +  7.0241i   -5.5397 + 15.8029i

ans(:,:,3) =

    9.0000 +  0.0000i   19.5000 + 21.6506i   19.5000 - 21.6506i
   16.7639 +  5.3633i   -7.0429 -  0.4103i    5.4143 -  8.8960i
   21.2361 + 29.6013i   -5.0410 - 24.3985i  -17.8303 - 11.5827i
   21.2361 - 29.6013i  -17.8303 + 11.5827i   -5.0410 + 24.3985i
   16.7639 -  5.3633i    5.4143 +  8.8960i   -7.0429 +  0.4103i

ans(:,:,4) =

   -6.0000 +  9.0000i  -16.7942 + 12.3564i   -1.2058 - 15.3564i
   -8.2410 +  9.3255i   -5.5397 - 15.8029i    2.3218 -  7.0241i
  -20.4063 +  2.1535i    8.5765 - 16.8316i   -7.8570 -  3.6093i
   -6.9200 -  0.1322i   18.5169 + 23.8607i   -5.7642 - 15.1493i
   -3.4327 + 39.6533i   -4.7594 - 24.9224i   12.5052 +  2.4789i

$ ./c
GPU Device 0: "Tesla P100-PCIE-16GB" with compute capability 6.0

311.000000, 0.000000    -17.500000, -14.722432    -17.500000, 14.722432
9.742646, 1.453085    -18.812732, -6.530142    -17.794658, -21.691283
-32.742646, 6.155367    -4.622665, -22.200656    -6.269944, -4.441438
-32.742646, -6.155367    -6.269944, 4.441438    -4.622665, 22.200656
9.742646, -1.453085    -17.794658, 21.691283    -18.812732, 6.530142

-6.000000, -9.000000    -1.205771, 15.356406    -16.794229, -12.356406
-3.432739, -39.653261    12.505161, -2.478856    -4.759404, 24.922353
-6.919967, 0.132171    -5.764188, 15.149300    18.516866, -23.860674
-20.406271, -2.153457    -7.857037, 3.609302    8.576496, 16.831562
-8.241023, -9.325453    2.321836, 7.024102    -5.539729, 15.802911

9.000000, 0.000000    19.500000, 21.650635    19.500000, -21.650635
16.763932, 5.363312    -7.042949, -0.410327    5.414272, -8.895968
21.236068, 29.601265    -5.040983, -24.398457    -17.830339, -11.582689
21.236068, -29.601265    -17.830339, 11.582689    -5.040983, 24.398457
16.763932, -5.363312    5.414272, 8.895968    -7.042949, 0.410327

-6.000000, 9.000000    -16.794229, 12.356406    -1.205771, -15.356406
-8.241023, 9.325453    -5.539729, -15.802911    2.321836, -7.024102
-20.406271, 2.153457    8.576496, -16.831562    -7.857037, -3.609302
-6.919967, -0.132171    18.516866, 23.860674    -5.764188, -15.149300
-3.432739, 39.653261    -4.759404, -24.922353    12.505161, 2.478856

$

Regarding the “reversal” of NX and NZ, consider:

  1. 3D sample code:
    http://docs.nvidia.com/cuda/cufft/index.html#threed-complex-to-complex-xt-transforms
result = cufftMakePlan3d (plan_input, nz, ny, nx, CUFFT_C2C, worksize);
                                      ^^      ^^
  1. Description of NX, NY, NZ parameters for 3D transform:

http://docs.nvidia.com/cuda/cufft/index.html#function-cufftmakeplan3d

nx The transform size in the x dimension. This is slowest changing dimension of a transform (strided in memory). For multiple GPUs, this must be factorable into primes less than or equal to 127.

ny The transform size in the y dimension. For multiple GPUs, this must be factorable into primes less than or equal to 127.

nz The transform size in the z dimension. This is fastest changing dimension of a transform (contiguous in memory). For multiple GPUs, this must be factorable into primes less than or equal to 127.

  1. This is consistent with the indexing shown for the data layout:

http://docs.nvidia.com/cuda/cufft/index.html#advanced-data-layout

3D

input[b * idist + ((x * inembed[1] + y) * inembed[2] + z) * istride]

output[b * odist + ((x * onembed[1] + y) * onembed[2] + z) * ostride]

1 Like

Hi txbob, thanks so much for your help! Your reply contains very rich of information and is exactly what I’m looking for.

  1. Cleared! Maybe because those discussions I found only focus on 2D array, therefore, people over there always found a solution by switching 2 dimension and thought that it has something to do with row-column major.
  2. My thought here was simple: 5x3x4 is 5 rows (NX), 3 columns (NY), 4 z-planes (NZ). However, from the way you sort the matrix, I realize ny is the fastest changing dimension here: a(Y1, X1, Z1) -> a(Y2, X1, Z1) -> ... Maybe there is something wrong with my view/perspective/understanding of matrix dimension. It confused me since the start with matlab code! :))
  3. I've not touch anything related to BATCH processing/programming yet, and I cannot find any materials regarding to it either (just simply google "Batch CUDA programming" wouldn't help). Could you show me where to find a description/tutorial/example about it?

Popup questions

  1. From the section 2.4 and 2.5 (Data Layout & Multidimensional Transforms), it looks like one has to manually copy the rest half of the output data in case of R2C transform, right? Because I cannot find an examle code to do it, I have to use the C2C/Z2Z transform to avoid it, which would double the workload. Can you provide the format to copy array in case of 1/2/3D?
  2. The provided function findCudaDevice(), seems to favor newer architect instead of higher performance GPU. My HP Z640 has Quadro K620 and K4200 in it. And this function always choose the K620 over the K4200 to which I found quite funny! :)). Should I give this feedback to nvidia dev team?

For batch cufft example, do a google search on “batch cufft example”. I did that and found plenty of good material in the first 5 hits from google.

Here is an example discussing 1D and 2D R2C transform:
https://devtalk.nvidia.com/default/topic/826819/2d-cufft-wrong-result/

You can file bugs as a registered developer at http://developer.nvidia.com

✔ I was fancying it to much while a simple phrase would do a better job! :))

✔ Even though it is used for displaying the result, I can still convert it to matrix copy task. Do you prefer this “symmetry method” or the full matrix transforming one? The copy task seems to take quite sometime but would it still be faster?

✔ I will
Now, waiting for someone who have experience with the rest of my questions. I’m thinking of writing my own library base on the works of Marwan Abdellah here: https://www.researchgate.net/publication/278847958_CufftShift_High_performance_CUDA-accelerated_FFT-shift_library
His fftshift for CUDA “is limited to operate only on 1D arrays of even sizes and 2D/3D arrays with power-of-two sizes and unified dimensionality”.
Again, thank you for your great help, txbob!