Hi everyone, I’ve tried everything I could to find an answer for these few questions myself (from searching online, reading documentations to implementing and test it), but none have fully satisfied me so far. Hopefully, someone here can help me out with this. Because I’m quite new to to CUDA programming, therefore if possible, could you share any good materials relating to this topic with me?

Problem: I’m porting a matlab code (image processing) to CUDA evironment. The main task is working with 3D-matrix which involves a few popular functions like fftn(), fftshift(), ifftshift() and a loop of Conjugate Gradient. So here are my questions:

- How do I arrange/vectorize a 3D-matrix from matlab to use it with cuFFT? From previous discussions, I know that with 2D array, I have to use column major but none mention about 3D. Besides, I notice that the CuFFT_Lib documentation said nothing about the column major order, but the CUBLAS did.
- I tried to rearrange dimension of a 3D matrix in matlab. Then do the fftn() with it on matlab and cuFFT_Z2Z on CUDA. Here is the result I observed:
a) fftn() always produce the same set of elements (the only difference is their position in output matrix) while cuFFT gives a different set
b) The 1st element of the output matrix is the same with both cuda and matlab, regardless of dimension rearrangement.
I guess not properly prepare the input array for cufft or an error in my code is the culprit for above "symptom", right?! Here is my full code for FFT3D, could you spot any error? It runs fine, but the result is not fine compared to matlab:
`// h_a is array reading from exported text of matlab cudaError_t PerfCuFFT(int argc, char **argv, cufftDoubleComplex *h_a){ cudaError_t cudaStatus; cufftHandle plan; cufftDoubleComplex *data; cufftDoubleComplex *result; cufftDoubleComplex *d_a; int n[NRANK] = { NX, NY, NZ }; // Choosing CUDA device with newer architect int dev = findCudaDevice(argc, (const char **)argv); // Array to read output data from device memory result = (cufftDoubleComplex*)malloc(sizeof(cufftDoubleComplex)*(NX*NY*NZ)*BATCH); cudaMalloc((void**)&data, sizeof(cufftDoubleComplex)*(NX*NY*NZ)*BATCH); if (cudaGetLastError() != cudaSuccess){ fprintf(stderr, "Cuda error: Failed to allocate\n"); goto Error; } cudaMalloc((void**)&d_a, sizeof(cufftDoubleComplex)*NX*NY*NZ); if (cudaGetLastError() != cudaSuccess){ fprintf(stderr, "Cuda error: Failed to allocate\n"); goto Error; } // Copy input vectors from host memory to GPU buffers. cudaStatus = cudaMemcpy(d_a, h_a, sizeof(cufftDoubleComplex)*NX*NY*NZ, cudaMemcpyHostToDevice); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMemcpy failed!"); goto Error; } /* Create a 3D FFT plan. */ if (cufftPlanMany(&plan, NRANK, n, NULL, 1, NX*NY*NZ, // *inembed, istride, idist NULL, 1, NX*NY*NZ, // *onembed, ostride, odist CUFFT_Z2Z, BATCH) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: Plan creation failed"); goto Error; } /* Use the CUFFT plan to transform the signal in place. */ if (cufftExecZ2Z(plan, d_a, data, CUFFT_FORWARD) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: ExecZ2Z Forward failed"); goto Error; } // Copy output vector from GPU buffer to host memory. cudaStatus = cudaMemcpy(result, data, sizeof(cufftDoubleComplex)*(NX*NY*NZ)*BATCH, cudaMemcpyDeviceToHost); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMemcpy failed!"); goto Error; } cudaStatus = cudaDeviceSynchronize(); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus); goto Error; } cudaStatus = cudaGetLastError(); // I will break and read the result from watch windows here Error: cufftDestroy(plan); cudaFree(data); cudaFree(d_a); return cudaStatus; }`

- In case of CUFFT_C2C/CUFFTZ2Z, why do I have to allocate more space for the output of FFT (the
***BATCH**part in the code? Here is the example code I found from CUFFT_Lib document, section 4.4 (page 65):`cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*NY*NZ*BATCH);`

- Since there is no fftshift()/ifftshift() in cuFFT lib, I found a suggestion here: https://devtalk.nvidia.com/default/topic/515723/does-cuda-provide-fftshift-function-like-matlab-/ to use arrayFire. But after messing around it for few hours, I realized in order to run its samples, I have to copy a ton of old cuda dll file into its folder. It's quite inconvenient. Have someone tried this API before, can you provide your experience with it?
- About the solution from Abdellah for fftshift(), looks like it covers only performing fft on matrix with even number of elements of each dim, not ifft() or matrix with odd number of elements of each dim. I will try to have deeper look into his lib and see if I can expand it further. Have anyone here tried to port fftshift() and ifftshift() function from matlab to cuda c++?
- The cuda sample of parallelizing conjugate gradient method looks very promising, if applicable to my case, it would significantly reduce the time to proceed each frame. I haven't taken look at it yet, therefore, this is just a pure dumb question from me (just ignore it if you find it to be "dumbness overflow"! :)) ): Does conjugate gradient method have variety forms and if yes, can this example be applicable to all of them?

Thank you all so much in advanced for spending your time to support me!

To verify my problem above (and for those who would like to replicate my works), here are my sample matlab code and c++:

Matlab: matrix 5x3x4

```
a=[2 5 6; 9 4 6; 4 6 8; 5 0 3; 3 7 9];
a=repmat(a, [1,1,4]);
a(:,:,2) = [1 8 6; 5 8 9; 5 2 3; 4 8 0; 5 7 9];
a(:,:,3) = [5 9 8; 7 1 3; 6 8 1; 6 5 2; 7 6 9];
a(:,:,4) = [1 2 0; 6 9 5; 4 7 3; 5 7 8; 2 9 3];
dlmwrite('MyFile.txt', a, 'delimiter', '\t', 'newline', 'pc');
// To rearrange dimension, use a = permute(a,[2,1,3]); before exporting to txt file
fftn(a);
```

The rest of my c++ code

```
// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
// includes, project
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>
#include <helper_functions.h>
#include <helper_cuda.h>
cudaError_t PerfCuFFT(int argc, char **argv, cufftDoubleComplex *h_a);
void ReadFile(char* MyFile, cufftDoubleComplex *array){
FILE* fi = fopen(MyFile, "rt");
int count = 0;
int temp;
while (!feof(fi)){ // && (fi!=NULL)){
fscanf(fi, "%d", &temp);
array[count].x = temp;
array[count].y = 0;
++count;
}
fclose(fi);
}
#define NX 5
#define NY 3
#define NZ 4
#define BATCH 10
#define NRANK 3
int main(int argc, char **argv){
cufftDoubleComplex *Full = new cufftDoubleComplex[60];
ReadFile("myFile.txt", Full);
// Running FFT
cudaError_t cudaStatus = PerfCuFFT(argc, argv, Full);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "CuFFT failed!");
return 1;
}
cudaStatus = cudaDeviceReset();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceReset failed!");
return 1;
}
return 0;
}
```