I’m having a problem with 3D inverse DFTs using the C2R transform. If I use the full C2C transform and take only the real values, I get the expected results. However, if I use packed (n1 x n2 x (n3 / 2 + 1)) complex input and the C2R transform, I get incorrect results, and there are signs that out-of-bounds memory is being accessed (no segmentation violations, but other somewhat random behavior).
Specifically, I’m calling an inverse transform whose input is the complex transform of a 4x4x4 volume in which each row of each plane contains [0 .707 1 .707]. I’ve verified that the packed, interleaved input appears correct, at least according to my understanding of the cuFFT and FFTW documentation. With the C2C version, the test1 program below produces exactly the expected result:
Output:
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
With the C2R version, however, I get:
Output:
-0.2 0.5 0.8 0.5
0.0 0.7 1.0 0.7
0.2 1.0 1.2 1.0
0.0 0.7 1.0 0.7
-0.2 0.5 0.8 0.5
0.0 0.7 1.0 0.7
0.2 1.0 1.2 1.0
0.0 0.7 1.0 0.7
-0.2 0.5 0.8 0.5
0.0 0.7 1.0 0.7
0.2 1.0 1.2 1.0
0.0 0.7 1.0 0.7
-0.2 0.5 0.8 0.5
0.0 0.7 1.0 0.7
0.2 1.0 1.2 1.0
0.0 0.7 1.0 0.7
Notice that the even-numbered rows (counting from 1) are correct, but the odd-numbered ones are not. The same results are obtained using either in-place or out-of-place transforms. (The code supplied below is for the in-place case.)
This is using CUDA 4.1 on RHEL6-64 6.2 (2.6.32-220.4.2.el6.x86_64) with gcc 4.3.4 or 4.4.5 (both behave identically).
The system is an HP Pavilion with 12 GB of RAM, a 3.2 GHz 6-core i7 (970) processor, and a GT420. (Yes, I know that’s an underpowered card. I just haven’t gotten around to installing a better one yet.) The driver version is 295.20, although for various reasons our IT folks insist on installing via dkms rather than running the NVIDIA installer. So far, I haven’t seen any issues caused by this. I’m using a 3D grid, so I always compile with arch=compute_20 code=sm_20.
I’ve attached files to reproduce both of these cases. File test1.cpp is the common driver, while gpufft3C2R.cu and gpufft3C2C.cu are the bad and good CUDA sources, respectively. I’d appreciate it greatly if anyone can point out whether I’ve made some error in the way I’m calling the C2R transform, or whether this suggests a bug in either the cuFFT library or the GT420 driver.
Thanks in advance!
gpufft3C2C.cu (4.26 KB)
gpufft3C2R.cu (4.56 KB)
test1.cpp (1.17 KB)