NukadaFFT library

@avidday & @nukada: thanks for the suggestions.

I get the following error when I run the executable:

error while loading shared libraries: libnufft.so: cannot open shared object file: No such file or directory

even though I have added the include directory and linked with -lnufft.

My build script looks like this:

nvcc -g -G -pg -D_DEBUG -o ../obj/ao76_fft8_batch50 ../src/ao76_fft8_batch50.cu \
--host-compilation C -arch sm_13 \
--ptxas-options=-v \
-I/usr/local/cuda/include \
-L/usr/local/cuda/lib64 -lcuda -lcudart \
-I/home/vivekv/CUDA_3.1/NukadaFFT-1.0/include \
-L/home/vivekv/CUDA_3.1/NukadaFFT-1.0/lib64 -lnufft \
-I/home/vivekv/NVIDIA_GPU_Computing_SDK/C/common/inc/ \
-L/home/vivekv/NVIDIA_GPU_Computing_SDK/C/lib/ -lcutil_x86_64 \
-I/usr/include/ -L/usr/lib64/ -lm -lfftw3

Thank you! You are amazing!

The library obviously isn’t installed where you think it is. In the original distribution tree, this works for me:

avidday@cuda:~/build/NukadaFFT-1.0/sample/runtime$ nvcc -arch=sm_20 -I../../include -L../../lib64/ runtime.cu -lnufft -lcufft -lcuda
avidday@cuda:~/build/NukadaFFT-1.0/sample/runtime$ ldd a.out
	linux-vdso.so.1 =>  (0x00007fff5dfff000)
	libnufft.so => not found
	libcufft.so.3 => /opt/cuda-3.0/lib64/libcufft.so.3 (0x00007f76ddca7000)
	libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007f76dd251000)
	libcudart.so.3 => /opt/cuda-3.0/lib64/libcudart.so.3 (0x00007f76dd016000)
	libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f76dcd09000)
	libm.so.6 => /lib/libm.so.6 (0x00007f76dca84000)
	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007f76dc86c000)
	libc.so.6 => /lib/libc.so.6 (0x00007f76dc4fa000)
	libdl.so.2 => /lib/libdl.so.2 (0x00007f76dc2f6000)
	libpthread.so.0 => /lib/libpthread.so.0 (0x00007f76dc0da000)
	libz.so.1 => /lib/libz.so.1 (0x00007f76dbec2000)
	librt.so.1 => /lib/librt.so.1 (0x00007f76dbcba000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f76deb61000)
avidday@cuda:~/build/NukadaFFT-1.0/sample/runtime$ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:../../lib64 ./a.out 256b1024
Batched 1-D FFT: 256, batch = 1024
Total: 0.878848 msec, 23.862510 GFLOPS.
On-board: 0.058272 msec, 179.945084 GFLOPS.
Max error = 2.384186e-07.

@avidday: Thanks for that info!

After using the Nukada FFT library with CUDA 3.1 for a 400-point transform with a batch size of 50, my timing according to the profile info is:

Method      #calls   GPU Time (us)   %GPU time
inversey0        1          44.416        5.68
forwardy0        2          88.288       11.29
inversex         1         180.448       23.08
forwardx         2         360.576       46.12
----------------------------------------------
                           673.668 us

Using the CUFFT library with CUDA 3.1:

Method                    #calls   GPU Time (us)   %GPU time
SP_c2c_mradix_sp_kernel       51         681.632       98.86

Hi,
I'm interested in this library. It is supposed to have Windows support, but I can only find .so files, not .lib files or DLLs. Do you have a timeline for a Windows release? Will it ship Visual Studio libraries or Cygwin/MinGW ones?
Since I have moved to CUDA 3.2, I hope the Windows libraries will support 3.2…
Also, can we expect a build for Mac OS as well?
Sorry for asking too much…

If the profile info comes from a second run, i.e. the tuning results are already stored in the database file, then you called one forward transform in addition to creating the plan, which internally calls one forward and one backward transform.

By the way, the code performs a 2-D FFT, since forwardy0 is a CUDA kernel for dimension Y.

The library will be available for Windows; however, the timeline is unknown.

I'm planning to use Visual Studio 2010 Pro if it is supported by CUDA.

The code has already been updated for CUDA 3.2, but it still has a problem with the latest driver.

It will be available after the updated driver is released.

Can you please elaborate? This is how I call my batched 2-D FFT. Did I miss something?

nufft_plan plan_forward1;
nufftPlan2d(&plan_forward1, pix1, pix2, n, in1_d, in1_d, f1_d, f1_d, NUFFT_D2D);
nufftExec(plan_forward1, in1_d, in1_d, f1_d, f1_d, NUFFT_FORWARD);
nufftDestroy(plan_forward1);

Since you wrote ‘a 400 point transform with batch size of 50’, I thought it was a 400-point batched 1-D FFT…

Maybe the second dimension ‘pix2’ is very small… Otherwise the results will be wrong.

The planning and exec functions take four buffer arguments.

They should be (inout, inout, work1, work2) or (in, out, work1, work2).

work2 can be the same as out. work1 can be the same as in or inout.

Also, work1 and work2 should be in device memory for high performance.

Good news…

Visual Studio 2010 is doable, but not out of the box: you must install the Windows 7 SDK, which installs the VC9 compilers, and then set the project to use those compilers…

See the Nsight 1.5 KB notes…

I am sorry for the confusion: pix1 = 20, pix2 = 20, n = 50 (batch size), and in1_d (input) and f1_d (FFT output) are device arrays of size pix1*pix2*n.

The values match, but I would like to see if there’s an improvement in speed :)

nufft_plan plan_forward1;
nufftPlan2d(&plan_forward1, pix1, pix2, n, in1_d, in1_d, f1_d, f1_d, NUFFT_D2D);
nufftExec(plan_forward1, in1_d, in1_d, f1_d, f1_d, NUFFT_FORWARD);
nufftDestroy(plan_forward1);