System information: 3.6.2 |Anaconda custom (64-bit)| (default, Aug 15 2017, 11:34:02) [MSC v.1900 64 bit (AMD64)]
Hardware: GPU - NVIDIA GeForce GTX 1050 with 2 GB RAM
CPU - Intel Core i7-7700HQ, 16 GB RAM

from numba import jit, cuda
import tifffile as tf
import numpy as np
from pyculib.fft.binding import Plan, CUFFT_C2C
#set parameter for matrix
nx = 512
ny = 512
nz = 390
kx = 0.2
ky = 0.2
kz = 0.1
#create matrix
x = np.arange(2*nx)
y = np.arange(2*ny)
z = np.arange(2*nz)
zv, yv, xv = np.meshgrid(z,y, x)
@jit(nopython=True, parallel=True)
def f(z,x,y):
    a = np.exp(2j*np.pi*(kz*z+kx*x+ky*y)).astype(np.complex64)
    return a
data = f(zv,xv,yv)
#perform fft
orig = data.copy()
d_data = cuda.to_device(data)
fftplan = Plan.three(CUFFT_C2C, *data.shape)
fftplan.forward(d_data, d_data)
d_data.copy_to_host(data)

Creating the matrix takes a very long time, even with jit.
The FFT also fails with the error below:

Call to cuMemcpyDtoH results in CUDA_ERROR_LAUNCH_FAILED

Is there any way to optimize the matrix creation to make it faster?
As for the FFT error, does it mean that the GPU can compute the FFT but doesn't have enough RAM to copy the result back to the host? Any solutions or suggestions?

It seems that the matrix creation happens on the host, which means you are looking at a Python performance issue, not a CUDA performance issue. Is that correct?

What’s the total size of the matrix (in GB), and how long (in milliseconds) does it take to create? The way I read the code, you have a matrix of 102,236,160 double-precision complex elements, so 1.52 GB all together. This matrix is initialized by an element-wise computation:

exp(2j*pi*(kz*z+kx*x+ky*y))

The computational density of this computation is low, and the matrix should be filled in as fast as memory can be written. Depending on what kind of system memory configuration your host system has (for now I am not going to look up the exact specs of your CPU), it provides anywhere from 25 GB/sec to 65 GB/sec throughput, so you are looking at 60 milliseconds or less.

Transferring the data to the GPU will happen at PCIe gen3 x16 speeds (assuming your system is configured correctly), which has a transfer rate of about 11-12 GB/sec for large transfers. So that would take about 140 ms. Altogether we would therefore expect that creation and transfer of the data to the GPU takes on the order of 200 milliseconds.
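The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the arithmetic: the 25 GB/sec host rate is the low end of the range quoted above, and the 11.5 GB/sec PCIe rate is simply the midpoint of the quoted 11-12 GB/sec, not a measurement of this machine.

```python
def seconds(n_bytes, bytes_per_second):
    """Time to move n_bytes at a sustained rate."""
    return n_bytes / bytes_per_second

# 102,236,160 double-precision complex elements, 16 bytes each
matrix_bytes = 512 * 512 * 390 * 16

host_fill_s = seconds(matrix_bytes, 25e9)    # assumed: low end of host memory range
pcie_copy_s = seconds(matrix_bytes, 11.5e9)  # assumed: midpoint of 11-12 GB/sec

print(f"fill : {host_fill_s * 1e3:.0f} ms")
print(f"copy : {pcie_copy_s * 1e3:.0f} ms")
print(f"total: {(host_fill_s + pcie_copy_s) * 1e3:.0f} ms")
```

Any measured creation time far above these figures points at host computation or memory pressure rather than at memory bandwidth or the PCIe link.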

Generating the matrix data directly on the GPU would likely be a superior approach. I see that you are using Numba, which has some GPU support from what I understand (I have never used it). The GTX 1050 has only about 60 GFLOPS (double precision), so the problem becomes compute limited, with an estimated initialization rate of maybe 2 billion elements per second, requiring about 50 milliseconds to initialize 102M elements.

The FFTs you are doing may be too big for the 2 GB of memory on the GTX 1050. I would suggest starting with smaller sizes.

I definitely want to try CUDA for creating the matrix; it should be faster than Python with parallel evaluation of the function over the matrix. I think the matrix is 512*512*390 (x*y*z) in complex number format, so is it larger than 1.52 GB? When I tested, creating that matrix took 1 min 42 seconds, which is the first thing I want to optimize.
For the FFT, I agree that it could be a hardware limit, but I am wondering whether there is a way to make it possible for my machine to compute it.

512*512*390 = 102,236,160 elements. Each element comprises 16 bytes (double-precision complex). Total of 1,635,778,560 bytes = 1.5234 GB
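The same arithmetic as a small helper, which makes it easy to try other shapes and element types. One caveat, read from the posted code rather than measured: it builds its axes with np.arange(2*nx), np.arange(2*ny) and np.arange(2*nz) and casts to complex64, so taken literally the array is 1024 x 1024 x 780 elements of 8 bytes each. The helper name array_gib and the ITEMSIZE table are illustrative.

```python
import math

# Bytes per element for the two complex types discussed in this thread
ITEMSIZE = {"complex64": 8, "complex128": 16}

def array_gib(shape, dtype_name):
    """Size of a dense array in units of 2**30 bytes."""
    return math.prod(shape) * ITEMSIZE[dtype_name] / 2**30

# Estimate used in this thread: 512*512*390 double-precision complex
as_estimated = array_gib((390, 512, 512), "complex128")

# Shape as literally written in the posted code: doubled axes, complex64
as_posted = array_gib((2 * 390, 2 * 512, 2 * 512), "complex64")

print(f"estimated: {as_estimated:.4f}")
print(f"as posted: {as_posted:.4f}")
```

If the doubled axes are intentional, the result alone is several times larger than the 2 GB on the GTX 1050, and the three meshgrid index arrays come on top of that on the host.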

If the matrix creation takes 1 min 42 seconds in Python, that is probably not something the CUDA users in this forum can help with. You might want to take that issue to a forum dedicated to Python. If Python used a compiler, I would suggest checking whether optimizations are turned on. But as far as I know Python is interpreted.

I checked your CPU’s specifications, and the system memory should be able to sustain 25 GB/sec throughput as I assumed. That would indicate that you are limited by host computation. Python may not utilize the SIMD floating-point unit of your CPU (and instead use scalar code for all computation), and it may not use more than one CPU core; either could explain the slowdown. When using an optimized math library with support for vectorized math functions, computation should not be the limiting factor by my back-of-the-envelope computation. Maybe you can link Python to better libraries, or maybe Python has configuration switches that allow you to make better use of your CPU.
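One host-side option, offered as an illustration rather than something tested on this machine: because exp(a+b+c) = exp(a)*exp(b)*exp(c), the 3D array factors into three 1D vectors, so only nz+ny+nx exponentials are needed and NumPy broadcasting does the rest with plain multiplications. This swaps the meshgrid approach for a separable one; the function name plane_wave is illustrative.

```python
import numpy as np

def plane_wave(shape, kz, ky, kx):
    """Build exp(2j*pi*(kz*z + ky*y + kx*x)) on a (nz, ny, nx) grid."""
    nz, ny, nx = shape
    # One short 1D exponential per axis instead of one per grid point
    ez = np.exp(2j * np.pi * kz * np.arange(nz)).astype(np.complex64)
    ey = np.exp(2j * np.pi * ky * np.arange(ny)).astype(np.complex64)
    ex = np.exp(2j * np.pi * kx * np.arange(nx)).astype(np.complex64)
    # Broadcasting forms the outer product without meshgrid index arrays
    return ez[:, None, None] * ey[None, :, None] * ex[None, None, :]

data = plane_wave((8, 16, 16), kz=0.1, ky=0.2, kx=0.2)
print(data.shape, data.dtype)
```

This also avoids allocating the three full-size index arrays that np.meshgrid returns, which matters when the result itself is already a large fraction of host memory.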

So which GPU would you suggest if I buy a new one? How do I calculate the largest problem a GPU can handle? For example, I want to perform an FFT on a matrix of size (512,1024,1024), (z,x,y).

Note that the out-of-memory condition during FFT is just a working hypothesis for now.

Before you ponder any GPU purchases, I would suggest debugging the current code, using smaller matrices, until it works correctly. Once the code is functionally correct, experiment with different matrix sizes, starting with a small size and systematically increasing the size, while observing GPU memory usage. This should let you estimate with some certainty how much total GPU memory you need for the targeted matrix size.

As discussed in a different thread in this forum, if you are using Windows 10, you should not count on more than 81% of your GPU memory being available to user applications when using the default WDDM driver (by observation, that seems to be the substantial memory overhead imposed by the WDDM 2.0 driver model used by Windows 10).
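A rough way to turn these numbers into a first estimate before measuring anything. The 81% usable fraction is the figure mentioned above. The workspace_factor is a placeholder assumption: cuFFT allocates a work area in addition to the data, but its size depends on the transform, so the value of 1.0 here is a guess that the measured memory usage should replace. The function name fft_fits is illustrative.

```python
import math

def fft_fits(shape, itemsize, gpu_bytes, usable_fraction=0.81, workspace_factor=1.0):
    """Return (fits, needed_bytes, usable_bytes) for an in-place FFT.

    workspace_factor is hypothetical: extra memory as a multiple of the
    data size. Replace it with what you observe on real runs.
    """
    data_bytes = math.prod(shape) * itemsize
    needed = data_bytes * (1.0 + workspace_factor)
    usable = gpu_bytes * usable_fraction
    return needed <= usable, needed, usable

# Target size from the question, single-precision complex (8 bytes per element)
fits, needed, usable = fft_fits((512, 1024, 1024), 8, gpu_bytes=2 * 2**30)
print(fits, needed / 2**30, usable / 2**30)
```

Even with workspace_factor set to 0, the (512,1024,1024) single-precision complex array alone is 4 * 2**30 bytes, so it cannot fit on a 2 GB card regardless of driver overhead.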

I tried a smaller matrix and it works. That is why I am thinking about purchasing a new graphics card.
Yes, I also noticed that part of the GPU RAM is reserved by Windows 10, and someone is complaining about that in another forum.