Question: How to performance-optimize this CUDA (numba) code?
Background: I am coding a least-squares fitter to determine the best signal shape in 1D data. Because the problem does not converge, algorithms such as Levenberg–Marquardt cannot be used. Instead, I compare a set of 1D signal shapes to the 1D data by sliding them over the data (the signals are shorter than the data) and computing least squares at each position, to determine the best combination of position and signal. The code runs nicely with Python + numba on the CPU. The amount of data to be analyzed, however, is astronomical, so I am trying to leverage my GTX 1060 with CUDA. This works in 1D:
@cuda.jit
def cuda1(data, signal, dy, chi2):
    i = cuda.grid(1)
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    i = tx + bx * bw
    xstride = cuda.gridsize(1)
    # Grid-stride loop: each thread computes the full chi2 for its positions
    for i in range(i, chi2.shape[0], xstride):
        value = 0
        for j in range(len(signal)):
            value = value + ((data[i + j] - signal[j]) ** 2) * dy[i + j]
        chi2[i] = value
And this version in 2D (the per-row summation is done on the host afterwards):
@cuda.jit
def cuda2(data, signal, dy, chi2map):
    startX = cuda.blockDim.x * cuda.blockIdx.x + cuda.threadIdx.x
    startY = cuda.blockDim.y * cuda.blockIdx.y + cuda.threadIdx.y
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y
    # Grid-stride loops over (position x, signal sample y); each element is one squared residual
    for x in range(startX, chi2map.shape[0], gridX):
        for y in range(startY, chi2map.shape[1], gridY):
            chi2map[x, y] = ((data[x + y] - signal[y]) ** 2) * dy[x + y]
For reference, the CPU version:
@numba.jit(fastmath=True, parallel=False, cache=True, nopython=True)
def cpu(data, signal, dy):
    # outer_loop_length is a module-level global, defined in the benchmark code below
    chi2 = numpy.zeros(outer_loop_length)
    for i in range(outer_loop_length):
        value = 0
        for j in range(len(signal)):
            value = value + ((data[i + j] - signal[j]) ** 2) * dy[i + j]
        chi2[i] = value
    return chi2
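For anyone who wants to cross-check the kernels, here is a vectorized NumPy formulation of the same sliding chi-square. This is only a sketch for verification, not a performance baseline (sliding_window_view needs NumPy >= 1.20, and the broadcasted intermediate gets large at the problem sizes below):

import numpy
from numpy.lib.stride_tricks import sliding_window_view

def numpy_reference(data, signal, dy):
    # Window i is data[i:i+len(signal)]; the window views do not copy the input
    data_win = sliding_window_view(data, len(signal))
    dy_win = sliding_window_view(dy, len(signal))
    # chi2[i] = sum over j of (data[i+j] - signal[j])**2 * dy[i+j]
    return (((data_win - signal) ** 2) * dy_win).sum(axis=1)

Its output should match cpu(data, signal, dy) up to floating-point rounding.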
When I benchmark these with typical problem sizes,
data_length = 50000
signal_length = 5000
and exclude the time to transfer data back and forth (the device arrays will be re-used, so this cost is negligible), I find that the CPU is faster by a factor of ~2:
- Time cuda1: 0.069 s
- Time cuda2: 0.084 s
- Time CPU: 0.042 s
Are there obvious performance blockers? Have I already reached the limit? For my real data, I will have to run this algorithm ~500 times on the same data (with a changing “signal”), and then again ~10^3 times on different datasets for all signals, so speed is really important.
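To make that workload concrete, here is a rough sketch of the intended outer loops. The synthetic datasets and signals are placeholders just to show the loop structure (the real inputs are of course different), and the launch configuration is the one from the benchmark below:

import numpy
from numba import cuda

# Placeholder inputs, only to illustrate the loop structure (really ~10^3 datasets, ~500 signals)
datasets = [(numpy.random.rand(50000).astype('float32'),
             numpy.random.rand(50000).astype('float32')) for _ in range(3)]
signals = [numpy.random.rand(5000).astype('float32') for _ in range(3)]

for data, dy in datasets:
    data_cuda = cuda.to_device(data)
    dy_cuda = cuda.to_device(dy)
    for signal in signals:
        signal_cuda = cuda.to_device(signal)
        chi2_cuda = cuda.device_array(len(data) - len(signal) + 1, dtype='float32')
        cuda1[32, 32](data_cuda, signal_cuda, dy_cuda, chi2_cuda)  # cuda1 as defined above
        chi2 = chi2_cuda.copy_to_host()
        best_position, best_chi2 = int(chi2.argmin()), float(chi2.min())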
Thank you so much in advance for your help. The code to run the benchmark is as follows:
import time
import numpy
import numba
from numba import cuda

# Problem sizes as given above
data_length = 50000
signal_length = 5000

data = numpy.full(data_length, 3, dtype='float32')
data_cuda = cuda.to_device(data)
dy = numpy.full(data_length, 2, dtype='float32')
dy_cuda = cuda.to_device(dy)
signal = numpy.full(signal_length, 1, dtype='float32')
signal_cuda = cuda.to_device(signal)
outer_loop_length = data_length - signal_length + 1
C_global_mem = cuda.device_array(outer_loop_length, dtype='float32')
# Run cuda1 once to compile
cuda1[32,32](data_cuda, signal_cuda, dy_cuda, C_global_mem)
t1 = time.perf_counter()
cuda1[32,32](data_cuda, signal_cuda, dy_cuda, C_global_mem)
chi2 = numpy.array(C_global_mem, dtype=numpy.float32)
t2 = time.perf_counter()
print('Time cuda1', t2-t1)
print('Result to check:', numpy.sum(chi2))
# Run cuda2 once to compile
chi2map = numpy.zeros((outer_loop_length, signal_length), dtype='float32')
cuda2[32,32](data_cuda, signal_cuda, dy_cuda, chi2map)
C_global_mem = cuda.device_array((outer_loop_length, signal_length), dtype='float32')
t1 = time.perf_counter()
cuda2[32,32](data_cuda, signal_cuda, dy_cuda, C_global_mem)
t2 = time.perf_counter()
chi2map = numpy.array(C_global_mem, dtype=numpy.float32)
row_sums = numpy.sum(chi2map, axis=1)
print('Time cuda2', t2-t1)
print('Result to check:', numpy.sum(row_sums))
# CPU version
result = cpu(data, signal, dy)
t1 = time.perf_counter()
result = cpu(data, signal, dy)
t2 = time.perf_counter()
print('CPU Time', t2-t1)
print('Result to check:', numpy.sum(result))