Hello, CUDA experts! I'm working on accelerating an ODE solver over multiple nodes by leveraging GPU parallelism. However, this is my first attempt at GPU computing, and my implementation is no faster than the equivalent NumPy code. Here's a summary of my approach:
- I'm using one kernel to compute the derivatives (`derivT`) and another for the Euler update (`euler_update`).
- The grid and block dimensions are set to use n threads per block (I tried several values of n).

Questions:
- I don't think the way I wrote the kernels is optimal. The idea was to assign each node to a thread. Is there a better way to structure the kernel functions that compute the derivatives at each timestep?
- Are there other CUDA-specific optimizations that could improve the runtime?
- Finally, are there any Python examples of numerical integration of a system of ODEs coded with CUDA? I couldn't find many papers or tutorials online for this kind of problem.
Here is the relevant code (simplified for clarity). To learn CUDA, I'm using the Lotka-Volterra equations as a toy model; the real model I'm working on has ~20 ODEs per node, connectivity between nodes, and stochasticity on some of the ODEs:
```python
import numpy as np
from numba import cuda


@cuda.jit
def euler_update(y, dydx, dt):
    # One thread per node: advance both state variables by one Euler step.
    j = cuda.grid(1)
    if j < y.shape[1]:
        y[0, j] += dydx[0, j] * dt[0]
        y[1, j] += dydx[1, j] * dt[0]


@cuda.jit
def derivT(y, dydx, params):
    # Lotka-Volterra right-hand side; params = [A, B, C, D].
    j = cuda.grid(1)
    if j < y.shape[1]:
        dydx[0, j] = y[0, j] * (params[0] - params[1] * y[1, j])
        dydx[1, j] = y[1, j] * (params[2] * y[0, j] - params[3])


def ode_solver(y, dydx, dt, all_outputs, params):
    # Grid dimensions: one thread per node.
    threads_per_block = 256
    blocks_per_grid = (y.shape[1] + threads_per_block - 1) // threads_per_block
    for i in range(nb_steps):  # nb_steps is a global defined below
        derivT[blocks_per_grid, threads_per_block](y, dydx, params)
        euler_update[blocks_per_grid, threads_per_block](y, dydx, dt)
        all_outputs[i, :, :] = y[:, :]
    return all_outputs


# Parameters: A, B, C, D, nbOdes, nb_nodes, dt, nb_steps
A = 0.66
B = 1.33
C = 1.
D = 1.
nbOdes = 2  # number of ODEs for each node
nb_nodes = 100000  # number of nodes
dt = np.array([1. / 1024.], dtype=np.float32)
nb_steps = 5 * int(1 / dt[0])  # length of the integration (in seconds)

# Host arrays:
params = np.array([A, B, C, D], dtype=np.float32)
y = np.ones((nbOdes, nb_nodes), dtype=np.float32)
dydx = np.ones((nbOdes, nb_nodes), dtype=np.float32)
all_outputs = np.zeros((nb_steps, nbOdes, nb_nodes), dtype=np.float32)  # records all y values during the simulation

# Copy arrays to the device:
y_device = cuda.to_device(y)
dydx_device = cuda.to_device(dydx)
params_device = cuda.to_device(params)
all_outputs_device = cuda.to_device(all_outputs)
dt_device = cuda.to_device(dt)

results = ode_solver(y_device, dydx_device, dt_device, all_outputs_device, params_device)
```
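For reference, the NumPy baseline I'm comparing against looks roughly like this (same simplifications; the function name and signature here are illustrative, not from my real script):

```python
import numpy as np

def ode_solver_numpy(y, params, dt, nb_steps):
    # Vectorized Euler integration of the Lotka-Volterra system over all nodes.
    A, B, C, D = params
    all_outputs = np.zeros((nb_steps,) + y.shape, dtype=y.dtype)
    for i in range(nb_steps):
        dydx0 = y[0] * (A - B * y[1])
        dydx1 = y[1] * (C * y[0] - D)
        y[0] += dydx0 * dt
        y[1] += dydx1 * dt
        all_outputs[i] = y  # setitem copies the current state
    return all_outputs
```

Each timestep is two vectorized operations over all nodes, so the only Python-level loop is over timesteps.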
Best