CUDA ODE Solver Not Outperforming NumPy – Need Advice on Performance Optimization

  Hello, CUDA experts! I’m working on accelerating an ODE solver over many nodes by leveraging GPU parallelism. However, this is my first attempt at GPU computing, and my implementation is not faster than the equivalent NumPy code. Here’s a summary of my approach:
  • I’m using a kernel to compute derivatives (derivT) and another for the Euler update (euler_update).

  • The grid and block dimensions are set to use n threads per block (I tried different values of n).

Questions:

  • I don’t think the way I wrote the kernels is optimal. The idea was to assign each node to a thread. Is there a better way to write the kernel functions that compute the derivatives at each timestep?

  • Are there other CUDA-specific optimizations that could improve the runtime?

  • Finally, are there any Python examples of numerical integration of a system of ODEs coded with CUDA? I couldn’t find many papers or tutorials online for this kind of problem.

Here is the relevant code (simplified for clarity). To learn CUDA, I’m using the Lotka-Volterra equations as a toy model; the real model I’m working on has ~20 ODEs per node, with connectivity between nodes and stochasticity in some of the ODEs.

import numpy as np
from numba import cuda

@cuda.jit
def euler_update(y, dydx, dt):
    j = cuda.grid(1)
    if j < y.shape[1]:
        y[0, j] += dydx[0, j] * dt[0]
        y[1, j] += dydx[1, j] * dt[0]

@cuda.jit
def derivT(y, dydx, params):
    # A = params[0]
    # B = params[1]
    # C = params[2]
    # D = params[3]

    j = cuda.grid(1)
    if j < y.shape[1]:

        dydx[0, j] = y[0, j] * (params[0] - params[1] * y[1, j])
        dydx[1, j] = y[1, j] * (params[2] * y[0, j] - params[3])

def ode_solver(y, dydx, dt, all_outputs, params):
    # note: nb_steps is read from the enclosing (module) scope

    #grid dimensions:
    threads_per_block = 256
    blocks_per_grid = (y.shape[1] + threads_per_block - 1) // threads_per_block

    for i in range(nb_steps):

        derivT[blocks_per_grid, threads_per_block](y, dydx, params)

        euler_update[blocks_per_grid, threads_per_block](y, dydx, dt)
        all_outputs[i,:, :] = y[:, :]
    return all_outputs

#initializing:
#parameters:
#A, B, C, D, nbOdes, nb_nodes, dt, nb_steps
A = 0.66
B = 1.33
C = 1.
D = 1.
nbOdes = 2 #number of ODEs for each node
nb_nodes = 100000 #number of nodes
dt = np.array([1. / 1024.], dtype=np.float32)

nb_steps = 5 * int(1 / dt[0]) #length of the integration (in seconds)



#arrays :
params = np.array([A, B, C, D], dtype = np.float32)
y = np.ones((nbOdes,nb_nodes), dtype = np.float32)
dydx = np.ones((nbOdes,nb_nodes), dtype = np.float32)
all_outputs = np.zeros((nb_steps, nbOdes, nb_nodes), dtype = np.float32) #recording values of all y during the simulation



#initialize arrays:
y_device = cuda.to_device(y)
dydx_device = cuda.to_device(dydx)
params_device = cuda.to_device(params)
all_outputs_device = cuda.to_device(all_outputs)
dt_device = cuda.to_device(dt)

results = ode_solver(y_device, dydx_device, dt_device, all_outputs_device, params_device)
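For context, the NumPy baseline I’m benchmarking against is essentially a vectorized Euler loop along these lines (sketched here in the same simplified form; function name is just for illustration):

```python
import numpy as np

def lotka_volterra_numpy(y, params, dt, nb_steps):
    # Vectorized Euler integration over all nodes at once;
    # y has shape (nbOdes, nb_nodes) and is updated in place.
    A, B, C, D = params
    all_outputs = np.empty((nb_steps,) + y.shape, dtype=y.dtype)
    for i in range(nb_steps):
        dy0 = y[0] * (A - B * y[1])
        dy1 = y[1] * (C * y[0] - D)
        y[0] += dy0 * dt
        y[1] += dy1 * dt
        all_outputs[i] = y
    return all_outputs

out = lotka_volterra_numpy(np.ones((2, 1000), dtype=np.float32),
                           (0.66, 1.33, 1.0, 1.0), 1.0 / 1024.0, 10)
```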

Best

That last line (all_outputs[i, :, :] = y[:, :]) seems unnecessary: euler_update has already computed y. If you want the result stored in two different places, then pass all_outputs and i to euler_update and have it do the writing:

@cuda.jit
def euler_update(y, dydx, dt, i, all_outputs):
    j = cuda.grid(1)
    if j < y.shape[1]:
        t = dydx[0,j] * dt[0] + y[0,j]
        y[0, j] = t
        all_outputs[i,0,j] = t
        t = dydx[1,j] * dt[0] + y[1,j]
        y[1, j] = t
        all_outputs[i,1,j] = t

This isn’t likely to make much of a performance difference, however; it only saves one read operation per element. But storing the result in both y and all_outputs is redundant: you could eliminate y altogether and save another write operation per element. Hopefully the example above makes it evident how that could be done.

I’m not sure why you think it is not good. Your work breakdown (one node per thread) is the typical strategy for CUDA, and you will get good coalescing on memory reads and writes because j is in the last index position.
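Concretely, your arrays are C-contiguous (row-major), so consecutive values of j sit at consecutive addresses, which is what lets the hardware coalesce a warp’s loads and stores:

```python
import numpy as np

y = np.ones((2, 8), dtype=np.float32)
# Row-major layout: moving one step in j advances 4 bytes (one float32),
# while moving between rows advances a full row (8 * 4 = 32 bytes).
# Adjacent threads therefore touch adjacent addresses when j is last.
print(y.strides)
```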

It’s possible that at a “width” of 100,000 you are not fully utilizing the GPU, if it is a “big” GPU like an A100. In that case you could double the width of the work: in euler_update, for example, compute y[0, j] in half of your threads and y[1, j] in the other half, allowing you to launch twice as many threads. You could do something similar in derivT. Even then, I doubt it would make much of a performance difference.