Improving code execution speed on an NVIDIA H100 when training a Spiking Neural Network

Description

Hello,

I am looking for guidance on how to make my code execute faster on a GPU.
I am training a fully connected Spiking Neural Network in Python using Numba's CUDA support, which means there are large arrays of weights to train, and I am using an NVIDIA H100 to speed up the execution.

I have pasted the piece of code I am trying to execute below:


import math
import numpy as np
from numba import cuda

tmp_diff_loss_Whid = np.zeros(2500 * 20230, dtype='f')

dev_Vin = cuda.to_device(Vin)
dev_Vhid = cuda.to_device(Vhid)
dev_tmp_diff_loss_Whid = cuda.device_array_like(tmp_diff_loss_Whid)

@cuda.jit
def diff_loss_Whid_calculation(Vhid, Vin, tmp_diff_loss_Whid):
    count = cuda.grid(1)
    for count in range(50575000):
        k = math.ceil(count / 2500)
        i = k - 1
        n = (count - (2500 * k)) - 1
        size_temp = len(Vin[i])
        tmp = m = 0
        for m in range(size_temp):
            tmp += Vin[i][m] * ((1 / math.pi) * (1 / (1 + ((Vhid[n][m] * math.pi) * (Vhid[n][m] * math.pi)))))
            m += 1
        tmp_diff_loss_Whid[count] = tmp

threadsperblock = 1024
blockspergrid_x = int(math.ceil(50575000 / threadsperblock))
blockspergrid = (blockspergrid_x)

diff_loss_Whid_calculation[blockspergrid, threadsperblock](dev_Vhid, dev_Vin, dev_tmp_diff_loss_Whid)
tmp_diff_loss_Whid = dev_tmp_diff_loss_Whid.copy_to_host()


The above code takes more than 8 hours to compute, but if I reduce the loop size from 50,575,000 to 100,000 it takes nearly 75 minutes.
CPU execution seems to be faster than GPU execution.
I think my code is lacking in some way that I am not aware of, which is why I am not getting maximum efficiency.

I would really appreciate any suggestions.

Thank you,
Kind regards,
Amrutha

First, it’s easier to read if you format and post your code in block format.

IIUC, you’re trying to launch a grid of 50,575,000 threads, and then in your kernel you’re asking each thread to execute a for loop of 50,575,000 iterations… Is that what you really want to do? Remember that when you write a CUDA kernel, that kernel runs on every thread in the grid. You should be using count = cuda.grid(1), the absolute position of the thread in the grid, to assign each thread a portion of the work. I can’t tell what’s going on after that without formatting.
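
For reference, here is a minimal sketch of that pattern applied to your kernel, assuming the index arithmetic from your original post is what you intended. Each thread computes exactly one element of the output, and the bounds guard keeps the surplus threads in the last block from writing out of range:

import math
from numba import cuda

@cuda.jit
def diff_loss_Whid_calculation(Vhid, Vin, tmp_diff_loss_Whid):
    count = cuda.grid(1)                    # this thread's absolute position in the grid
    if count < tmp_diff_loss_Whid.size:     # surplus threads in the last block do nothing
        # Index arithmetic copied from the original post (assumed intent).
        k = int(math.ceil(count / 2500))
        i = k - 1
        n = (count - (2500 * k)) - 1
        tmp = 0.0
        for m in range(Vin.shape[1]):       # the inner reduction stays per-thread
            tmp += Vin[i, m] * ((1 / math.pi) * (1 / (1 + (Vhid[n, m] * math.pi) ** 2)))
        tmp_diff_loss_Whid[count] = tmp

# Launch configuration as before: enough blocks for one thread per output element,
# using the device arrays (dev_Vhid, dev_Vin, dev_tmp_diff_loss_Whid) defined above.
threadsperblock = 1024
blockspergrid = math.ceil(50575000 / threadsperblock)
diff_loss_Whid_calculation[blockspergrid, threadsperblock](dev_Vhid, dev_Vin, dev_tmp_diff_loss_Whid)

With this structure the 50,575,000 outer iterations are distributed across the grid once in total, instead of being repeated in full by every thread.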

I highly suggest you review the Numba documentation and take the Fundamentals of Accelerated Computing with CUDA Python course before continuing.

Thank you, @mnicely, I was able to rectify the issue and proceed. :)
