Improving code execution speed on an NVIDIA H100 when training a Spiking Neural Network

Description

Hello,

I am looking for guidance on how to make my code execute faster on a GPU.
I am training a fully connected Spiking Neural Network in Python using Numba's CUDA support, which means there are large arrays of weights to train, and I am using an NVIDIA H100 to speed up the execution.

I have pasted the piece of code I am trying to execute below:


import math
import numpy as np
from numba import cuda

tmp_diff_loss_Whid = np.zeros(2500 * 20230, dtype='f')

dev_Vin = cuda.to_device(Vin)
dev_Vhid = cuda.to_device(Vhid)
dev_tmp_diff_loss_Whid = cuda.device_array_like(tmp_diff_loss_Whid)

@cuda.jit
def diff_loss_Whid_calculation(Vhid, Vin, tmp_diff_loss_Whid):
    count = cuda.grid(1)
    for count in range(50575000):
        k = math.ceil(count / 2500)
        i = k - 1
        n = (count - (2500 * k)) - 1
        size_temp = len(Vin[i])
        tmp = m = 0
        for m in range(size_temp):
            tmp += Vin[i][m] * ((1 / math.pi) * (1 / (1 + ((Vhid[n][m] * math.pi) * (Vhid[n][m] * math.pi)))))
            m += 1
        tmp_diff_loss_Whid[count] = tmp

threadsperblock = 1024
blockspergrid_x = int(math.ceil(50575000 / threadsperblock))
blockspergrid = (blockspergrid_x)

diff_loss_Whid_calculation[blockspergrid, threadsperblock](dev_Vhid, dev_Vin, dev_tmp_diff_loss_Whid)
tmp_diff_loss_Whid = dev_tmp_diff_loss_Whid.copy_to_host()


The above code takes more than 8 hours to compute, but if I reduce the loop size from 50,575,000 to 100,000 it takes nearly 75 minutes.
CPU execution seems to be faster than GPU execution.
I think my code is lacking in some way that I am not aware of, which is why I am not getting maximum efficiency.

I would really appreciate any suggestions.

Thank you,
Kind regards,
Amrutha

First, it’s easier to read if you format and post your code in block format.

IIUC, you’re trying to launch a grid of 50,575,000 threads, and then in your kernel you’re asking each thread to execute a for loop of 50,575,000 iterations… Is that what you really want to do? Remember that when you write a CUDA kernel, that kernel runs on every thread in the grid. You should be using count = cuda.grid(1), the absolute position of the thread in the grid, to assign each thread a portion of the work. I can’t tell what’s going on after that without formatting.
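
For reference, here is a minimal sketch of that pattern applied to your kernel, assuming the index arithmetic from your original post is what you intended. Each thread computes exactly one element of the output, and the bounds guard keeps the surplus threads in the last block from writing out of range:

import math
from numba import cuda

@cuda.jit
def diff_loss_Whid_calculation(Vhid, Vin, tmp_diff_loss_Whid):
    count = cuda.grid(1)                    # this thread's absolute position in the grid
    if count < tmp_diff_loss_Whid.size:     # surplus threads in the last block do nothing
        # Index arithmetic copied from the original post (assumed intent).
        k = int(math.ceil(count / 2500))
        i = k - 1
        n = (count - (2500 * k)) - 1
        tmp = 0.0
        for m in range(Vin.shape[1]):       # the inner reduction stays per-thread
            tmp += Vin[i, m] * ((1 / math.pi) * (1 / (1 + (Vhid[n, m] * math.pi) ** 2)))
        tmp_diff_loss_Whid[count] = tmp

# Launch configuration as before: enough blocks for one thread per output element,
# using the device arrays (dev_Vhid, dev_Vin, dev_tmp_diff_loss_Whid) defined above.
threadsperblock = 1024
blockspergrid = math.ceil(50575000 / threadsperblock)
diff_loss_Whid_calculation[blockspergrid, threadsperblock](dev_Vhid, dev_Vin, dev_tmp_diff_loss_Whid)

With this structure the 50,575,000 outer iterations are distributed across the grid once in total, instead of being repeated in full by every thread.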

I highly suggest you review the Numba documentation and take the Fundamentals of Accelerated Computing with CUDA Python course before continuing.

Thank you, @mnicely, I was able to rectify the issue and proceed. :)
