Description
Hello,
I am looking for guidance on how to make my code run faster on a GPU.
I am training a fully connected spiking neural network using CUDA (Python).
The network has large weight arrays that I need to train, so I am using an NVIDIA H100 to speed up execution.
I have pasted the code I am trying to run below:
import math
import numpy as np
from numba import cuda

tmp_diff_loss_Whid = np.zeros(2500*20230, dtype='f')
dev_Vin = cuda.to_device(Vin)
dev_Vhid = cuda.to_device(Vhid)
dev_tmp_diff_loss_Whid = cuda.device_array_like(tmp_diff_loss_Whid)

@cuda.jit
def diff_loss_Whid_calculation(Vhid, Vin, tmp_diff_loss_Whid):
    count = cuda.grid(1)
    for count in range(50575000):
        k = math.ceil(count/2500)
        i = k - 1
        n = ((count - ((2500) * k)) - 1)
        size_temp = len(Vin[i])
        tmp = m = 0
        for m in range(size_temp):
            tmp += Vin[i][m] * ((1 / math.pi) * (1 / (1 + ((Vhid[n][m] * math.pi) * (Vhid[n][m] * math.pi)))))
            m += 1
        tmp_diff_loss_Whid[count] = tmp

threadsperblock = 1024
blockspergrid_x = int(math.ceil(50575000 / threadsperblock))
blockspergrid = (blockspergrid_x)
diff_loss_Whid_calculation[blockspergrid, threadsperblock](dev_Vhid, dev_Vin, dev_tmp_diff_loss_Whid)
tmp_diff_loss_Whid = dev_tmp_diff_loss_Whid.copy_to_host()
The code above takes more than 8 hours to run. Even if I reduce the loop size from 50575000 to 100000, it still takes nearly 75 minutes.
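A quick back-of-the-envelope check of the launch configuration may help put these timings in context. In the kernel as posted, the `for count in range(50575000)` loop rebinds `count`, so the value returned by `cuda.grid(1)` is discarded and each launched thread walks every output element. The sketch below only does the arithmetic for that case; the variable names are illustrative and not part of the original code.

```python
import math

TOTAL = 2500 * 20230            # 50,575,000 output elements
THREADS_PER_BLOCK = 1024

# Same launch arithmetic as in the posted code.
blocks = math.ceil(TOTAL / THREADS_PER_BLOCK)
threads = blocks * THREADS_PER_BLOCK

# As posted: every thread loops over all TOTAL elements.
elements_as_posted = threads * TOTAL
# One element per thread: the whole grid touches each element once.
elements_one_per_thread = TOTAL

print("blocks:", blocks)
print("threads:", threads)
print("redundancy factor:", elements_as_posted // elements_one_per_thread)
```

The redundancy factor equals the number of launched threads, which would be consistent with the GPU version being slower than a single CPU loop.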
CPU execution seems to be faster than GPU execution.
I think my code is lacking in some way that I am not aware of, so I am not getting the full performance of the GPU.
I would really appreciate any suggestions.
Thank you,
kind regards,
Amrutha