Large matrix multiplication for neural network purposes

Hello everyone,
I am a new user of CUDA, and I am working on a neural network project for which I need your help. To simplify the problem, let's say I have one matrix A(100,500), one matrix B(100,500), the weight matrix W(500,500,100) and the resulting matrix C(100,100).

The Matlab code below represents one iteration i:

for k = 1:100
    C(:,k) = diag(A*W(:,:,k)*B');
end

I need to run 10 million iterations (i = 1 to 10e6). The matrix W won't change from one iteration to the next, whereas A, B and C will (A = fct(i), B = fct(i), C = fct(i)). These 10 million iterations let me evaluate the objective function, after which I will update the weight matrix W with a gradient descent algorithm.

  1. First of all, I was wondering whether GPU programming could speed this up, since the problem is highly parallelizable. If I use my CPU, I will spend hours or days just to evaluate one step of the objective function…

  2. For example, I have a GeForce 740M with 384 cores and 2 GB of memory. Would it be possible for each core to execute the algorithm I wrote above, or should I proceed in a different way?

Thank you very much for your help.

Well, TBH you should look into some general CUDA tutorials.

That being said, you probably don't want to implement matrix multiplication yourself. The CUDA toolkit already ships with libraries for that, e.g. cuBLAS: https://developer.nvidia.com/cublas. Note: I have never used it myself.
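If you do go the cuBLAS route, the structure of your problem (one fixed A against a stack of 100 W slices) maps nicely onto its batched GEMM interface. Here is a minimal, untested sketch, assuming single precision, column-major storage (what cuBLAS expects), device pointers d_A/d_W/d_M that I made up, and a toolkit recent enough to have cublasSgemmStridedBatched:

#include <cublas_v2.h>

/* Computes M_k = A * W(:,:,k) for k = 0..99 in one batched call.
 * Sizes follow the question: A is 100x500, each W slice is 500x500,
 * so each M_k is 100x500. All matrices are column-major and already
 * resident on the device. */
void batched_AW(cublasHandle_t handle,
                const float *d_A,  /* 100 x 500       */
                const float *d_W,  /* 500 x 500 x 100 */
                float *d_M)        /* 100 x 500 x 100 */
{
    const int m = 100, n = 500, k = 500, batch = 100;
    const float alpha = 1.0f, beta = 0.0f;

    /* strideA = 0 is intended to broadcast the same A to every GEMM
     * in the batch -- an assumption worth checking against the cuBLAS
     * docs of your toolkit version. */
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              m, n, k, &alpha,
                              d_A, m, 0,
                              d_W, k, (long long)k * n,
                              &beta,
                              d_M, m, (long long)m * n,
                              batch);
}

Note that diag(X*B') with X = A*W(:,:,k) is just C(j,k) = sum over n of X(j,n)*B(j,n), so you never need to form the full 100x100 product; a small reduction kernel over M_k and B finishes the job.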

If you still want to implement it yourself:

  1. Naive way: each thread in the kernel computes one element of the result matrix (one dot product). However, the memory access pattern is far from optimal (look up memory coalescing and shared-memory bank conflicts). This should already give you a speed-up, depending on your CPU/GPU; a sketch follows after this list.
  2. Use shared memory to align reads and writes. If you have looked into bank conflicts, you should have an idea of how to improve the memory access with a shared-memory tiling scheme; a tiled sketch also follows below.
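To make point 1 concrete, here is a minimal, untested sketch of a naive kernel for square n x n row-major matrices; all names are illustrative:

__global__ void matmul_naive(const float *A, const float *B, float *C, int n)
{
    /* One thread per output element C[row][col]. */
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        /* Each thread streams a full row of A and a full column of B
         * from global memory, which is what makes this version slow. */
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

A launch could look like:

dim3 block(16, 16);
dim3 grid((n + 15) / 16, (n + 15) / 16);
matmul_naive<<<grid, block>>>(d_A, d_B, d_C, n);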
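And for point 2, a sketch of the usual shared-memory tiling scheme: each block stages TILE x TILE sub-tiles of A and B in shared memory, so every global memory value is loaded once per tile instead of once per thread. Again untested and illustrative:

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        /* Cooperative, coalesced loads; zero-pad beyond the edges. */
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < n && col < n)
        C[row * n + col] = acc;
}

Even then, a hand-written kernel is unlikely to beat cuBLAS, so treat these as learning material.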