Large matrix multiplication for neural network purposes

Hello everyone,
I am a new user of CUDA, and I am working on a neural network project for which I need your help. To simplify the problem, let's say I have one matrix A(100,500), one matrix B(100,500), the weight matrix W(500,500,100) and the resulting matrix C(100,100).

The Matlab code below represents one iteration i:

for k = 1:100
    C(:,k) = diag(A*W(:,:,k)*B');
end

I need to run 10 million iterations (i = 1 to 10e6). The matrix W won't change from one iteration to the next, whereas A, B and C will (A = fct(i), B = fct(i), C = fct(i)). These 10 million iterations let me evaluate the objective function, after which I will update the weight matrix W with a gradient descent algorithm.

  1. First of all, I was wondering whether GPU programming could speed this up, since the problem is highly parallelizable. If I use my CPU, I will spend hours or days just to evaluate one step of the objective function…

  2. For example, I have a GeForce 740M with 384 cores and 2 GB of memory. Would it be possible for each core to execute the algorithm I wrote above, or should I proceed in a different way?

Thank you very much for your help.

Well, TBH you should look into some general CUDA tutorials.

That being said, you probably don't want to implement matrix multiplication yourself. The CUDA toolkit already ships with libraries for that, e.g. cuBLAS: https://developer.nvidia.com/cublas. Note: I have never used it myself.
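If you do go the cuBLAS route, the structure of your problem (one fixed A against a stack of 100 W slices) maps nicely onto its batched GEMM interface. Here is a minimal, untested sketch, assuming single precision, column-major storage (what cuBLAS expects), device pointers d_A/d_W/d_M that I made up, and a toolkit recent enough to have cublasSgemmStridedBatched:

#include <cublas_v2.h>

/* Computes M_k = A * W(:,:,k) for k = 0..99 in one batched call.
 * Sizes follow the question: A is 100x500, each W slice is 500x500,
 * so each M_k is 100x500. All matrices are column-major and already
 * resident on the device. */
void batched_AW(cublasHandle_t handle,
                const float *d_A,  /* 100 x 500       */
                const float *d_W,  /* 500 x 500 x 100 */
                float *d_M)        /* 100 x 500 x 100 */
{
    const int m = 100, n = 500, k = 500, batch = 100;
    const float alpha = 1.0f, beta = 0.0f;

    /* strideA = 0 is intended to broadcast the same A to every GEMM
     * in the batch -- an assumption worth checking against the cuBLAS
     * docs of your toolkit version. */
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              m, n, k, &alpha,
                              d_A, m, 0,
                              d_W, k, (long long)k * n,
                              &beta,
                              d_M, m, (long long)m * n,
                              batch);
}

Note that diag(X*B') with X = A*W(:,:,k) is just C(j,k) = sum over n of X(j,n)*B(j,n), so you never need to form the full 100x100 product; a small reduction kernel over M_k and B finishes the job.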

If you still want to implement it yourself:

  1. Naive way: each thread in the kernel computes one element of the result matrix (one dot product). However, the memory access pattern is far from optimal (look up memory coalescing and shared-memory bank conflicts). This should already give you a speed-up, depending on your CPU/GPU; a sketch follows after this list.
  2. Use shared memory to align reads and writes. If you have looked into bank conflicts, you should have an idea of how to improve the memory access with a shared-memory tiling scheme; a tiled sketch also follows below.
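To make point 1 concrete, here is a minimal, untested sketch of a naive kernel for square n x n row-major matrices; all names are illustrative:

__global__ void matmul_naive(const float *A, const float *B, float *C, int n)
{
    /* One thread per output element C[row][col]. */
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        /* Each thread streams a full row of A and a full column of B
         * from global memory, which is what makes this version slow. */
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

A launch could look like:

dim3 block(16, 16);
dim3 grid((n + 15) / 16, (n + 15) / 16);
matmul_naive<<<grid, block>>>(d_A, d_B, d_C, n);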
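And for point 2, a sketch of the usual shared-memory tiling scheme: each block stages TILE x TILE sub-tiles of A and B in shared memory, so every global memory value is loaded once per tile instead of once per thread. Again untested and illustrative:

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        /* Cooperative, coalesced loads; zero-pad beyond the edges. */
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < n && col < n)
        C[row * n + col] = acc;
}

Even then, a hand-written kernel is unlikely to beat cuBLAS, so treat these as learning material.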