Justifying the Behavior of a Memory-Bound Algorithm


I am trying to improve the speed of a matrix decomposition algorithm, but I am not sure how to justify its behavior, or which aspects of that behavior I should concentrate on justifying.

The baseline algorithm for comparison is a hybrid of two different algorithms: for matrices up to 20K it uses a recursive algorithm, and beyond that it uses a blocked algorithm (both in double precision).
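To make the hybrid baseline concrete, here is a minimal sketch of the size-based dispatch described above. The function names, the Cholesky placeholder, and the exact crossover handling are assumptions for illustration, not the poster's actual implementation:

```python
import numpy as np

CROSSOVER_N = 20_000  # switch point between the two algorithms, per the post

def recursive_decompose(a: np.ndarray) -> np.ndarray:
    """Placeholder for the recursive variant (here just a plain Cholesky)."""
    return np.linalg.cholesky(a)

def blocked_decompose(a: np.ndarray, block: int = 256) -> np.ndarray:
    """Placeholder for the blocked variant (here also a plain Cholesky)."""
    return np.linalg.cholesky(a)

def baseline_decompose(a: np.ndarray) -> np.ndarray:
    # Dispatch on matrix dimension, as the baseline is described:
    # recursive up to 20K, blocked for everything larger.
    n = a.shape[0]
    if n <= CROSSOVER_N:
        return recursive_decompose(a)
    return blocked_decompose(a)
```

The point of the sketch is only the dispatch structure; in the real code each branch would call its own GPU kernel or cuBLAS-backed routine.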

Here I have two GPUs. Why does the A100 behave differently from the RTX at large dimensions, where it starts to perform better? And why does performance drop in the middle of the range?

Are you using libraries such as cuBLAS or cuSPARSE? If not, your question doesn't belong here; it belongs on the CUDA programming forum. (Regardless, I'd be surprised if anyone could give constructive advice based on what you have shown so far.)

I am also using cuBLAS.
Could the memory type explain this? The two cards use different memory technologies: one is clocked faster but has lower total bandwidth, while the other is clocked slower but has higher bandwidth.
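If the decomposition really is memory-bound, a simple roofline-style argument gives a lower bound on runtime: bytes moved divided by sustained memory bandwidth. A minimal sketch of that reasoning follows; the bandwidth figures are illustrative placeholders, not measured values for either card:

```python
def min_runtime_s(bytes_moved: float, bandwidth_gb_s: float) -> float:
    # Lower bound on the runtime of a memory-bound kernel:
    # time >= bytes moved / sustained memory bandwidth.
    return bytes_moved / (bandwidth_gb_s * 1e9)

# Illustrative comparison: one card with lower total bandwidth, one with
# a wider bus and higher bandwidth (placeholder numbers, in GB/s).
n = 40_000
bytes_moved = 8.0 * n * n   # one pass over an n x n double-precision matrix
low_bw, high_bw = 700.0, 1500.0

t_low = min_runtime_s(bytes_moved, low_bw)
t_high = min_runtime_s(bytes_moved, high_bw)
# The higher-bandwidth card's bound is proportionally lower, which is
# consistent with it pulling ahead once matrices are large enough that
# memory traffic, not compute, dominates.
print(t_low > t_high)
```

For small matrices, other factors (kernel launch overhead, cache residency, clock speeds) can dominate instead, which is one plausible reason the ordering of the cards changes across the size range.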

I have tried to reduce data movement in memory. Notably, the Volta and RTX charts are similar; only the A100 is different.