Justifying the Behavior of a Memory-Bound Algorithm


I am trying to improve the speed of a matrix decomposition algorithm, but I am not sure how to justify its behavior, or which aspects of that behavior I should concentrate on justifying.

The baseline algorithm for comparison is a hybrid of two different algorithms: for matrices up to 20K it uses a recursive algorithm, and beyond that it uses a blocked algorithm (both in double precision).
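To make the hybrid baseline concrete, here is a minimal sketch of the size-based dispatch described above. The function names, the Cholesky placeholder, and the exact crossover handling are assumptions for illustration, not the poster's actual implementation:

```python
import numpy as np

CROSSOVER_N = 20_000  # switch point between the two algorithms, per the post

def recursive_decompose(a: np.ndarray) -> np.ndarray:
    """Placeholder for the recursive variant (here just a plain Cholesky)."""
    return np.linalg.cholesky(a)

def blocked_decompose(a: np.ndarray, block: int = 256) -> np.ndarray:
    """Placeholder for the blocked variant (here also a plain Cholesky)."""
    return np.linalg.cholesky(a)

def baseline_decompose(a: np.ndarray) -> np.ndarray:
    # Dispatch on matrix dimension, as the baseline is described:
    # recursive up to 20K, blocked for everything larger.
    n = a.shape[0]
    if n <= CROSSOVER_N:
        return recursive_decompose(a)
    return blocked_decompose(a)
```

The point of the sketch is only the dispatch structure; in the real code each branch would call its own GPU kernel or cuBLAS-backed routine.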

Here I have two GPUs. Why does the A100 behave differently from the RTX at large dimensions, where it starts to perform better? And why does performance drop in the middle of the range?

Are you using libraries such as cuBLAS or cuSPARSE? If not, your question doesn't belong here; it belongs on the CUDA programming forum. (Regardless, I'd be surprised if anyone could give constructive advice based on what you have shown so far.)

I am also using cuBLAS.
Could the memory type explain this? The two cards use different memory technologies: one is clocked faster but has lower total bandwidth, while the other is clocked slower but has higher bandwidth.
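If the decomposition really is memory-bound, a simple roofline-style argument gives a lower bound on runtime: bytes moved divided by sustained memory bandwidth. A minimal sketch of that reasoning follows; the bandwidth figures are illustrative placeholders, not measured values for either card:

```python
def min_runtime_s(bytes_moved: float, bandwidth_gb_s: float) -> float:
    # Lower bound on the runtime of a memory-bound kernel:
    # time >= bytes moved / sustained memory bandwidth.
    return bytes_moved / (bandwidth_gb_s * 1e9)

# Illustrative comparison: one card with lower total bandwidth, one with
# a wider bus and higher bandwidth (placeholder numbers, in GB/s).
n = 40_000
bytes_moved = 8.0 * n * n   # one pass over an n x n double-precision matrix
low_bw, high_bw = 700.0, 1500.0

t_low = min_runtime_s(bytes_moved, low_bw)
t_high = min_runtime_s(bytes_moved, high_bw)
# The higher-bandwidth card's bound is proportionally lower, which is
# consistent with it pulling ahead once matrices are large enough that
# memory traffic, not compute, dominates.
print(t_low > t_high)
```

For small matrices, other factors (kernel launch overhead, cache residency, clock speeds) can dominate instead, which is one plausible reason the ordering of the cards changes across the size range.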

I have tried to reduce data movement in memory. Notably, the Volta and RTX charts are similar; only the A100 is different.