I have evaluated the performance of the graphics card by comparing the computation time on the CPU and on the NVIDIA GPU, using matrixmul and simplecublas. The matrix size is variable, so that one can clearly see the speedup from the graphics card for bigger matrices.
However, the computation time of matrixmul is always greater than that of simplecublas. Why is that? (One possible answer might be that the CUBLAS implementations are optimized, but I am not sure.)
If one takes a closer look at the occupancy calculator for matrixmul with threads/block = 256, registers/thread = 14, shared mem/block = 2048, the occupancy of each multiprocessor is 67%.
As I understand it, there are mainly three ways to increase the occupancy: changing the thread block dimensions, the shared memory size, or the number of registers. In the case of matrixmul, the number of registers used per thread is the bottleneck.
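To show where the 67% comes from, here is a rough sketch of the arithmetic as I understand it. The per-multiprocessor limits are an assumption on my side (compute capability 1.0/1.1: 768 threads, 8 blocks, 8192 registers, 16 KB shared memory); other GPUs have different limits. The names `SmLimits`, `residentBlocks` and `occupancy` are made up for illustration, and the sketch ignores the allocation granularity that the real occupancy calculator takes into account.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits.
// ASSUMPTION: compute capability 1.0/1.1 hardware; adjust for other GPUs.
struct SmLimits {
    int maxThreads  = 768;    // resident threads per multiprocessor
    int maxBlocks   = 8;      // resident blocks per multiprocessor
    int registers   = 8192;   // 32-bit registers per multiprocessor
    int sharedBytes = 16384;  // shared memory per multiprocessor
};

// Number of blocks that can be resident on one multiprocessor at once.
// Simplified: register and shared memory allocation granularity is ignored.
int residentBlocks(int threadsPerBlock, int regsPerThread, int sharedPerBlock,
                   const SmLimits& sm = SmLimits{}) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && sharedPerBlock >= 0);
    int byThreads = sm.maxThreads / threadsPerBlock;
    int byRegs    = sm.registers / (threadsPerBlock * regsPerThread);
    int byShared  = sharedPerBlock > 0 ? sm.sharedBytes / sharedPerBlock
                                       : sm.maxBlocks;
    // The tightest of the four limits decides.
    return std::min({byThreads, byRegs, byShared, sm.maxBlocks});
}

// Occupancy = resident threads / maximum resident threads.
double occupancy(int threadsPerBlock, int regsPerThread, int sharedPerBlock,
                 const SmLimits& sm = SmLimits{}) {
    int blocks = residentBlocks(threadsPerBlock, regsPerThread,
                                sharedPerBlock, sm);
    return static_cast<double>(blocks * threadsPerBlock) / sm.maxThreads;
}
```

With these assumed limits, `occupancy(256, 14, 2048)` allows only 2 blocks (512 of 768 threads, about 67%), because one block needs 256 * 14 = 3584 registers. At 10 registers per thread a third block would fit, and the same call with 10 instead of 14 gives 100%.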
What can I do to increase the occupancy, given that I don't have any influence on the number of registers (as the compiler tries to increase the number of threads while using fewer registers)?
Although I have not changed a single line of matrixmul, the program crashes when the matrix dimension increases to 5120x5120. Memory allocation works fine. I don't have a clue why the program is crashing.
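To put the 5120x5120 case into numbers, here is a small sketch of the memory and work involved. It assumes single-precision (4-byte) elements and three square matrices (A, B and the result C); the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for the three n x n matrices A, B and C = A * B.
// ASSUMPTION: single-precision elements (4 bytes each) by default.
std::uint64_t matrixBytes(std::uint64_t n, std::uint64_t bytesPerElement = 4) {
    assert(n > 0);
    return 3 * n * n * bytesPerElement;
}

// Floating-point operations of a plain n x n matrix product:
// n * n output elements, each needing n multiplications and n additions.
std::uint64_t matmulFlops(std::uint64_t n) {
    return 2 * n * n * n;
}
```

For n = 5120, `matrixBytes` gives 314,572,800 bytes (300 MiB in total) and `matmulFlops` gives 268,435,456,000 operations, which is 125 times the work of a 1024x1024 product. So the amount of computation grows much faster than the memory.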
Thanks in advance for your help.