Generic DGEMM implementation


I just started with CUDA and have been playing around with CUBLAS. For my work, I need a generic, well-performing DGEMM implementation that I can modify, so I need the source code, which is not available for CUBLAS…

Do you know of any “open” CUDA DGEMM implementation I could look into? If not, do you have any tips on how to implement an efficient DGEMM?


Best regards,


Source code of CUBLAS is available:

Search for the posts by vvolkov; he wrote the implementation used in the later CUBLAS versions. It is for SGEMM, but you can then rewrite it for double precision.
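
If you want something to start modifying right away, here is a minimal sketch of a shared-memory tiled DGEMM. It is not vvolkov's code and not what CUBLAS does internally, just the textbook tiling scheme written for double precision. The names dgemm_tiled, dgemm_host and TILE are made up for illustration, the matrices are assumed to be column-major as in BLAS, and transposed operands are not handled.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // tile edge; an assumed starting value, tune per GPU

// Computes C = alpha * A * B + beta * C for column-major matrices.
// A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc).
__global__ void dgemm_tiled(int m, int n, int k, double alpha,
                            const double* A, int lda,
                            const double* B, int ldb,
                            double beta, double* C, int ldc)
{
    __shared__ double As[TILE][TILE];  // As[kk][r] = A(row r of tile, t + kk)
    __shared__ double Bs[TILE][TILE];  // Bs[c][kk] = B(t + kk, column c of tile)

    // threadIdx.x walks down a column, so global loads of A, B and C
    // touch consecutive addresses within a warp.
    int row = blockIdx.x * TILE + threadIdx.x;
    int col = blockIdx.y * TILE + threadIdx.y;
    double acc = 0.0;

    for (int t = 0; t < k; t += TILE) {
        int aCol = t + threadIdx.y;
        int bRow = t + threadIdx.x;
        // Out-of-range elements are loaded as zero so edge tiles need no special case.
        As[threadIdx.y][threadIdx.x] =
            (row < m && aCol < k) ? A[row + aCol * lda] : 0.0;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < k && col < n) ? B[bRow + col * ldb] : 0.0;
        __syncthreads();

        for (int i = 0; i < TILE; ++i)
            acc += As[i][threadIdx.x] * Bs[threadIdx.y][i];
        __syncthreads();  // do not overwrite the tiles while others still read them
    }

    if (row < m && col < n) {
        double* c = &C[row + col * ldc];
        // As in BLAS, C is not read when beta is zero.
        *c = (beta == 0.0) ? alpha * acc : alpha * acc + beta * (*c);
    }
}

// Hypothetical host helper: copies densely packed matrices to the device,
// launches the kernel and copies C back. Returns the first CUDA error seen.
cudaError_t dgemm_host(int m, int n, int k, double alpha,
                       const std::vector<double>& A,
                       const std::vector<double>& B,
                       double beta, std::vector<double>& C)
{
    double *dA = nullptr, *dB = nullptr, *dC = nullptr;
    size_t bytesA = sizeof(double) * m * k;
    size_t bytesB = sizeof(double) * k * n;
    size_t bytesC = sizeof(double) * m * n;

    cudaError_t err = cudaMalloc(&dA, bytesA);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytesB);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytesC);
    if (err == cudaSuccess) err = cudaMemcpy(dA, A.data(), bytesA, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, B.data(), bytesB, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dC, C.data(), bytesC, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        dim3 block(TILE, TILE);
        dim3 grid((m + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        dgemm_tiled<<<grid, block>>>(m, n, k, alpha, dA, m, dB, k, beta, dC, m);
        err = cudaGetLastError();
    }
    // This blocking copy also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(C.data(), dC, bytesC, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

Double precision needs a GPU with compute capability 1.3 or higher. This version computes one element of C per thread, which keeps it easy to read but leaves a lot of performance on the table; computing several elements of C per thread in registers is the usual next step, so treat this only as a correct baseline to compare against.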