Generic DGEMM implementation

Hi,

I just started with CUDA and have been playing around with CUBLAS. For my work, I need a generic, well-performing DGEMM implementation that I can modify, so I need the source code, which is not available for CUBLAS…

Do you know of any “open” CUDA DGEMM implementation I could look into? If not, do you have any tips on how to implement an efficient DGEMM?

Best regards,

gemini

The source code of CUBLAS is available:
http://forums.nvidia.com/index.php?showtopic=59101

Search for the posts by vvolkov; he wrote the implementation used in the later CUBLAS versions. It is for SGEMM, but you can rewrite it for double precision.
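
As a starting point for such a rewrite, here is a minimal sketch of a shared-memory tiled DGEMM kernel. It is not vvolkov's code and not what CUBLAS does internally; it only shows the basic tiling idea with `double` in place of `float`. The names `dgemm_tiled`, `launch_dgemm` and `TILE` are made up for this example, the tile size of 16 is an untuned assumption, and only the non-transposed case with column-major storage is handled. Double precision needs a GPU with compute capability 1.3 or higher.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // tile edge (assumed value, not tuned for any particular GPU)

// Computes C = alpha * A * B + beta * C for column-major matrices.
// A is M x K (leading dimension lda), B is K x N (ldb), C is M x N (ldc).
__global__ void dgemm_tiled(int M, int N, int K, double alpha,
                            const double *A, int lda,
                            const double *B, int ldb,
                            double beta, double *C, int ldc)
{
    __shared__ double As[TILE][TILE];  // As[k][r]: tile of A, indexed by k then row
    __shared__ double Bs[TILE][TILE];  // Bs[c][k]: tile of B, indexed by column then k

    // threadIdx.x walks down a column, so global loads and stores are contiguous.
    int row = blockIdx.x * TILE + threadIdx.x;  // row of C handled by this thread
    int col = blockIdx.y * TILE + threadIdx.y;  // column of C handled by this thread
    double acc = 0.0;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        int ak = k0 + threadIdx.y;  // column of A loaded by this thread
        int bk = k0 + threadIdx.x;  // row of B loaded by this thread

        // Out-of-range elements are loaded as zero so edge tiles need no special case.
        As[threadIdx.y][threadIdx.x] = (row < M && ak < K) ? A[row + ak * lda] : 0.0;
        Bs[threadIdx.y][threadIdx.x] = (bk < K && col < N) ? B[bk + col * ldb] : 0.0;
        __syncthreads();

        // Partial dot product over the current tile.
        for (int k = 0; k < TILE; ++k)
            acc += As[k][threadIdx.x] * Bs[threadIdx.y][k];
        __syncthreads();
    }

    if (row < M && col < N) {
        // Following the BLAS convention, C is not read when beta is zero.
        double old = (beta == 0.0) ? 0.0 : beta * C[row + col * ldc];
        C[row + col * ldc] = alpha * acc + old;
    }
}

// Hypothetical host helper; dA, dB and dC must be device pointers.
void launch_dgemm(int M, int N, int K, double alpha,
                  const double *dA, int lda, const double *dB, int ldb,
                  double beta, double *dC, int ldc)
{
    dim3 block(TILE, TILE);
    dim3 grid((M + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    dgemm_tiled<<<grid, block>>>(M, N, K, alpha, dA, lda, dB, ldb, beta, dC, ldc);
}
```

Each thread computes a single element of C here, which keeps the code short but leaves a lot of performance on the table. Having each thread compute several elements of C in registers is the usual next step, and it would be worth comparing against the SGEMM code from the thread above.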