open source CuBlas


I was wondering if Nvidia has any plans for making the source code of the CuBlas library available. This would seem to be in the interest of everyone, particularly given that not all Blas functions have yet been implemented; optimization and interoperability with other Blas libraries will also be important once they have. I realize that Nvidia isn’t particularly keen on open source, but in this case it seems that none of the usual reasons for this policy (e.g. protecting intellectual property) apply. [I’m assuming here that the library is implemented exclusively in Cuda C, with no assembler tricks.]


I suppose that by not replying, NVIDIA is undecided.
I would like to add my voice to that of noegenesis! If CuBlas is implemented in CUDA C, then it would only be advantageous to NVIDIA if a few people started mucking about with it, trying to increase its performance or add features!

As a research student I believe that this move would be much appreciated all over academia.

We are going to make the source code of the CUBLAS and CUFFT libraries available.
The libraries are all written in Cuda, by the way.

Stay tuned!!!

I would also greatly appreciate the release of the CUBLAS source code. What is the latest news on this project? Do you plan to make it available soon?

Yes, the source code for CUBLAS and CUFFT will be available soon.

Are these available at this time?

The source will be posted later this month.

Does anyone have a link to the source code? I tried to google for it, but apart from this thread, nothing interesting showed up. I couldn’t find it in the downloads or the code examples either.

Announcements & news :

It is posted in the ‘CUDA Announcements and News’ forum:

hehe, beat you by a minute ;)


Looks like I am reopening a very ancient thread! I was googling around for the CUBLAS source code and this was the only result that came up. The link given in this thread says that the CUBLAS source code has been removed. Does anyone know what happened to it? Is it possible to get the source code at all? It would be very, very helpful.




I’m a PhD student and I’m very interested in the CUBLAS source code for my research. If someone has the code released by NVIDIA (and now removed for some reason), could you send me a copy?

thank you very much

Gianluca Moro

It’s still available in the registered developer’s section.

Hi jimh,

Can you tell me where the registered developer’s section is? I did not find it in the forum.


Hi Guys,

Does anyone still have the source or know where to get it? I assume it was taken down ages ago.

Kind Regards,

Even if you can locate the sources, consider that CUDA hardware and software have changed a lot over the years. If you are looking for source code since you need a feature not currently supported by CUBLAS, consider filing a feature request through the bug reporting form (simply prefix the synopsis with “RFE:” to mark it as a feature request rather than a bug).

If you’re ok with Maxwell and no double precision support, my gemm kernels are released under Apache2 here:

That code generally outperforms CUBLAS across the board, and is sometimes 2-3x faster in some dimensions important for deep learning. The Python lib is also a pretty slick GPU-accelerated NumPy implementation. It’s still a work in progress and I haven’t had much chance to fully document it yet. But it should have lots of applications outside of deep learning.

Hello Josh, it’s a few years later. Out of keen interest: did you somehow manage to get the CUBLAS source code, or have a guy visit you who accidentally dropped a USB stick with some of it?

Reverse engineering has never been my thing, of course.

I ran a GPU burn test, written by a cool guy, that uses a matrix calculation function. Based on a profiler report, it prints how many double precision TFLOPS I’m getting (1.0 and 0.9 for the Titan Z here, after a second or ten). Yet that’s a number of the ‘we at Coca-Cola recommend Coca-Cola’ kind.

I want to count the instructions by hand and calculate my own numbers. If you do an O(n^3) matrix multiplication, I would guess that it’s simple C code and not anything lower level, as there is a world of optimizations possible before moving down to lower echelons :)
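
For anyone wanting to check a reported TFLOPS figure by hand, here is a minimal sketch of the usual counting convention for an O(n^3) matrix multiplication: each inner-loop step is one multiply plus one add, so an m x k by k x n product costs 2*m*n*k floating-point operations. This is plain Python for illustration only, not how CUBLAS or the burn test are implemented; the function names (`naive_matmul`, `gemm_flops`, `achieved_tflops`) are made up for this example.

```python
def naive_matmul(a, b):
    """Multiply two square matrices (lists of lists) with the textbook
    triple loop, counting floating-point operations along the way."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    flops = 0
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]  # one multiply + one add
                flops += 2
            c[i][j] = acc
    return c, flops


def gemm_flops(m, n, k):
    """Conventional operation count for an (m x k) times (k x n) product."""
    return 2 * m * n * k


def achieved_tflops(m, n, k, seconds):
    """Throughput implied by a measured wall-clock time, in TFLOPS."""
    return gemm_flops(m, n, k) / seconds / 1e12


if __name__ == "__main__":
    c, counted = naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(c, counted, gemm_flops(2, 2, 2))
    # Hypothetical timing: an 8192^3 double precision GEMM finishing in 1 s.
    print(achieved_tflops(8192, 8192, 8192, 1.0))
```

Dividing the operation count by your own measured kernel time gives a number you can compare against whatever the tool prints. Note that this convention counts arithmetic only; it says nothing about loads, stores, or address calculations, which a hand count of actual instructions would also include.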