cuBLAS for lower-end GPUs

I’m implementing some machine learning algorithms on the CPU, and they are heavy on matrix computations. I intended to write an alternative GPU implementation using cuBLAS and compare performance. However, the whole system must run on rather low-end platforms (CUDA compute capability 2.1), and I have found that cuBLAS requires a higher compute capability. If that is the case, I’d like to implement my own CUDA kernels for the BLAS subset I use, and I intend to make them API-compatible with cuBLAS. Now the question: I read in older topics on this forum that the source for cuBLAS was available, but it no longer seems to be. Can I find it somewhere to use as a reference, or is there any other resource that can help me with the implementation?

Thanks in advance.

cuBLAS works on GPUs of compute capability (cc) 2.0 or higher, so a cc 2.1 device is covered.
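For comparing the GPU path against the existing CPU code, it may help to have a small CPU reference that follows the same conventions as cuBLAS, namely column-major storage with explicit leading dimensions. The following is only a minimal sketch under that assumption: the function name `ref_sgemm_nn` and the `IDX2C` helper are hypothetical names chosen for illustration, it covers only the non-transposed single-precision case, and it is not taken from the cuBLAS source.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (row, col) with leading dimension ld.
   cuBLAS stores matrices column-major; this helper name is an assumption. */
#define IDX2C(row, col, ld) ((size_t)(col) * (size_t)(ld) + (size_t)(row))

/* Hypothetical CPU reference for C = alpha * A * B + beta * C (no transposes).
   A is m x k, B is k x n, C is m x n, all stored column-major. */
static void ref_sgemm_nn(int m, int n, int k,
                         float alpha, const float *A, int lda,
                         const float *B, int ldb,
                         float beta, float *C, int ldc)
{
    /* Leading dimensions must be at least the number of rows stored. */
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[IDX2C(i, p, lda)] * B[IDX2C(p, j, ldb)];

            /* With beta == 0 the old contents of C are ignored entirely,
               so uninitialized or NaN values there do not leak through. */
            float prev = (beta == 0.0f) ? 0.0f : beta * C[IDX2C(i, j, ldc)];
            C[IDX2C(i, j, ldc)] = alpha * acc + prev;
        }
    }
}
```

Feeding the same column-major arrays to this routine and to the GPU SGEMM call, then comparing the outputs within a small floating-point tolerance, gives a quick correctness check before any timing is done.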