Using CUDA/cuBLAS in a vector/matrix library

Hi all,

For some of my work I have implemented a Vector/Matrix library in C++ (using SSE for optimization where available).
Now I want to try coding a CUDA version of this library (using cuBLAS), and I have a couple of questions before I start. For now this is just an exercise to get familiar with CUDA, so I don’t directly care about beating my SSE-optimized code, but any comments like “Don’t write your library with CUDA because…” are welcome.

Would it make any sense to port the smaller types (Vec2f, Vec3f, Vec4f, Matrix4f, etc.) to CUDA, or is the number of elements too small? I have held off on this so far, focusing first on my n-dimensional vector class (where n is usually in the thousands).

Which would be the better approach: storing my values (float arrays) in host memory and copying them to the GPU for each operation, or keeping them resident in GPU memory so I don’t have to copy at all?

Thanks in advance,


In general, the more work you can assemble in one batch to process, the better.

So your Vec3f etc. would directly correspond to CUDA’s float3 type. The big thing to watch out for is having enough work to do relative to the data transfer to/from the card, which is the slowest part of the whole process. So if you have many “small problems” to compute, like Vec3 * Matrix3, batch 10000 of them together and transfer and process the whole block at once. A separate transfer for every single operation would be devastatingly slow and would leave most GPU processors idle.


Exactly. That’s what I thought. Just needed to confirm it though. :)

So I will continue thinking about/porting my dense and sparse matrix code as well as n-dim vector code to CUDA to see what it offers me.
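For the n-dim vector side, a cuBLAS port can start very small. Below is a hedged sketch of the classic SAXPY pattern (y = alpha*x + y) using the cuBLAS v2 API; it needs a CUDA-capable device and compiles with nvcc (error checking omitted for brevity). The vector size 4096 is just an example.

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4096;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float alpha = 3.0f;

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // One bulk transfer each way; if you chain several operations,
    // keep d_x / d_y resident on the device between calls.
    cublasSetVector(n, sizeof(float), x.data(), 1, d_x, 1);
    cublasSetVector(n, sizeof(float), y.data(), 1, d_y, 1);

    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1); // d_y = alpha*d_x + d_y

    cublasGetVector(n, sizeof(float), d_y, 1, y.data(), 1);
    std::printf("y[0] = %g\n", y[0]); // 3*1 + 2 = 5

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

The same structure (allocate once, transfer in bulk, run many device-side operations, transfer back) carries over to the dense matrix routines like `cublasSgemv` and `cublasSgemm`.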

Assembling lots of tiny computations into a large batch is more application dependent, so I will get to that later. Fun fun fun. :)