For some of my work, I have implemented a Vector/Matrix library in C++ (using SSE for optimization where available).
Now I want to try coding a CUDA version of this library (using cuBLAS), and I have a couple of questions before I start. For now this is just an exercise to get familiar with CUDA, so I don't directly care about being faster than my SSE-optimized code, but any comments like "Don't write your library with CUDA because..." are welcome.
Would it make any sense to port the smaller types (Vec2f, Vec3f, Vec4f, Matrix4f, etc.) to CUDA, or is the number of elements too small? So far I have held off on this and focused first on my n-dimensional vector class (where n is usually in the thousands).
Which would be the better approach: storing my values (float arrays) in host memory and copying them to the GPU for each operation, or keeping them resident in GPU memory so I don't have to copy them?
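To make the second question concrete, here is a minimal sketch of the device-resident approach I have in mind. The class name `DeviceVecNf` and its layout are hypothetical, not from my existing library; the cuBLAS/CUDA runtime calls (`cudaMalloc`, `cudaMemcpy`, `cublasSaxpy`) are standard, but this would need an NVIDIA GPU and error checking to be real code:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// Hypothetical device-resident vector: the float array lives in GPU
// memory for the object's lifetime, so chained operations avoid a
// host<->device round-trip per call.
struct DeviceVecNf {
    float* d_data = nullptr;
    int n = 0;

    explicit DeviceVecNf(const std::vector<float>& host)
        : n(static_cast<int>(host.size())) {
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemcpy(d_data, host.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice);
    }
    ~DeviceVecNf() { cudaFree(d_data); }

    // Copy back to the host only when a result is actually needed.
    std::vector<float> toHost() const {
        std::vector<float> out(n);
        cudaMemcpy(out.data(), d_data, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        return out;
    }
};

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);

    DeviceVecNf x(std::vector<float>(4096, 1.0f));
    DeviceVecNf y(std::vector<float>(4096, 2.0f));

    // y = 3*x + y, computed entirely on the device.
    const float alpha = 3.0f;
    cublasSaxpy(handle, x.n, &alpha, x.d_data, 1, y.d_data, 1);

    std::printf("y[0] = %f\n", y.toHost()[0]);

    cublasDestroy(handle);
    return 0;
}
```

The alternative (host-resident storage) would instead upload the arrays inside every operator, which is what I suspect makes per-operation copying expensive for vectors of this size.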
Thanks in advance,