Approach to take for multiple small matrices


I have a problem that I know how to tackle via traditional GPGPU methods, but I'm not so sure how to approach it with CUDA. I have a whole ton of 4x4 matrices that I want to batch together, performing a conjugate gradient (CG) solve on each.
If I were doing this traditionally, I'd have a fragment program that does the CG solve and looks up its input data from a texture, depending on the output fragment position.
What is the equivalent way of achieving this with CUDA? I’m afraid I’ve been spending too much time setting up hardware to have really sat down and looked at CUDA properly yet, so would appreciate a push in the right general direction.
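
To make the question concrete, here is my rough guess at what the direct translation might look like: one thread per matrix, with the global thread index standing in for the output fragment position and plain global-memory arrays standing in for the texture. The names (`cg4x4`, `solveBatch`), the packed row-major layout, the fixed iteration cap and the assumption that every matrix is symmetric positive definite are all mine, and error checking is left out, so please treat it as a sketch to be corrected rather than a working recommendation.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed layout (hypothetical): matrix i is stored row-major in
// A[16*i .. 16*i+15], its right-hand side in b[4*i .. 4*i+3], and the
// solution is written to x[4*i .. 4*i+3].
__global__ void cg4x4(const float* A, const float* b, float* x, int n)
{
    // The global thread index plays the role of the fragment position.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float* M = A + 16 * i;
    float xi[4] = {0.f, 0.f, 0.f, 0.f};   // initial guess x = 0
    float r[4], p[4], Ap[4];
    float rr = 0.f;
    for (int k = 0; k < 4; ++k) {          // r = b - A*0 = b, p = r
        r[k] = b[4 * i + k];
        p[k] = r[k];
        rr += r[k] * r[k];
    }

    // CG on a 4x4 SPD system needs at most 4 steps in exact arithmetic;
    // a few extra are allowed for float rounding. The absolute tolerance
    // is a crude assumption and would need scaling for real data.
    for (int it = 0; it < 8 && rr > 1e-12f; ++it) {
        float pAp = 0.f;
        for (int row = 0; row < 4; ++row) {
            float s = 0.f;
            for (int col = 0; col < 4; ++col) s += M[4 * row + col] * p[col];
            Ap[row] = s;
            pAp += p[row] * s;
        }
        float alpha = rr / pAp;
        float rrNew = 0.f;
        for (int k = 0; k < 4; ++k) {
            xi[k] += alpha * p[k];
            r[k]  -= alpha * Ap[k];
            rrNew += r[k] * r[k];
        }
        float beta = rrNew / rr;
        for (int k = 0; k < 4; ++k) p[k] = r[k] + beta * p[k];
        rr = rrNew;
    }
    for (int k = 0; k < 4; ++k) x[4 * i + k] = xi[k];
}

// Host helper: copy the batch to the device, launch one thread per matrix,
// and copy the solutions back. Return codes are ignored in this sketch.
std::vector<float> solveBatch(const std::vector<float>& A,
                              const std::vector<float>& b, int n)
{
    if (n <= 0) return {};
    float *dA, *dB, *dX;
    cudaMalloc(&dA, 16 * n * sizeof(float));
    cudaMalloc(&dB, 4 * n * sizeof(float));
    cudaMalloc(&dX, 4 * n * sizeof(float));
    cudaMemcpy(dA, A.data(), 16 * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), 4 * n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    cg4x4<<<blocks, threads>>>(dA, dB, dX, n);

    std::vector<float> x(4 * n);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(x.data(), dX, 4 * n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dX);
    return x;
}
```

Calling `solveBatch(A, b, n)` with `n` packed matrices and right-hand sides would return the `4*n` solution components in the same order. Whether this thread-per-matrix mapping and memory layout is actually a sensible choice in CUDA is exactly what I'd like advice on.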