I’m working on a problem where I have to calculate covariance matrices for a large number of time series. Each covariance matrix is 4 x 4, and I want to do this for something like 100,000 voxels. My original data is 4D, with a resolution of, for example, 64 x 64 x 32 x 80, meaning that I have 80 timesamples to sum over, where each timesample is a volume. After some processing I have 4 datasets of the same size, and I want to calculate the covariance matrix in each voxel, i.e. sum over the 80 timesamples, C += x * x’, where x is the 4 x 1 vector of values in that voxel at each timesample.

How do I do this in the best way with CUDA?

I guess that I should read the data into shared memory, calculate a matrix for each timesample, and sum over the timesamples. The small size of shared memory makes this difficult, however. The resulting covariance matrix can be kept in registers in each thread (as long as I don’t index the array with variables). If I for example use 128 threads per block and store 8 timesamples from each of the 4 datasets, a single block uses all of the shared memory (128 threads * 8 timesamples * 4 datasets * 4 bytes per float = 16 KB), and I guess the multiprocessor will then be underutilized. If I instead use 64 threads I can store 16 timesamples and get coalesced reads in chunks of 16 samples at a time (instead of 8), but then only 64 threads run on the multiprocessor. I also have to repeat this summation, reading data into shared memory until I have summed all the timesamples, i.e. 5 passes if I load 16 timesamples at a time and have 80 timesamples.
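Here is a rough sketch of the kernel I have in mind for the 64-thread variant (16 timesamples staged per pass). All names, the [timesample][voxel] layout, and the assumption that the voxel count is a multiple of the block size are mine, and I haven’t verified that this is the best scheme:

```cuda
#define NDATASETS 4
#define CHUNK     16   // timesamples staged in shared memory per pass
#define NTHREADS  64   // threads (voxels) per block

// Each dataset is laid out as [timesample][voxel], so for a fixed (t, d)
// consecutive threads read consecutive voxels (coalesced).
// Assumes nvoxels is a multiple of NTHREADS (true for 64 * 64 * 32 voxels).
__global__ void covariance_kernel(const float *data0, const float *data1,
                                  const float *data2, const float *data3,
                                  float *cov,   // output: 16 values per voxel
                                  int nvoxels, int nsamples)
{
    // CHUNK * NDATASETS values per voxel for NTHREADS voxels:
    // 16 * 4 * 64 * 4 bytes = 16 KB of shared memory per block.
    __shared__ float s[CHUNK][NDATASETS][NTHREADS];

    const int voxel = blockIdx.x * NTHREADS + threadIdx.x;
    const float *data[NDATASETS] = { data0, data1, data2, data3 };

    // The 4 x 4 accumulator stays in registers as long as the loops unroll,
    // so C is only ever indexed with compile-time constants.
    float C[NDATASETS][NDATASETS] = { 0 };

    for (int base = 0; base < nsamples; base += CHUNK) {
        // Stage one chunk of timesamples; coalesced loads across the block.
        for (int t = 0; t < CHUNK; t++)
            for (int d = 0; d < NDATASETS; d++)
                s[t][d][threadIdx.x] = data[d][(base + t) * nvoxels + voxel];
        __syncthreads();

        // Accumulate C += x * x' over the staged timesamples.
        for (int t = 0; t < CHUNK; t++)
            for (int i = 0; i < NDATASETS; i++)
                for (int j = 0; j < NDATASETS; j++)
                    C[i][j] += s[t][i][threadIdx.x] * s[t][j][threadIdx.x];
        __syncthreads();
    }

    // Write the 16 entries back to global memory, one plane per entry.
    for (int i = 0; i < NDATASETS; i++)
        for (int j = 0; j < NDATASETS; j++)
            cov[(i * NDATASETS + j) * nvoxels + voxel] = C[i][j];
}
```

Note that in this sketch each thread only ever reads back the values it wrote itself (s[...][threadIdx.x]), so I’m not sure the staging through shared memory buys anything beyond the coalescing, which is partly why I’m asking.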

CUBLAS is designed for matrix multiplications, but can I call CUBLAS from each thread in order to do small matrix multiplications?