Kernel launch performance

Hello,

Suppose I want to calculate the SVD of N 4x4 matrices. In general, which do you think would be quicker: calculating the SVDs on the CPU (dual-core, 3 GHz) or on the GPU? Assume the GPU computations do not need a memory transfer.

Basically, will it be worth invoking the kernel N times?

Thanks in advance for your help.

If N is large enough, it can be worth computing on the GPU.

You might need N to be on the order of 65536 before you reach break-even.

ONLY INVOKE THE KERNEL ONCE. Have several matrices per block.
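To make "one launch, several matrices per block" concrete, here is a minimal sketch, not a tested or tuned implementation. It assumes single-precision, row-major matrices stored back to back, and it only computes singular values (no U or V). The kernel name svd4x4Batched, the one-sided Jacobi method, the fixed sweep count and the launch configuration are all my own assumptions for illustration; none of this is a CULA routine. Error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not part of CULA): each thread computes the singular
// values of one row-major 4x4 float matrix using one-sided Jacobi rotations.
__global__ void svd4x4Batched(const float* mats, float* sigma, int n)
{
    // Grid-stride loop, so a single launch covers all n matrices.
    for (int m = blockIdx.x * blockDim.x + threadIdx.x; m < n;
         m += gridDim.x * blockDim.x) {
        float a[16];
        for (int i = 0; i < 16; ++i) a[i] = mats[16 * m + i];

        for (int sweep = 0; sweep < 10; ++sweep) {   // fixed sweep count (assumption)
            for (int p = 0; p < 3; ++p) {
                for (int q = p + 1; q < 4; ++q) {
                    // Squared norms and inner product of columns p and q.
                    float alpha = 0.f, beta = 0.f, gamma = 0.f;
                    for (int i = 0; i < 4; ++i) {
                        alpha += a[4 * i + p] * a[4 * i + p];
                        beta  += a[4 * i + q] * a[4 * i + q];
                        gamma += a[4 * i + p] * a[4 * i + q];
                    }
                    // Skip pairs that are already (numerically) orthogonal.
                    if (fabsf(gamma) <= 1e-7f * sqrtf(alpha * beta)) continue;

                    // Rotation that makes columns p and q orthogonal.
                    float zeta = (beta - alpha) / (2.f * gamma);
                    float t = copysignf(1.f, zeta) /
                              (fabsf(zeta) + sqrtf(1.f + zeta * zeta));
                    float c = rsqrtf(1.f + t * t);
                    float s = c * t;
                    for (int i = 0; i < 4; ++i) {
                        float ap = a[4 * i + p], aq = a[4 * i + q];
                        a[4 * i + p] = c * ap - s * aq;
                        a[4 * i + q] = s * ap + c * aq;
                    }
                }
            }
        }
        // After the sweeps, the column norms are the singular values (unsorted).
        for (int j = 0; j < 4; ++j) {
            float s2 = 0.f;
            for (int i = 0; i < 4; ++i) s2 += a[4 * i + j] * a[4 * i + j];
            sigma[4 * m + j] = sqrtf(s2);
        }
    }
}

int main()
{
    const int n = 1 << 20;                       // about a million matrices
    std::vector<float> h(16 * (size_t)n, 0.f);
    for (int m = 0; m < n; ++m) {                // made-up test data
        float* a = &h[16 * (size_t)m];
        a[0] = 3.f; a[4] = 4.f; a[5] = 5.f;      // 2x2 block [[3,0],[4,5]]
        a[10] = 2.f; a[15] = 1.f;                // exact values: sqrt(45), sqrt(5), 2, 1
    }
    float *dMats, *dSigma;
    cudaMalloc(&dMats, h.size() * sizeof(float));
    cudaMalloc(&dSigma, 4 * (size_t)n * sizeof(float));
    cudaMemcpy(dMats, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

    svd4x4Batched<<<1024, 256>>>(dMats, dSigma, n);   // the only kernel launch

    float s[4];
    cudaMemcpy(s, dSigma, sizeof(s), cudaMemcpyDeviceToHost);
    printf("%f %f %f %f\n", s[0], s[1], s[2], s[3]);
    cudaFree(dMats);
    cudaFree(dSigma);
    return 0;
}
```

With one thread per matrix, every block of 256 threads works on 256 matrices and the launch overhead is paid once instead of N times. Note that each thread reads its own 16 floats here, so the global loads are strided rather than coalesced.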

Two 4x4 matrices (32 elements, one per thread) could be coalesced into one read by 32 threads if stored linearly in memory.
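A rough sketch of that access pattern, under my own assumptions: single-precision matrices stored contiguously, one thread per matrix for the actual work, and a shared-memory staging tile. Consecutive threads read consecutive floats, so each warp-wide read covers exactly two matrices. The per-matrix work here is only a stand-in (the squared Frobenius norm, which equals the sum of the squared singular values), not an SVD. The names frob4x4Coalesced, THREADS and PITCH are hypothetical, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 128;   // threads per block = matrices per block (assumption)
constexpr int PITCH   = 17;    // 16 floats per matrix + 1 pad to reduce bank conflicts

// Hypothetical kernel: stages the block's matrices into shared memory with
// coalesced reads, then each thread processes one matrix from the tile.
__global__ void frob4x4Coalesced(const float* mats, float* out, int n)
{
    __shared__ float tile[THREADS * PITCH];

    const long long first = (long long)blockIdx.x * THREADS * 16; // first float of this block
    const long long total = (long long)n * 16;                    // floats in the whole batch

    // Cooperative load: thread t reads floats t, t + THREADS, t + 2*THREADS, ...
    // so the 32 threads of a warp always touch 32 consecutive floats.
    for (int k = 0; k < 16; ++k) {
        int local = k * THREADS + threadIdx.x;     // float index within the block
        long long g = first + local;               // float index within the batch
        if (g < total)
            tile[(local / 16) * PITCH + (local % 16)] = mats[g];
    }
    __syncthreads();

    // Stand-in for the per-matrix work: squared Frobenius norm of one matrix.
    int m = blockIdx.x * THREADS + threadIdx.x;
    if (m < n) {
        float s = 0.f;
        for (int i = 0; i < 16; ++i) {
            float v = tile[threadIdx.x * PITCH + i];
            s += v * v;
        }
        out[m] = s;
    }
}

int main()
{
    const int n = 1000;                              // made-up batch size
    std::vector<float> h(16 * (size_t)n, 1.0f);      // made-up data: all ones
    std::vector<float> r(n);

    float *dMats, *dOut;
    cudaMalloc(&dMats, h.size() * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dMats, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + THREADS - 1) / THREADS;        // enough blocks to cover n matrices
    frob4x4Coalesced<<<blocks, THREADS>>>(dMats, dOut, n);

    cudaMemcpy(r.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("first: %f  last: %f\n", r[0], r[n - 1]);
    cudaFree(dMats);
    cudaFree(dOut);
    return 0;
}
```

The padding of one float per matrix is there because a stride of exactly 16 floats would make many threads of a warp hit the same shared-memory bank when each one walks through its own matrix.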

N is roughly 1.3 million.

The problem is that I am using a library for the SVD computation, namely CULA, and it only allows a single SVD to be computed in one kernel.

Yes, unfortunately I don’t think CULA will do you any good here.