Kernel launch performance

tdhd · May 11, 2011, 5:15pm

Hello,

Suppose I want to calculate the SVD of N 4x4 matrices. In general, which do you think would be quicker: calculating the SVDs on the CPU (dual-core, 3 GHz) or on the GPU? Assume the data is already on the GPU, so the GPU computation needs no memory transfer.

Basically, will it be worth invoking the kernel N times?

Thanks in advance for your help.

Jimmy_Pettersson · May 11, 2011, 8:57pm

If N is large enough it can be worth computing on the GPU.

You might need N to be on the order of 65536 before you reach break-even.

ONLY INVOKE THE KERNEL ONCE. Have several matrices per block.

Two 4x4 matrices (32 values in total) could be coalesced into one read by 32 threads if they are stored linearly in memory.
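To make the "one launch, several matrices per block" idea concrete, here is a minimal sketch, not a tuned implementation. It assumes the matrices are stored back to back in row-major order as single-precision floats, and it computes only the singular values (no U or V) with a fixed number of one-sided Jacobi sweeps. The names `batched_svd4x4`, `jacobi_singular_values` and `MATS_PER_BLOCK` are made up for illustration and are not part of CULA or any other library. The block first loads its matrices cooperatively, so consecutive threads read consecutive floats, and then each thread works on one matrix in shared memory.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int MATS_PER_BLOCK = 64;   // hypothetical tuning choice, one thread per matrix
constexpr int ELEMS  = 16;           // floats per 4x4 matrix
constexpr int STRIDE = ELEMS + 1;    // padded stride to avoid shared-memory bank conflicts

// Singular values of one row-major 4x4 matrix via one-sided Jacobi:
// rotate column pairs until they are orthogonal; column norms are the result.
__device__ void jacobi_singular_values(float* a, float* s)
{
    for (int sweep = 0; sweep < 10; ++sweep) {          // fixed sweep count (assumption)
        for (int p = 0; p < 3; ++p) {
            for (int q = p + 1; q < 4; ++q) {
                float alpha = 0.f, beta = 0.f, gamma = 0.f;
                for (int i = 0; i < 4; ++i) {
                    alpha += a[i * 4 + p] * a[i * 4 + p];
                    beta  += a[i * 4 + q] * a[i * 4 + q];
                    gamma += a[i * 4 + p] * a[i * 4 + q];
                }
                // Columns already (numerically) orthogonal: nothing to do.
                if (fabsf(gamma) <= 1e-6f * sqrtf(alpha * beta)) continue;
                float zeta = (beta - alpha) / (2.f * gamma);
                float t  = copysignf(1.f, zeta) / (fabsf(zeta) + sqrtf(1.f + zeta * zeta));
                float c  = rsqrtf(1.f + t * t);
                float sn = c * t;
                for (int i = 0; i < 4; ++i) {
                    float ap = a[i * 4 + p], aq = a[i * 4 + q];
                    a[i * 4 + p] = c * ap - sn * aq;
                    a[i * 4 + q] = sn * ap + c * aq;
                }
            }
        }
    }
    for (int j = 0; j < 4; ++j) {                       // column norms, unsorted
        float norm2 = 0.f;
        for (int i = 0; i < 4; ++i) norm2 += a[i * 4 + j] * a[i * 4 + j];
        s[j] = sqrtf(norm2);
    }
}

__global__ void batched_svd4x4(const float* A, float* S, int n)
{
    __shared__ float tile[MATS_PER_BLOCK * STRIDE];
    const int first = blockIdx.x * MATS_PER_BLOCK;      // first matrix of this block
    const int count = min(MATS_PER_BLOCK, n - first);   // last block may be partial

    // Cooperative load: consecutive threads read consecutive floats (coalesced).
    for (int k = threadIdx.x; k < count * ELEMS; k += blockDim.x)
        tile[(k / ELEMS) * STRIDE + (k % ELEMS)] = A[(size_t)first * ELEMS + k];
    __syncthreads();

    if (threadIdx.x < count) {
        float s[4];
        jacobi_singular_values(&tile[threadIdx.x * STRIDE], s);
        for (int j = 0; j < 4; ++j)
            S[(size_t)(first + threadIdx.x) * 4 + j] = s[j];
    }
}

int main()
{
    const int n = 1300000;                               // N mentioned in the thread
    std::vector<float> hA((size_t)n * ELEMS);
    for (size_t i = 0; i < hA.size(); ++i)
        hA[i] = (float)((i * 7 + 3) % 11) - 5.f;         // arbitrary test data

    float *dA = nullptr, *dS = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dS, (size_t)n * 4 * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int blocks = (n + MATS_PER_BLOCK - 1) / MATS_PER_BLOCK;
    batched_svd4x4<<<blocks, MATS_PER_BLOCK>>>(dA, dS, n);  // a single launch for all N
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float s[4];
    cudaMemcpy(s, dS, sizeof(s), cudaMemcpyDeviceToHost);
    printf("singular values of matrix 0 (unsorted): %g %g %g %g\n", s[0], s[1], s[2], s[3]);
    cudaFree(dA);
    cudaFree(dS);
    return 0;
}
```

One thread per matrix keeps the sketch short. Whether that beats splitting each 4x4 matrix across several threads depends on the hardware, so treat the block size and layout as starting points to measure rather than as recommendations.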

tdhd · May 11, 2011, 9:29pm

N is roughly 1.3 million.

The problem is that I am using a library for the SVD computation, namely CULA, and it only allows a single SVD to be computed per kernel.

Jimmy_Pettersson · May 11, 2011, 9:34pm

Yes, unfortunately I don’t think CULA will do you any good here.
