Optimize problem regarding problem size

myname · May 24, 2011, 8:03pm

Hey there,
Let’s assume I have a problem like matrix multiplication. When the problem size is quite small, shared memory would be the most efficient place to compute the result matrix, but it is limited in size, so once that limit is exceeded one has to use global memory. Is there an easy way to handle both cases in ONE code? I think the kernel launch configuration (block size, grid size) also has to be adapted, right?
Which package could I use for things like matrix multiplication or vector reductions? Do such packages automatically pick the faster memory depending on the problem size?
Thanks!

tera · May 24, 2011, 8:56pm

Matrix multiplication can also be performed in tiles, which can be chosen small enough to still fit into shared memory regardless of the matrix size.
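
A minimal sketch of that tiling idea, assuming square row-major matrices and a tile edge of 16 (the names `TILE`, `matmul_tiled`, and `tiled_matmul_matches_cpu` are made up for illustration). The shared memory used per block depends only on `TILE`, not on `n`; only the grid size in the launch changes with the problem size.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile edge: 2 * 16 * 16 floats = 2 KiB of shared memory per block

// C = A * B for square n x n row-major matrices, computed one tile at a time.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk along the shared dimension in steps of one tile.
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Pad with zeros where a tile hangs over the matrix edge.
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded before any thread reads them

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all threads finished before the tiles are overwritten
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical helper: runs the kernel and compares against a plain CPU loop.
bool tiled_matmul_matches_cpu(int n) {
    std::vector<float> hA(n * n), hB(n * n), hC(n * n);
    for (int i = 0; i < n * n; ++i) { hA[i] = (i % 7) * 0.5f; hB[i] = (i % 5) - 2.0f; }

    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // The block size is fixed by TILE; only the grid grows with n.
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    cudaError_t err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            if (std::fabs(ref - hC[r * n + c]) > 1e-3f * (1.0f + std::fabs(ref))) return false;
        }
    return true;
}

int main() {
    std::printf("n=100: %s\n", tiled_matmul_matches_cpu(100) ? "match" : "MISMATCH");
    return 0;
}
```

This is only meant to show the structure; it makes no attempt at the tuning a library implementation would do.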

myname · May 24, 2011, 9:26pm

Oh okay, true, I was just confused. Do you know a good package that does this efficiently? I don’t want to reinvent the wheel every time :)

avidday · May 25, 2011, 6:14am

CUDA ships with CUBLAS, which contains a pretty good gemm() implementation for matrix-matrix multiplication.
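
A rough sketch of what such a call could look like, assuming the handle-based `cublas_v2.h` interface and single precision (`cublasSgemm`); the wrapper name `gemm_with_cublas` is made up for illustration. cuBLAS expects column-major storage, so for row-major inputs the sketch swaps the two operands, which yields the row-major product without any explicit transpose.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical wrapper: C = A * B for row-major n x n matrices via cuBLAS.
bool gemm_with_cublas(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int n) {
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return false;

    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    // A row-major matrix read as column-major is its transpose, so asking
    // cuBLAS for B^T * A^T = (A * B)^T leaves A * B in row-major order in dC.
    cublasStatus_t st = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                    n, n, n, &alpha, dB, n, dA, n, &beta, dC, n);

    cudaError_t err = cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
    return st == CUBLAS_STATUS_SUCCESS && err == cudaSuccess;
}

int main() {
    // A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]], both row-major.
    std::vector<float> A = {1, 2, 3, 4}, B = {5, 6, 7, 8}, C(4);
    if (!gemm_with_cublas(A, B, C, 2)) { std::printf("cuBLAS call failed\n"); return 1; }
    std::printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```

Built with something like `nvcc example.cu -lcublas`. The caller never chooses a tile size or launch configuration here; the library handles that internally.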

myname · May 25, 2011, 9:19am

Okay, thanks so far. It is probably better to use it than to implement it myself. One would probably not match the performance of an already implemented library from NVIDIA :)
