Is it possible to use the L1 cache instead of shared memory when implementing blocked matmuls in CUDA?
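
For reference, here is a minimal sketch of the kind of shared-memory-tiled ("blocked") matmul kernel the question is about; the tile size, kernel name, and square row-major layout are my own assumptions, not taken from the thread:

```cuda
#include <cuda_runtime.h>

#define TILE 32  // each thread block computes a TILE x TILE patch of C

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N)
{
    // Staging buffers in shared memory (the programmer-managed on-chip cache).
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk over the K dimension one tile at a time.
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Each thread loads one element of A and one of B into shared memory,
        // guarding against out-of-range reads when N is not a multiple of TILE.
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // Inner product over the tile; each staged element is reused TILE times.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

The question, in other words, is whether the explicit `__shared__` staging above could be dropped and the same data reuse captured automatically by the L1 cache.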

I see, thank you very much for the explanation! I've wondered why CPUs don't use a combined L1-cache/shared-memory approach and let the programmer explicitly place data in the cache. It seems very helpful to have both an automatically hardware-managed cache and a programmer-controlled cache, like shared memory on GPUs, so that explicit cache control is at our disposal when we need it. Is there any reason CPUs are not designed that way?