Is it possible to use the L1 cache instead of shared memory when implementing blocked matmuls in CUDA?
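
For reference, here is a minimal sketch of the kind of shared-memory-tiled ("blocked") matmul kernel the question is about; the tile size, kernel name, and square row-major layout are my own assumptions, not taken from the thread:

```cuda
#include <cuda_runtime.h>

#define TILE 32  // each thread block computes a TILE x TILE patch of C

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N)
{
    // Staging buffers in shared memory (the programmer-managed on-chip cache).
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk over the K dimension one tile at a time.
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Each thread loads one element of A and one of B into shared memory,
        // guarding against out-of-range reads when N is not a multiple of TILE.
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // Inner product over the tile; each staged element is reused TILE times.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

The question, in other words, is whether the explicit `__shared__` staging above could be dropped and the same data reuse captured automatically by the L1 cache.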

I see, thank you very much for the explanation! I've wondered why CPUs don't use a combined L1-cache/shared-memory approach and let the programmer explicitly place data in the cache. It seems very helpful to have both an automatically hardware-managed cache and a programmer-controlled cache, like shared memory on GPUs, so that explicit cache control is at our disposal when we need it. Is there any reason CPUs are not designed that way?