Hi everyone, I'm having problems understanding matrix multiplication with non-square matrices.

I am using dgemm and sgemm on my GPU to multiply a [4x4] matrix by a [4xv] matrix, where v ranges from 1000 to 50000000.

For all v values less than 32768 the execution time is about 0.25 ms, whereas if v is 32768+1 it takes about 0.445 ms. This is for double precision.

Is this a limit of the L2 cache, reached when storing two rows of 32768 doubles (2*32768*8 = 524288 bytes), or two rows of 65536 floats for single precision (2*65536*4 = 524288 bytes)?

I wonder if I'm right in these predictions?
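To sanity-check the arithmetic above, here is a minimal Python sketch. It only compares byte counts against the L2 size reported by deviceQuery; it does not show that the L2 cache is what causes the jump in time. The helper name `rows_bytes` and the choice of two rows are assumptions taken from the calculation in the question, not something cuBLAS documents.

```python
# L2 cache size reported by deviceQuery for the GTX 1050, in bytes.
L2_CACHE_BYTES = 524288

def rows_bytes(v, elem_size, rows=2):
    """Bytes occupied by `rows` rows of length v (hypothetical helper)."""
    return rows * v * elem_size

# Two rows at the observed threshold, as assumed in the question.
double_bytes = rows_bytes(32768, 8)   # double precision, 8 bytes per element
float_bytes = rows_bytes(65536, 4)    # single precision, 4 bytes per element

print(double_bytes, double_bytes == L2_CACHE_BYTES)
print(float_bytes, float_bytes == L2_CACHE_BYTES)

# For comparison: the whole [4xv] double matrix at v = 32768.
full_matrix_bytes = rows_bytes(32768, 8, rows=4)
print(full_matrix_bytes, full_matrix_bytes / L2_CACHE_BYTES)
```

Both two-row figures equal the L2 size exactly, while the full [4xv] double matrix at v = 32768 is already twice the L2 size, so the "two rows" part of the prediction is the piece that would need to be confirmed by profiling.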

My second question is about the limit on the maximum number of threads that can execute at a time.

(2048 threads/SM * 5 SMs = 10240 threads; if each thread is responsible for computing 1 output element of [4xv], that means v = 10240/4 = 2560!)
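The calculation above can be written out as a short Python sketch. It assumes, as the question does, one thread per output element, which is not necessarily how dgemm/sgemm map work to threads. Note also that 10240 is the maximum number of threads resident on the GPU at one time, not a cap on the total threads in a launch: a larger grid still runs, with new blocks scheduled as earlier ones finish. The "waves" count below is only a rough way to picture that, and the function name `waves` is hypothetical.

```python
# Values taken from the deviceQuery output for the GTX 1050.
THREADS_PER_SM = 2048   # "Maximum number of threads per multiprocessor"
NUM_SMS = 5             # "( 5) Multiprocessors"
ROWS = 4                # rows of the [4xv] output matrix

# Upper bound on threads resident at once (real kernels may get fewer,
# depending on their register and shared memory usage).
resident_threads = THREADS_PER_SM * NUM_SMS

# Largest v that fits in a single wave under the one-thread-per-element assumption.
v_one_wave = resident_threads // ROWS

def waves(v):
    """Rounds of resident threads needed for a [4xv] output (integer ceiling)."""
    total_threads = ROWS * v
    return (total_threads + resident_threads - 1) // resident_threads

print(resident_threads, v_one_wave)
print(waves(2560), waves(2561), waves(32768))
```

Under these assumptions v = 2560 fills the GPU exactly once, and v = 32768 would need 13 rounds, so the resident-thread limit on its own does not single out 32768 as a special value.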

I'm confused by these calculations. Any help, please…

Device 0: “GeForce GTX 1050”

CUDA Driver Version / Runtime Version 9.0 / 9.0

CUDA Capability Major/Minor version number: 6.1

Total amount of global memory: 4096 MBytes (4294967296 bytes)

( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores

GPU Max Clock rate: 1493 MHz (1.49 GHz)

Memory Clock rate: 3504 Mhz

Memory Bus Width: 128-bit

L2 Cache Size: 524288 bytes

Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)

Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers

Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 49152 bytes

Total number of registers available per block: 65536

Warp size: 32

Maximum number of threads per multiprocessor: 2048

Maximum number of threads per block: 1024