Non-square matrix multiplication

Hi everyone, I'm having problems understanding matrix multiplication on non-square matrices.

I am using dgemm and sgemm on my GPU to multiply a [4x4] matrix by a [4xv] matrix, where v ranges from 1000 to 50000000.

For all v values less than 32768 the execution time is about 0.25 ms, whereas if v is 32768+1 it takes about 0.445 ms. This is for double precision.
Is that a limit of the L2 cache when storing two rows of 32768 doubles (2 * 32768 * 8 = 524288 bytes), and two rows of 65536 floats (2 * 65536 * 4 = 524288 bytes) for single precision?

I wonder if I'm right in these predictions?

The second question is about the limit on the max. number of threads that can execute at a time:
(2048 threads/SM * 5 SMs = 10240 threads; if each thread is responsible for computing 1 output element of [4xv], that means v = 10240/4 = 2560!)
I'm confused by these calculations. Any help, please…

Device 0: "GeForce GTX 1050"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 4096 MBytes (4294967296 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1493 MHz (1.49 GHz)
Memory Clock rate: 3504 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024

I don’t think there’s anything special about 2 rows. I’m pretty sure the matrix would not be divided that way. What seems likely to me is that cuBLAS, under the hood, is switching to a different kernel sequence for the different matrix size. You could help confirm this by using a GPU profiler and comparing which kernels are launched for v just below and just above 32768.
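To put numbers on the cache idea, here is a rough footprint calculation in Python. It is only a sketch: the helper names are made up for illustration, and it assumes dense storage with no padding. The two-row figure does match the L2 size at v = 32768, but the whole of B in double precision already reaches 524288 bytes at v = 16384, so the operands are well past L2 before 32768 either way.

```python
L2_BYTES = 524288  # "L2 Cache Size" from the deviceQuery output in the question

def two_row_bytes(v, elem_size):
    """Bytes in two rows of length v (the quantity computed in the question)."""
    return 2 * v * elem_size

def operand_bytes(v, elem_size):
    """Bytes for A [4x4], B [4xv] and C [4xv] together, assuming no padding."""
    a = 4 * 4 * elem_size
    b = 4 * v * elem_size
    c = 4 * v * elem_size
    return a + b + c

if __name__ == "__main__":
    for name, elem_size in (("double", 8), ("float", 4)):
        for v in (16384, 32768, 65536):
            print(name, v,
                  "two rows:", two_row_bytes(v, elem_size),
                  "B alone:", 4 * v * elem_size,
                  "A+B+C:", operand_bytes(v, elem_size),
                  "L2:", L2_BYTES)
```

This does not explain the timing jump by itself; it only shows that the "two rows fill L2" coincidence is not the point where the data stops fitting in cache.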

It’s not even clear that you’ve actually stated a question in the second case. Questions usually end in a question mark. However, you probably have some misconception about the GPU computing model. It’s true that the maximum number of threads that can be resident on the GPU at any one time is given by your equation, but that doesn’t mean that a GPU code can’t consist of more threads. As old threads finish/retire, new threads can start.
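As a rough illustration of that last point, the arithmetic can be sketched in Python. It assumes the one-thread-per-output-element mapping from the question, which is not necessarily how cuBLAS assigns work, and it ignores register and shared-memory limits that can lower the resident count. The function names are hypothetical.

```python
SM_COUNT = 5               # "( 5) Multiprocessors" from deviceQuery
MAX_THREADS_PER_SM = 2048  # "Maximum number of threads per multiprocessor"

def resident_threads():
    """Upper bound on threads resident on the whole GPU at once."""
    return SM_COUNT * MAX_THREADS_PER_SM

def waves_needed(v, rows=4):
    """Number of 'full GPU' batches needed if each thread computes one
    element of the [rows x v] output (the question's assumption)."""
    total_threads = rows * v
    limit = resident_threads()
    return -(-total_threads // limit)  # integer ceiling division

if __name__ == "__main__":
    print("resident limit:", resident_threads())
    for v in (2560, 2561, 32768, 50000000):
        print("v =", v, "-> waves:", waves_needed(v))
```

So v = 2560 is only the size at which a single batch of resident threads would cover the output under that assumption; larger v just means more batches scheduled one after another, not a hard limit.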