Hi everyone, I'm having trouble understanding matrix multiplication performance on non-square matrices.
I am using dgemm and sgemm on my GPU to multiply a [4x4] matrix by a [4xv] matrix, where v ranges from 1000 to 50000000.
For all v values less than 32768 the execution time is about 0.25 ms, whereas at v = 32768 + 1 it takes about 0.445 ms. This is for double precision.
Is this a limit of the L2 cache, reached when storing two rows of 32768 doubles (2 * 32768 * 8 = 524288 bytes), or two rows of 65536 floats for single precision (2 * 65536 * 4 = 524288 bytes)?
I wonder if I'm right in these predictions?
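To make the cache arithmetic above easy to check, here is a minimal sketch in plain Python (no GPU needed). It only reproduces the byte counts and compares them with the "L2 Cache Size" reported by deviceQuery. The helper name `rows_bytes` is made up for illustration, and the last figure (all four rows of the [4xv] input) is included only for comparison; it says nothing about how the library actually tiles or caches the data.

```python
L2_CACHE_BYTES = 524288  # "L2 Cache Size" from the deviceQuery output

def rows_bytes(v, elem_size, rows=2):
    """Bytes occupied by `rows` rows of `v` elements of `elem_size` bytes."""
    return rows * v * elem_size

# Two rows at the observed double-precision threshold (8-byte elements).
double_two_rows = rows_bytes(32768, 8)

# Two rows at the corresponding single-precision size (4-byte elements).
single_two_rows = rows_bytes(65536, 4)

# For comparison: all four rows of the [4xv] input in double precision.
double_full_input = rows_bytes(32768, 8, rows=4)

print("two rows, double:", double_two_rows, double_two_rows == L2_CACHE_BYTES)
print("two rows, single:", single_two_rows, single_two_rows == L2_CACHE_BYTES)
print("full 4 x v input, double:", double_full_input)
```

Both two-row figures equal the L2 size exactly, which is the coincidence the question is about. The full four-row input at v = 32768 is already twice the L2 size, so the two-row figure only matters if the kernel really works on two rows at a time, which is an assumption.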
My second question is about the limit on the maximum number of threads that can execute at one time.
2048 threads/SM * 5 SMs = 10240 threads. If each thread is responsible for computing one output element of the [4xv] result, that means v = 10240 / 4 = 2560!
I'm confused by these calculations. Any help, please…
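The thread arithmetic above can be sketched the same way. This is a simplified model that assumes one thread per output element and that all 2048 threads per SM are resident at once; a real dgemm/sgemm kernel may assign several outputs to each thread, so the numbers are only a rough guide. The name `waves` is hypothetical.

```python
MAX_THREADS_PER_SM = 2048  # "Maximum number of threads per multiprocessor"
SM_COUNT = 5               # "( 5) Multiprocessors"
ROWS = 4                   # rows in the [4xv] result

# Upper bound on threads resident on the whole GPU at one time.
resident_threads = MAX_THREADS_PER_SM * SM_COUNT

# Largest v whose outputs fit in one "wave" under the one-thread-per-output assumption.
v_one_wave = resident_threads // ROWS

def waves(v):
    """Number of full-GPU waves needed for ROWS * v outputs (ceiling division)."""
    outputs = ROWS * v
    return -(-outputs // resident_threads)

print("resident threads:", resident_threads)
print("v for a single wave:", v_one_wave)
print("waves at v = 32768:", waves(32768))
print("waves at v = 32769:", waves(32769))
```

Under this model the wave count is the same on both sides of v = 32768, so the resident-thread limit by itself would not account for the jump in time seen there.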
Device 0: “GeForce GTX 1050”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 4096 MBytes (4294967296 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1493 MHz (1.49 GHz)
Memory Clock rate: 3504 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024