matrix mul diagram understanding

A few weeks ago I started coding with CUDA.
First I implemented matrix multiplication with shared memory from the "CUDA Programming Guide".
I've made a diagram illustrating the time needed to compute the multiplication for matrix sizes from 16x16 to 2100x2100.
The red line is the CPU (one core working), the blue one is the GPU.
I don't understand all the relations in this diagram, so I have a few questions:

We get the lowest times when the matrix size is divisible by 16 (for example 2048 - 1213.72 ms, 2032 - 1157.13 ms).
Is this connected with the half-warp, or with the thread dimensions in the block (16x16)?

We also get good times when the size is:
divisible by 8, but not 16 (2040 - 1345 ms)
divisible by 4, but not 8 or 16 (2044 - 1647 ms)
divisible by 2, but not 4, 8, or 16 (2046 - 1918 ms)
Why is divisibility by 2, 4, and 8 also connected with the calculation time?

Sizes close to 128x + 1 are the worst choice. Why is that?
For example:
1920 - 997 ms
1936 - 996 ms
Both are divisible by 16 and 1936 is bigger, yet it gets the lower time, because 1920 sits right next to 1921 (128x + 1)!

The differences get bigger with bigger sizes.
For example:
2048 - 1213 ms
2049 - 2097 ms
That is a 72% increase.
1048 - 190 ms
1049 - 255 ms
That is only a 34% increase.
Why is that?

My hardware is a Dell XPS M1530 (T9300 @ 2.5 GHz, GeForce 8600M GT).

I'll be very thankful for any help.

A few example values (size, time in ms):
2032 1157.128052
2033 1818.431641
2034 1740.122925
2035 1835.98584
2036 1534.831543
2037 1844.002319
2038 1762.959106
2039 1824.617798
2040 1345.177979
2041 1865.116821
2042 1795.365356
2043 1940.10791
2044 1647.79248
2045 2009.611938
2046 1918.764771
2047 2051.450195
2048 1213.723999
2049 2097.791016


Hi there, I'm fairly new to CUDA myself and no expert yet, but yes, this is primarily due to the size fitting a half or full warp. Further subdivisions thereof (8, 4, 2, ...) similarly allow for more processing per warp. The other issues, such as the % increase with size, are more likely to come from memory management: the amount of available shared memory (if you are using it) and the number of registers. But that's just a guess.

First of all, thanks for your reply!

I don't think the matrix size influences the timing through memory management.

We always have the same number of registers and the same amount of shared and constant memory per multiprocessor, independently of the matrix size.

It is always: "Used 12 registers, 2112+1088 bytes smem, 8 bytes cmem[1]"

Only the global memory usage grows, but how could that influence the timing?

You said this is primarily due to the size fitting a half or full warp. Can you tell me why?

This code always launches the kernel with 256 threads (16x16) per block, independently of the size. Why would that be a bad fit?

I would be very grateful for any advice!


I have made another diagram, so maybe it will help you guys answer my questions.

Thanks for any suggestions!