Hi

A few weeks ago I started programming with CUDA.

First, I implemented matrix multiplication with shared memory from the “CUDA Programming Guide”.
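For reference, the kernel is essentially this one (a sketch of the guide’s shared-memory example, written from memory; it assumes N is a multiple of BLOCK_SIZE = 16 and the variable names are mine):

```cuda
#define BLOCK_SIZE 16

// C = A * B for square N x N matrices, N assumed a multiple of BLOCK_SIZE.
__global__ void matMulShared(const float *A, const float *B, float *C, int N)
{
    // Each block computes one BLOCK_SIZE x BLOCK_SIZE tile of C.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float acc = 0.0f;

    // March over the tiles of A's row band and B's column band.
    for (int t = 0; t < N / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * N + col];
        __syncthreads();                 // wait until the tile is fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // wait until everyone is done reading it
    }
    C[row * N + col] = acc;
}
```

Note that this version has no boundary checks, so sizes that are not multiples of 16 need padding or an `if` guard in the loads.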

I made a plot showing the time needed to multiply matrices of sizes from 16x16 up to 2100x2100.

The red line is the CPU (a single core working), the blue one is the GPU.

I don’t understand all the relationships in this plot, so I have a few questions:

We get the lowest times when the matrix size is divisible by 16 (for example 2048: 1213.723999 ms, 2032: 1157.128052 ms).

Is this connected with the half-warp size, or with the thread block dimensions (16x16)?

We also get good times when the size is:

divisible by 8 but not 16 (2040: 1345 ms),

divisible by 4 but not 8 (2044: 1647 ms),

divisible by 2 but not 4 (2046: 1918 ms).

Why does divisibility by 2, 4, and 8 also affect the computation time?

Sizes close to 128*x + 1 are the worst choice. Why is that?
For example:
1920: 997 ms
1936: 996 ms
Both are divisible by 16 and 1936 is larger, yet it gets the lower time, apparently because 1920 is close to 1921 = 128*15 + 1.

The differences get bigger for larger sizes.

For example:

2048: 1213 ms

2049: 2097 ms

That is about 73% growth.

1048: 190 ms

1049: 255 ms

That is only about 34% growth.

Why is that?

My hardware: Dell XPS M1530 (Core 2 Duo T9300 @ 2.5 GHz, GeForce 8600M GT).

I’ll be very thankful for any help.

A few sample values (size, time in ms):

2032  1157.128052
2033  1818.431641
2034  1740.122925
2035  1835.985840
2036  1534.831543
2037  1844.002319
2038  1762.959106
2039  1824.617798
2040  1345.177979
2041  1865.116821
2042  1795.365356
2043  1940.107910
2044  1647.792480
2045  2009.611938
2046  1918.764771
2047  2051.450195
2048  1213.723999
2049  2097.791016