I’m working on DNN optimization, and most of the work is matrix multiplication.

I tested square matrices of different sizes with cublasSgemm().

I ran the tests on a GTX 1080 board with CUDA 8.0, and I found that performance varies with the matrix size N.

When N < 512, running 1000 multiplications of N x N matrices takes 0 to 3 ms, and the time increases smoothly with N.

But when N = 513, the time jumps to 80 ms. After that, each time N increases by about 100, the time jumps to a new, higher level.

1. What determines the critical matrix size for cublasSgemm() performance? Device memory or compute unit resources?

2. How should I tune the matrix size for the best performance?

3. When using multiple streams, how should I tune the matrix size to make sure that sgemm calls in different streams run concurrently on one GPU card?

Please excuse my poor English.

Thanks.

Here is my test result table: