I didn't find a solution, so I measured it another way: I timed the total GPU execution and ran the same sgemm 10,000 times with the same input and output arrays (to amortize the I/O time) on 4096x4096 matrices, which came out to about 2 ms per sgemm call. That's a very nice result, but it doesn't look realistic. Could you explain where the problem is?
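For reference, here is roughly how I'm timing it. This is a sketch assuming cuBLAS; one thing I'm not sure about is the synchronization, since I understand kernel launches are asynchronous and the timer might otherwise only measure launch overhead:

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;        // matrix dimension 4096x4096
    const int iters = 10000;   // repeat to amortize one-time costs
    float *dA, *dB, *dC;
    cudaMalloc(&dA, (size_t)n * n * sizeof(float));
    cudaMalloc(&dB, (size_t)n * n * sizeof(float));
    cudaMalloc(&dC, (size_t)n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so one-time initialization is excluded from the timing.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        // Same input and output arrays on every iteration.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
    }
    cudaEventRecord(stop);
    // Without this sync the CPU-side timer stops before the GPU has
    // actually finished the queued sgemm calls.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.3f ms per sgemm\n", ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

As a sanity check: one 4096x4096 sgemm is about 2 * 4096^3 ≈ 137 GFLOP, so 2 ms per call would imply roughly 69 TFLOPS of sustained FP32 throughput, which is how I concluded the number doesn't look realistic for my card.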