BUG? Shared memory usage in matrixMul

matrixMul, C = A * B
A: 100 * 50
B: 50 * 50
Language   with Shared (ms)   no Shared (ms)
OpenCL     90.5               78.0
CUDA       14.6               78.2

It is really surprising to see 90.5 ms for oclMatrixMul with shared memory: that is slower than the version that does not use shared memory at all, while in CUDA the shared-memory version is more than five times faster. The program produces correct results, so how should this number be interpreted?
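For readers who want something concrete to compare against, here is a minimal sketch of the usual tiled shared-memory pattern for C = A * B in CUDA. It is not the SDK sample itself: the tile width `TILE_MM`, the kernel name `matMulShared`, and the host wrapper `matMulSharedHost` are illustrative assumptions, and it assumes row-major float matrices.

```cuda
#include <cuda_runtime.h>

#define TILE_MM 16  // assumed tile width, not taken from the measurements above

// C (M x N) = A (M x K) * B (K x N), all row-major.
__global__ void matMulShared(const float* A, const float* B, float* C,
                             int M, int K, int N)
{
    __shared__ float As[TILE_MM][TILE_MM];
    __shared__ float Bs[TILE_MM][TILE_MM];

    int row = blockIdx.y * TILE_MM + threadIdx.y;
    int col = blockIdx.x * TILE_MM + threadIdx.x;
    float acc = 0.0f;

    // Walk over the K dimension one tile at a time.
    for (int t = 0; t < (K + TILE_MM - 1) / TILE_MM; ++t) {
        int aCol = t * TILE_MM + threadIdx.x;
        int bRow = t * TILE_MM + threadIdx.y;
        // Zero-pad the parts of a tile that fall outside the matrices.
        As[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE_MM; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done before the tile is overwritten
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}

// Hypothetical host wrapper: copy inputs, launch, copy the result back.
void matMulSharedHost(const float* hA, const float* hB, float* hC,
                      int M, int K, int N)
{
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(float) * M * K);
    cudaMalloc((void**)&dB, sizeof(float) * K * N);
    cudaMalloc((void**)&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, hA, sizeof(float) * M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(float) * K * N, cudaMemcpyHostToDevice);

    dim3 block(TILE_MM, TILE_MM);
    dim3 grid((N + TILE_MM - 1) / TILE_MM, (M + TILE_MM - 1) / TILE_MM);
    matMulShared<<<grid, block>>>(dA, dB, dC, M, K, N);

    cudaMemcpy(hC, dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

When the matrix sizes are not multiples of `TILE_MM`, the bounds checks and zero padding are what keep the result correct; an OpenCL kernel would follow the same shape with `__local` arrays and `barrier(CLK_LOCAL_MEM_FENCE)`.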

I also tested transpose, A = A^T, and the results look normal.
A: 4096 * 256
Language   with Shared (ms)   no Shared (ms)
OpenCL     0.43               2.39
CUDA       0.348              2.491
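A shared-memory transpose typically stages a tile so that both the read and the write to global memory stay coalesced, which is consistent with the speedup in this table. The following is only a sketch under assumptions: it is out-of-place, uses row-major floats, and the names `TILE_TR`, `transposeShared`, and `transposeSharedHost` are illustrative, not from the SDK sample.

```cuda
#include <cuda_runtime.h>

#define TILE_TR 16  // assumed tile width

// out (cols x rows) = transpose of in (rows x cols), both row-major.
__global__ void transposeShared(const float* in, float* out, int rows, int cols)
{
    // One extra column of padding to avoid shared-memory bank conflicts.
    __shared__ float tile[TILE_TR][TILE_TR + 1];

    int x = blockIdx.x * TILE_TR + threadIdx.x;  // input column
    int y = blockIdx.y * TILE_TR + threadIdx.y;  // input row
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    // Swap the block indices so that writes are contiguous in out.
    x = blockIdx.y * TILE_TR + threadIdx.x;      // output column
    y = blockIdx.x * TILE_TR + threadIdx.y;      // output row
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host wrapper: copy input, launch, copy the result back.
void transposeSharedHost(const float* hIn, float* hOut, int rows, int cols)
{
    float *dIn, *dOut;
    size_t bytes = sizeof(float) * rows * cols;
    cudaMalloc((void**)&dIn, bytes);
    cudaMalloc((void**)&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_TR, TILE_TR);
    dim3 grid((cols + TILE_TR - 1) / TILE_TR, (rows + TILE_TR - 1) / TILE_TR);
    transposeShared<<<grid, block>>>(dIn, dOut, rows, cols);

    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}
```

For the 4096 * 256 case, the call would be `transposeSharedHost(hIn, hOut, 4096, 256)`.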

I also ran another test with oclMatrixMul. Given matrix A, I increment it iNumber times, that is “for(i=0;i<iNumber;i++) A++;”. These results show nothing strange either.
iNumber   with Shared (ms)   no Shared (ms)
1         0.01471            0.00675
10        0.02555            0.01168
100       0.09107            0.07662
1000      0.73986            0.72410
10000     7.22309            7.20312
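One possible reading of that loop, written as CUDA for illustration (the measured kernel may differ): each thread increments its own element iNumber times, either in a shared-memory slot or directly in global memory. The kernel names, the block size `BLOCK_ADD`, the float element type, and the host wrapper `addLoopHost` are all assumptions.

```cuda
#include <cuda_runtime.h>

#define BLOCK_ADD 256  // assumed threads per block

// Variant using shared memory: stage the element, increment, write back.
__global__ void addLoopShared(float* A, int n, int iNumber)
{
    __shared__ float s[BLOCK_ADD];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        s[threadIdx.x] = A[idx];
        for (int i = 0; i < iNumber; ++i)
            s[threadIdx.x] += 1.0f;   // each thread touches only its own slot
        A[idx] = s[threadIdx.x];
    }
}

// Variant without shared memory: increment the global element directly.
__global__ void addLoopGlobal(float* A, int n, int iNumber)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        for (int i = 0; i < iNumber; ++i)
            A[idx] += 1.0f;
}

// Hypothetical host wrapper: runs one of the two variants in place on hA.
void addLoopHost(float* hA, int n, int iNumber, bool useShared)
{
    float* dA;
    cudaMalloc((void**)&dA, sizeof(float) * n);
    cudaMemcpy(dA, hA, sizeof(float) * n, cudaMemcpyHostToDevice);

    int grid = (n + BLOCK_ADD - 1) / BLOCK_ADD;
    if (useShared)
        addLoopShared<<<grid, BLOCK_ADD>>>(dA, n, iNumber);
    else
        addLoopGlobal<<<grid, BLOCK_ADD>>>(dA, n, iNumber);

    cudaMemcpy(hA, dA, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaFree(dA);
}
```

In a kernel of this shape every element is read once and written once, so staging it in shared memory offers no data reuse; that would be consistent with the two columns converging as iNumber grows.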

So can anyone explain this “bug”, or is it something related to the driver? My platform is as follows: Tesla C1060, VS2005, Windows XP 64-bit, driver 190.90.