I am working on applying CUDA to image processing, and recently I ran into something rather confusing.
Environment: CUDA 1.0, GeForce 8800 GTS, 4 GB memory, Core 2 Duo 6500E, 162.01_int driver, Windows XP, DX9 (though I think DX has nothing to do with it).
Shared memory usage: ~15 KB per block.
In my program I have to run a function 74 times, and I wrote the call out 74 times rather than using a for loop.
I timed a single run and it took only 0.041 ms, which is pretty good, and the first 20 runs cost only 1-2 ms. GOOD.
40 runs: 14 ms.
But 74 runs cost 53 ms! That is by no means fast compared to the CPU.
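To test whether the launches are just being queued asynchronously and my timer is stopping too early, I could compare the time measured right after the launch loop against the time after an explicit cudaThreadSynchronize(). A minimal sketch with a dummy kernel (not my real code; CUDA 1.x runtime API assumed):

// Kernel launches return to the CPU immediately; the driver queues them.
// A host timer stopped right after the launches therefore measures mostly
// launch overhead -- until the queue fills up and launches start to block.
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *d)      // stands in for the real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d[i] += 1.0f;
}

int main(void)
{
    float *d_data;
    cudaMalloc((void **)&d_data, 64 * 256 * sizeof(float));
    cudaMemset(d_data, 0, 64 * 256 * sizeof(float));

    clock_t t0 = clock();
    for (int i = 0; i < 74; ++i)
        dummyKernel<<<64, 256>>>(d_data);  // returns almost immediately
    clock_t t1 = clock();                  // kernels may still be running!

    cudaThreadSynchronize();               // block until all 74 launches finish
    clock_t t2 = clock();

    printf("right after launches: %.1f ms\n",
           1000.0 * (t1 - t0) / CLOCKS_PER_SEC);
    printf("after synchronize   : %.1f ms\n",
           1000.0 * (t2 - t0) / CLOCKS_PER_SEC);

    cudaFree(d_data);
    return 0;
}

If the first number is tiny and the second matches my 53 ms, the early "fast" runs were never really finishing inside the measured interval.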
A similar problem occurs in the SDK “transpose” example:
iterations : naive (ms) : optimized (ms)
  1 : 0.580 : 0.011
  2 : 0.292 : 0.011
  8 : 0.081 : 0.012   // still the 10x speedup the introduction mentions
  9 : 0.072 : 0.435   // !
 16 : 0.046 : 3.553   // not only unstable, but much slower
As the iteration count increases further, the timings stay unstable and the performance loss gets huge:
 40 : 2.207 : 1.895
100 : 3.156 : 1.051
200 : 3.473 : 0.767   // optimized ~70x slower, naive ~6x slower than at the start
I don't know what causes this. Does the compiler hide the memory transfer time? Is CPU-GPU communication costing the speed?
Has anybody run the transpose example and gotten better results? Which speed is the real one? Can the lost performance be recovered by changing settings, or by moving to CUDA 1.1 and a later driver? Thanks for any reply.
At the very least, the timing function may not be reliable enough to estimate the speed of your program.
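If the timer itself is the problem, a pattern like this might give more stable numbers: warm up once, synchronize before starting the timer, launch many times, synchronize again before stopping, and report the average. Again just a sketch, reusing the dummy kernel from the block above:

// Averaged timing with explicit synchronization (CUDA 1.x API assumed).
// The warm-up launch absorbs one-time initialization costs; the two
// synchronize calls keep queued work out of -- and fully inside -- the
// measured interval.
void timeKernel(float *d_data, int n)
{
    dummyKernel<<<64, 256>>>(d_data);   // warm-up, not measured
    cudaThreadSynchronize();            // drain the queue before timing

    clock_t t0 = clock();
    for (int i = 0; i < n; ++i)
        dummyKernel<<<64, 256>>>(d_data);
    cudaThreadSynchronize();            // wait for all n launches to finish
    clock_t t1 = clock();

    printf("average per launch: %.3f ms\n",
           1000.0 * (t1 - t0) / CLOCKS_PER_SEC / n);
}

One caveat: clock() is coarse on Windows, so measure over many iterations; as far as I know the SDK's cutil timer uses QueryPerformanceCounter underneath and gives sub-millisecond resolution, which matters when timing a single launch.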