Please explain the output graph that I got!

I am new to CUDA. I wanted to test some code and run it on a GPU, so I took a sample program that I found online. The program reports its execution time on the GPU. I then launched the kernel with different combinations of number of blocks and number of threads per block, under the condition that their product is at most 512 (number of blocks * number of threads <= 512).

I ran the program and I am attaching the output.

( the output format is )

Then I plotted the output file in MATLAB. The graph was not what I expected…

There are peaks (major fluctuations) at many places in the graph, plus minor fluctuations (which can be ignored, I guess).

My goal was to find the optimum combination of threads and blocks for the program (I want to do the same for my project some time later).

By the way, I am connected to a remote system that has the GPU in it!

I am attaching the code, plus the output file.

My question is:

1. Can anyone explain why the graph looks this way (peaks at many places)?
output.txt (43.3 KB) (5.9 KB)

After eyeballing your time values, they all seem to be about the same (little variance), which is unexpected: 512 blocks with 1 thread per block should be awfully slow, unless you have 512 multiprocessors.
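To make the "awfully slow" point concrete, here is a rough sketch of how much of each warp does useful work for the block sizes in the sweep. It assumes 32 threads per warp, which holds for NVIDIA GPUs to date; `activeLaneFraction` and `kWarpSize` are names made up for this illustration, and the arithmetic ignores occupancy, memory access patterns, and everything else that affects real run time.

```cuda
#include <cstdio>

// Assumption: 32 threads per warp (true for NVIDIA GPUs to date).
constexpr int kWarpSize = 32;

// Fraction of warp lanes that carry a real thread for a given block size.
// A block is executed in whole warps, so a 1-thread block still occupies
// a full 32-lane warp with 31 lanes idle.
double activeLaneFraction(int threadsPerBlock)
{
    int warps = (threadsPerBlock + kWarpSize - 1) / kWarpSize;  // round up
    return static_cast<double>(threadsPerBlock) / (warps * kWarpSize);
}

int main()
{
    const int total = 512;  // blocks * threads, as in the question
    for (int threads = 1; threads <= total; threads *= 2) {
        int blocks = total / threads;
        printf("%3d blocks x %3d threads: %5.1f%% of warp lanes active\n",
               blocks, threads, 100.0 * activeLaneFraction(threads));
    }
    return 0;
}
```

With 1 thread per block only 1 lane in 32 is active, so if the measured times barely differ between that case and larger blocks, the timer is probably not measuring the kernel itself.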

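Your timing results are incorrect because you didn't insert cudaThreadSynchronize() before starting and stopping the timer. Kernel launches return to the host immediately, so without the synchronization the timer mostly measures launch overhead, not the kernel.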
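A minimal sketch of the synchronized timing pattern follows. It is not the code from the attachment: `dummyKernel` and `timeLaunchMs` are hypothetical stand-ins, and the host clock is just one way to time. It uses `cudaDeviceSynchronize()`, which replaces the now-deprecated `cudaThreadSynchronize()` in current CUDA toolkits.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being measured.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 2.0f + 1.0f;
    }
}

// Times one launch with a host clock, synchronizing before the timer
// starts and again before it stops.
double timeLaunchMs(float *d_data, int n, int blocks, int threads)
{
    cudaDeviceSynchronize();  // make sure earlier GPU work has finished
    auto start = std::chrono::steady_clock::now();
    dummyKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();  // wait until the kernel has actually completed
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    const int n = 512;  // blocks * threads, as in the question
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch: the first CUDA work also pays one-time setup cost.
    timeLaunchMs(d_data, n, 1, n);

    for (int threads = 1; threads <= n; threads *= 2) {
        int blocks = n / threads;
        double ms = timeLaunchMs(d_data, n, blocks, threads);
        printf("%3d blocks x %3d threads: %.4f ms\n", blocks, threads, ms);
    }

    cudaFree(d_data);
    return 0;
}
```

A single launch this small is dominated by launch overhead and noise, so repeating each configuration several times and averaging may smooth out some of the peaks.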