Vector addition on 8600M GT: explanation

Hello, I'm trying to write a small program in CUDA, but I can't find an explanation for these results.

[Image: plot of execution time in seconds (Y axis) against vector size (X axis)]

Can you help me explain why the performance only increases after 2^25? And why is 2^27 computed faster than 2^25? My graphics card is an 8600M GT. It's a simple vector addition, with the vector size on the X axis.

Thank you :( I'm French, so excuse my English…

What is the memory size of your 8600 GT?

512 MB. Why?

My program is configured to run with 512 threads per block, and the number of blocks depends on the vector size: the larger the vector, the more blocks there are. It's something like this (I don't have it in front of me right now): (vector_size+block_number)/block_number
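
For reference, the usual way to size a 1D grid is a rounded-up division of the vector size by the number of threads per block. Below is a minimal host-side sketch, assuming that this is what the formula above is meant to do; `blocks_for` is a hypothetical helper name, not something taken from the original program.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest number of blocks such that
// blocks * threads_per_block >= vector_size (rounded-up division).
inline std::uint64_t blocks_for(std::uint64_t vector_size,
                                std::uint64_t threads_per_block) {
    assert(threads_per_block > 0);  // avoid division by zero
    return (vector_size + threads_per_block - 1) / threads_per_block;
}
```

With 512 threads per block, `blocks_for(1000, 512)` gives 2 blocks and `blocks_for(1025, 512)` gives 3. Since the last block is usually only partly used, the kernel needs a bounds check such as `if (i < n)` before touching element `i`.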

I suspect a bug in the timing code. First, the time (it's time on the y axis, right?) should obviously scale linearly with the number of elements.
Also, when you try to launch the kernel for more than about 2^25 elements, you hit the maximum number of blocks in a 1D grid. Your kernels fail to launch, that's why they're fast ;) You have 512 threads per block, and the maximum grid size is 65535 x 65535 blocks, so a 1D grid covers at most 65535*512 elements, which is just under 2^25.
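
The limit can be checked on the host before launching. This is a sketch only: it assumes one thread per element and a per-dimension grid limit of 65535 blocks, which is the documented value for GPUs of this generation (on a real device it can be read from `cudaDeviceProp::maxGridSize`). The names `kMaxGridDimX`, `max_elements_1d` and `fits_in_1d_grid` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: per-dimension grid limit for this GPU generation.
// On real hardware, query cudaDeviceProp::maxGridSize instead.
const std::uint64_t kMaxGridDimX = 65535;

// Largest element count a 1D grid can cover with one thread per element.
inline std::uint64_t max_elements_1d(std::uint64_t threads_per_block) {
    return kMaxGridDimX * threads_per_block;
}

// True if a 1D launch with one thread per element stays within the limit.
inline bool fits_in_1d_grid(std::uint64_t vector_size,
                            std::uint64_t threads_per_block) {
    assert(threads_per_block > 0);  // avoid division by zero
    const std::uint64_t blocks =
        (vector_size + threads_per_block - 1) / threads_per_block;
    return blocks <= kMaxGridDimX;
}
```

With 512 threads per block, `max_elements_1d(512)` is 33553920, so 2^24 elements fit but 2^27 do not. Calling `cudaGetLastError()` right after the kernel launch would report a launch that was rejected for this reason, instead of letting it look like a very fast run.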

Vector addition is very likely to be slower on a GPU. I've written why here: http://forums.nvidia.com/index.php?s=&showtopic=156070&view=findpost&p=980566

How do you do the data transfer between device and host?

(If you do the math, with floats the 512 MB of memory is full at around 2^26 elements, which means that the GPU cannot store all the data and has to exchange it frequently over PCI Express, which is very slow.)
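
The arithmetic behind that estimate can be sketched as follows. It assumes two input vectors and one output vector of `float`, all resident on the device at once, and it ignores memory already taken by the display and the driver. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of device memory for c = a + b on floats, assuming all three
// vectors (a, b and c) are allocated on the GPU at the same time.
inline std::uint64_t vector_add_bytes(std::uint64_t vector_size) {
    const std::uint64_t bytes_per_element = 3 * sizeof(float);
    assert(vector_size <= UINT64_MAX / bytes_per_element);  // no overflow
    return vector_size * bytes_per_element;
}

// True if the three vectors fit in the given amount of device memory.
inline bool fits_in_memory(std::uint64_t vector_size,
                           std::uint64_t device_bytes) {
    return vector_add_bytes(vector_size) <= device_bytes;
}
```

For 2^25 elements this gives 384 MB, which fits in 512 MB; for 2^26 elements it gives 768 MB, which does not. In a real program it is worth checking the value returned by `cudaMalloc`, which reports an error when an allocation does not fit.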

Thank you for your responses :)

Yes, Y = the time in s.

I tried CUDA for a school project. So what kind of subject would you advise that is not too difficult to implement?