Can you help me understand why the performance only increases after 2^25? And why is 2^27 computed faster than 2^25? My graphics card is an 8600M GT. It's a simple vector addition, with the vector size on the x axis.
Thank you :( I'm French, so excuse my English…
My program is configured to run with 512 threads per block. The number of blocks grows with the size of the vector. It is something like this (I don't have the code with me right now): (vector_size+block_number)/block_number
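Since the formula above is quoted from memory, here is only a sketch of the usual way to size a 1D launch: a ceiling division of the element count by the threads per block. The name `blocks_needed` is made up for illustration and is not from the original program; this is plain host-side C++ with no CUDA calls.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest number of blocks whose threads cover n elements.
// (n + t - 1) / t rounds up, so a partially filled last block is still counted.
std::uint64_t blocks_needed(std::uint64_t n, std::uint64_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With 512 threads per block, 1000 elements need 2 blocks and 2^25 elements need 65536 blocks. A variant that adds the full block size instead of block size minus one launches one extra block when the size is an exact multiple, which is harmless as long as the kernel checks its index against the vector size.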
I suspect a bug in the timing code. First, the time (it's time on the y axis, right?) should obviously scale linearly with the number of elements.
Also, when you try to launch the kernel for more than 2^25 elements, you hit the maximum number of blocks in a 1D grid. Your kernels fail to launch, that's why they're fast ;) You have 512 threads per block, and the maximum number of blocks along one grid dimension on this hardware is 65535 (65535 x 65535 for a 2D grid). 65535*512 is just under 2^25.
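To make the limit concrete, here is a small host-side sketch that checks whether a vector of a given size still fits in a 1D grid. It assumes the per-dimension limit of 65535 blocks that applies to compute capability 1.x GPUs such as the 8600M GT (newer GPUs allow many more blocks in x). The names `kMaxBlocksPerGridDim` and `fits_in_1d_grid` are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: per-dimension grid limit on compute capability 1.x hardware.
const std::uint64_t kMaxBlocksPerGridDim = 65535;

// True if a 1D grid with threads_per_block threads per block can cover n elements.
bool fits_in_1d_grid(std::uint64_t n, std::uint64_t threads_per_block) {
    assert(threads_per_block > 0);
    std::uint64_t blocks = (n + threads_per_block - 1) / threads_per_block;
    return blocks <= kMaxBlocksPerGridDim;
}
```

With 512 threads per block, 2^24 elements fit, while 2^25 elements already need 65536 blocks and do not. In the CUDA program itself, calling cudaGetLastError() right after the kernel launch should report the failed launch instead of letting it pass silently and look fast.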
How do you do the data transfer between device and host?
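(If you do the math, at around 2^26 floats the 512 MB of memory is full: a vector addition needs three vectors, and three vectors of 2^26 floats take 3 x 256 MB = 768 MB. The GPU then cannot hold all the data, so either the allocation fails or the data has to be exchanged frequently over PCI Express, which is very slow.)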
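The memory estimate can be checked with a few lines of host-side arithmetic. This is only a sketch: it assumes the vector addition keeps three float vectors on the device (two inputs and one output) and treats the full 512 MB as usable, which is optimistic. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for num_vectors float vectors of n elements each.
std::uint64_t vector_bytes(std::uint64_t n, std::uint64_t num_vectors) {
    assert(num_vectors > 0);
    return n * num_vectors * sizeof(float);
}

// True if those vectors fit in device_bytes of device memory.
bool fits_in_device_memory(std::uint64_t n, std::uint64_t num_vectors,
                           std::uint64_t device_bytes) {
    return vector_bytes(n, num_vectors) <= device_bytes;
}
```

Three vectors of 2^25 floats take 384 MB and fit in 512 MB, while three vectors of 2^26 floats take 768 MB and do not. On the real card, part of the memory is already in use, so cudaMemGetInfo() is a better source for the amount actually free.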