GPU speedup query

I ran a modified version of the vector addition example from the CUDA SDK, in which I create two vectors of size 100 and add them to get a third. I ran it on a Tesla 1060 GPU and got a speedup of around 10x. Shouldn’t it be 100x? I was using a threadsperblock size of 30. Please find the code attached. Thanks in advance.

vectorAdd.cu (4.61 KB)

Vector addition is limited by memory bandwidth, and the memory bandwidth of a GPU typically is around 10x that of a CPU.
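To put rough numbers on that, here is a back-of-the-envelope model. It assumes single-precision floats (4 bytes each, as I believe the SDK sample uses) and that run time is dominated entirely by memory traffic; N is the number of elements and B is the sustained memory bandwidth in bytes per second, both symbols introduced here for illustration. Each element needs two loads and one store for a single addition:

```latex
T \approx \frac{3 \cdot 4\,N}{B}
\qquad\Longrightarrow\qquad
\text{speedup} \approx \frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU}}}
= \frac{B_{\mathrm{GPU}}}{B_{\mathrm{CPU}}} \approx 10
```

Under this model the speedup depends only on the bandwidth ratio, not on how many threads the GPU can run, which is why the core count does not translate directly into a 100x gain.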

Also, GPUs want massively parallel tasks: they can run thousands of threads in parallel. So if you really reach a 10x speedup with a vector of only 100 elements, that’s a pretty good result.

Finally, threadsperblock should be a multiple of 32 (or better yet, 64) to avoid wasting resources through partially occupied warps.
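A minimal sketch of what that could look like for the 100-element case. This is not the attached vectorAdd.cu; the names vecAdd, roundUpToWarp and blocksFor are made up for illustration, and it assumes float vectors and a warp size of 32:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Element-wise c[i] = a[i] + b[i], one thread per element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)  // guard: the last block may extend past the end of the vectors
        c[i] = a[i] + b[i];
}

// Hypothetical helper: round a requested block size up to a multiple of 32,
// so that no warp is only partially filled.
int roundUpToWarp(int threads) { return ((threads + 31) / 32) * 32; }

// Hypothetical helper: number of blocks needed to cover n elements.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int n = 100;
    const int threadsPerBlock = roundUpToWarp(30);            // 30 becomes 32
    const int blocksPerGrid = blocksFor(n, threadsPerBlock);  // 4 blocks for n = 100
    const size_t bytes = n * sizeof(float);

    // Host input and output vectors.
    float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // Device copies of the three vectors.
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(da, db, dc, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    // Verify against the same sum computed on the host.
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (hc[i] != ha[i] + hb[i]) ++errors;
    printf("threadsPerBlock=%d blocksPerGrid=%d errors=%d\n",
           threadsPerBlock, blocksPerGrid, errors);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return errors == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

With a block size of 30, each block still occupies a full 32-lane warp but leaves two lanes idle; rounding up to 32 (or using 64) avoids that. For a vector this small the difference will be hard to measure, since only a handful of warps are launched either way.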