Increasing the size of the vectors won’t help from a performance-measurement point of view if your timing includes the data transfers, since a larger vector also means more data to transfer. This particular educational kernel also uses only one threadblock, which is not a high-performance approach to CUDA programming.
The code you are working with has been stripped down to the bare essentials to highlight the concepts being presented, and such code is easy to break. Taking what is basically the most introductory CUDACast and then using Q&A here to advance your CUDA knowledge is not very efficient. Take advantage of the further learning materials that are available:
https://developer.nvidia.com/gpu-computing-webinars
When posting questions here, you’re more likely to get useful help if you show the code you’re working with. Yes, in this case, I can go hunt for it on the internet, but you’ve made changes, right? So show the exact code. It’s not hard to do.
If you want to do CUDA programming, you’d be well advised to learn more about error handling. First, google “proper cuda error checking” and read the first link, from Stack Overflow. For educational purposes, that type of error checking is frequently left out so as not to obscure the concepts being introduced.
So any time you are having trouble with a CUDA code, your first steps should be:
- make sure you are doing proper CUDA error checking on all kernel launches and API calls
- run your code with cuda-memcheck (a very useful debugging utility)
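As a sketch of what that first step looks like (the macro name here is my own choice, not something from the cudacast), an error-check wrapper in the style of the Stack Overflow answer might be:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime API call; report file/line on failure and abort.
#define cudaCheck(call)                                               \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Kernel launches do not return an error code directly, so check
// the launch and then the kernel's execution separately:
//
//   VectorAdd<<< 1, SIZE >>>(d_a, d_b, d_c, SIZE);
//   cudaCheck(cudaGetLastError());      // catches an invalid launch config
//   cudaCheck(cudaDeviceSynchronize()); // catches errors during execution
```

The `cudaGetLastError()` call after the launch is the one that would flag the problem discussed below.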
To answer your question, I assume that you modified this line (only):
#define SIZE 1024
to some higher value.
That line, among other things, determines the number of threads per block. There are no CUDA GPUs currently available that can run a kernel with more than 1024 threads per block. When you change SIZE to a higher number, you are increasing the vector lengths, but you are also modifying the launch configuration of the kernel:
VectorAdd<<< 1, SIZE >>>(d_a, d_b, d_c, SIZE);
^^^
That kernel won’t launch when SIZE is greater than 1024. The programming guide covers many topics, including limits like this:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications
With proper CUDA error checking on this kernel launch, you would discover that one of the launch configuration parameters was invalid, and you’d be well on your way to understanding and solving the problem.
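One common fix, sketched here rather than taken from the cudacast itself, is to hold the block size fixed at a legal value and launch enough blocks to cover the whole vector, adding a bounds check in the kernel for the partial last block (the THREADS_PER_BLOCK name and the kernel body are assumptions; only the VectorAdd signature comes from the code above):

```cuda
#define SIZE (1 << 20)          // vector length, now independent of block size
#define THREADS_PER_BLOCK 256   // safely within the 1024-thread limit

__global__ void VectorAdd(const int *a, const int *b, int *c, int n)
{
    // Compute a unique global index from block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the last block may be partially full
        c[i] = a[i] + b[i];
}

// Launch: round the block count up so every element is covered.
int blocks = (SIZE + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
VectorAdd<<< blocks, THREADS_PER_BLOCK >>>(d_a, d_b, d_c, SIZE);
```

With this pattern you can grow SIZE freely (and use many threadblocks, addressing the performance point made at the top) without ever hitting the threads-per-block limit.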