VectorAdd example from CUDACast #2

Hello, I am new to CUDA programming. I work in Visual Studio 2012 Pro with the CUDA Toolkit v6.0 installed, and I am trying to recreate the example shown in CUDACast #2 on YouTube. It is my first CUDA program.

Although everything works as shown in the video, when I check execution times I find that with CUDA I get a slower program, not a faster one.

I measure the times using a timer.h header similar to the one used in CUDACast #3: the CPU version takes about 0.000012 seconds, while the CUDA version takes about 0.000063 seconds. Please help!

I am using a GeForce GTX 750.

Your first CUDA program is intended to teach you the basics, not necessarily be fast. Not every CUDA program will be faster than some corresponding CPU program. Depending on how you do the timing, Vector Add will not necessarily be faster (by itself) on the GPU, due perhaps to the time cost of transferring the data to and from the GPU.
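To see where the time actually goes, you can time the kernel separately from the transfers using CUDA events. This is only a sketch (not the cudacast code); it assumes device pointers d_a, d_b, d_c, host arrays h_a, h_b, h_c of SIZE ints, and the VectorAdd kernel from the example:

```cuda
// Sketch: compare the cost of the kernel alone vs. the whole sequence
// (copies in, kernel, copy out) using cudaEvent timers.
cudaEvent_t start, stop;
float kernel_ms = 0.0f, total_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Time the full sequence, including host<->device transfers
cudaEventRecord(start);
cudaMemcpy(d_a, h_a, SIZE * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, SIZE * sizeof(int), cudaMemcpyHostToDevice);
VectorAdd<<<1, SIZE>>>(d_a, d_b, d_c, SIZE);
cudaMemcpy(h_c, d_c, SIZE * sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&total_ms, start, stop);

// Time the kernel by itself
cudaEventRecord(start);
VectorAdd<<<1, SIZE>>>(d_a, d_b, d_c, SIZE);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&kernel_ms, start, stop);
```

If total_ms is much larger than kernel_ms, your measurement is dominated by the data transfers, not the computation.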

Thank you very much for the quick response. So I tried increasing the size of the vectors to produce more computing workload. The CPU execution time goes up, but the CUDA code can no longer add the vectors properly; the result is all zeros. Why is this happening?

Increasing the size of the vectors won’t help from a performance measurement point of view, if your timing is including the data transfers, since increasing the size of the vectors also increases the size of the data to be transferred. This particular educational kernel is also using only one threadblock, which is not a high-performance approach to CUDA programming.

The code you are working with has been stripped down to the bare essentials to highlight the concepts being presented, and that makes it easy to break. Taking what is basically the most introductory CUDACast and then using Q&A here to advance your CUDA knowledge is not very efficient. Take advantage of the further learning materials that are available:

https://developer.nvidia.com/gpu-computing-webinars

When posting questions here, you’re more likely to get useful help if you show the code you’re working with. Yes, in this case, I can go hunt for it on the internet, but you’ve made changes, right? So show the exact code. It’s not hard to do.

If you want to do CUDA programming, you’ll be well advised to learn more about error handling. First of all, google “proper cuda error checking” and read the first link, from Stack Overflow. For educational purposes, that type of error checking is frequently left out, so as not to obscure the concepts being introduced.

So any time you are having trouble with a CUDA code, your first steps should be:

  1. make sure you are doing proper CUDA error checking on all CUDA kernels and API calls
  2. run your code with cuda-memcheck (a very useful debugging utility).
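For reference, the error-checking pattern that search turns up boils down to something like the following. This is a sketch, not production code, and the macro name cudaCheck is my own:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime API call; print the error string and bail on failure.
#define cudaCheck(call)                                               \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Kernel launches return no status, so check them in two steps:
//   VectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, SIZE);
//   cudaCheck(cudaGetLastError());        // catches an invalid launch config
//   cudaCheck(cudaDeviceSynchronize());   // catches errors during execution
```

And cuda-memcheck needs no code changes at all: just run `cuda-memcheck ./my_app` and it will report out-of-bounds accesses and launch failures.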

To answer your question, I assume that you modified this line (only):

#define SIZE 1024

to some higher value.

That line, among other things, determines the number of threads per block. No CUDA GPU currently available can run more than 1024 threads per block. When you change SIZE to a higher number, you are increasing the vector lengths, but you are also modifying the config parameters of the kernel launch:

VectorAdd<<< 1, SIZE >>>(d_a, d_b, d_c, SIZE);
                ^^^

That kernel won’t launch when SIZE is greater than 1024. The programming guide covers many topics, including limits like this:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications

With proper CUDA error checking on this kernel launch, you would discover that one of the config parameters was invalid, and you’d be well on your way to understanding and solving the problem.
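One common way to handle larger vectors (a sketch, not the cudacast code) is to fix the threads-per-block and compute the number of blocks from SIZE, then guard the kernel so the last, partially-full block doesn’t run off the end of the arrays:

```cuda
#define SIZE (1 << 20)          // vector length, now independent of block size
#define THREADS_PER_BLOCK 256   // must be <= 1024 on current GPUs

__global__ void VectorAdd(const int *a, const int *b, int *c, int n)
{
    // Global index of this thread across all blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: blocks * threads may exceed n
        c[i] = a[i] + b[i];
}

// Launch: round the block count up so every one of the SIZE elements is covered.
int blocks = (SIZE + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
VectorAdd<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, SIZE);
```

With SIZE = 1048576 and 256 threads per block, that launch uses 4096 blocks, which also gives the GPU enough parallel work to be worth measuring.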