An Easy Introduction to CUDA C and C++

I suspect you are not setting a GPU breakpoint. If you are stepping on the host, it will not automatically step into GPU code because different threads are running it, so you need to set a breakpoint in the GPU function. Please check the debugger docs / tutorials.

Not sure what you are asking. That's the only header needed by this example.

Yeah, you are right. I've only been studying this for two days. Thanks for your reply!!

I know this is old, but in your kernel call, why do you pass "2.0f"? Is that because there are two floating point operations taking place in your kernel?

It's just an arbitrary value for 'A' in the 'A*X + Y' expression.
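To make the role of that constant concrete, here is a minimal host-side sketch of the same 'A*X + Y' update. The function name saxpy_host and the use of std::vector are assumptions for illustration, not code from the post; the point is only that 'a' is a scalar multiplier and is unrelated to the number of floating-point operations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for SAXPY ("Single-precision A*X Plus Y").
// `a` is just the scalar multiplier applied to every element of x.
void saxpy_host(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());  // both vectors must have the same length
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = a * x[i] + y[i];    // one multiply and one add per element
    }
}
```

For example, calling saxpy_host(2.0f, x, y) with every x[i] equal to 1.0f and every y[i] equal to 2.0f leaves every y[i] equal to 4.0f. Passing 3.0f instead would simply give 5.0f.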

That's what I figured after looking at it a bit closer: it's just a generic constant.

Thanks!

There's a missing > in a < code > HTML tag, just search for "coden" and it should be the only instance on this page (OK, other than mine!)

Fixed -- thanks!

Just a small edit, there's a missing '\' before the 'n' in the printf statement. This post has been a very useful, simple and concise starting point for getting into CUDA. Thanks!

Thank you for the post. I am new to CUDA and would like to ask about some errors I came across. Running 'nvcc -o saxpy saxpy.cu' at my command prompt gives me 'Cannot find compiler 'cl.exe' in PATH'. I also get the errors shown in the screenshot for the sample code I plan to run. Does this indicate a mistake in the compiler installation?
I am using Visual Studio 2015 and have an NVIDIA GeForce 820M.
https://uploads.disquscdn.c...
Thank you
Shrikanth Yadav

Do you have CUDA 8 installed? Previous versions did not support Visual Studio 2015. I can't see the errors in your screenshot.

Hi
I was able to solve the problem after reinstalling both VS15 and CUDA. The compiling issue is also solved. Thank you

Hi, I'm new to this and I have a final project in which I need to use CUDA to handle big data.
My question: how can I read a file in CUDA C++?

The host code (which does the file loading) is just regular C/C++. So load files just like you normally would.
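As a rough illustration of "just like you normally would", here is a minimal host-side sketch that reads whitespace-separated floats from a text file into a std::vector. The helper name load_floats and the text file format are assumptions for this example. Once the data is in host memory, it can be copied to the device with cudaMemcpy like any other host array.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads whitespace-separated floats from a text file
// into host memory. Returns an empty vector if the file cannot be opened.
std::vector<float> load_floats(const std::string& path)
{
    assert(!path.empty());      // a file name is required
    std::vector<float> values;
    std::ifstream in(path);     // plain C++ file I/O, nothing CUDA-specific
    float v;
    while (in >> v) {           // stops at end of file or first non-number
        values.push_back(v);
    }
    return values;
}
```

The resulting values.data() and values.size() * sizeof(float) are what you would typically hand to cudaMemcpy. For very large files, reading in chunks or using a binary format is likely a better fit, but that is a design choice outside this sketch.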

Hi, I've run both the .cu code and the .cuf code for this example. The .cu code runs as is and gives the proper result, but the .cuf code returns 2.00000. I'm new to CUDA and am wondering if you have any idea why the results are different? The machine I'm using has 2 GPUs installed.

From pgforums... Pascal GPUs need to explicitly generate binaries from cuda-8.0... See link for solution, if interested.
https://www.pgroup.com/user...

What about freeing allocated memory?
I'm a beginner in CUDA C, but I think, formally, you should free the memory you allocated.

Hi Jacek, great point. I corrected this omission in the post.

I am a CUDA beginner... Thanks for the great tutorial, helped me a lot in getting started!

My question concerns the execution configuration: Out of curiosity, I am also outputting the blockIdx, blockDim and threadIdx for every thread of the saxpy kernel (added one line to void saxpy: printf("Block idx.x, dim.x, threadIdx.x: %i %i %i\n", blockIdx.x, blockDim.x,threadIdx.x);)

Now I created this output 3 times with different execution configurations:
1) above original: I get a list of 4096 rows, the second column for all rows is 256. The sum is 1,048,576 (== N, as expected).
2) configuration: saxpy<<<(N+511)/512, 512>>>(N, 2.0f, d_x, d_y); I still get 4096 rows, but this time the second column is always 512. The total number is 2,097,152.
3) configuration: saxpy<<<(N+127)/128, 128>>>(N, 2.0f, d_x, d_y); I still get 4096 rows, but this time the second column is always 128. The total number is 524,288.

I don't understand... Why is the total number of threads always 4096, while the total product of dimension*threads does not preserve N in all three cases?

Also, in the follow-up post (on measuring), the integer division of the execution configuration is changed from /256 to /512, but the comment line still reads "SAXPY on 1M elements". What am I missing?

I think if you look at your code carefully you'll discover that each example is actually trying to print N (= 1M) lines. But you are running up against the printf FIFO default size of 1 MB and getting many fewer than that printed. If you call cudaDeviceSetLimit(cudaLimitPrintfFifoSize, X); for some large value of X you'll get more, but you may instead want to limit it to only print the first thread of every block instead ("if (threadIdx.x == 0)").
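To check the arithmetic behind the three execution configurations, here is a small host-side sketch. The function names blocks_for and threads_launched are made up for this example. It applies the same ceiling division as the launches, (N + threads - 1) / threads, and multiplies back to get the total number of threads launched.

```cpp
#include <cassert>

// Number of blocks from the ceiling division used in the launches above,
// e.g. (N + 255) / 256 for 256 threads per block.
int blocks_for(int n, int threads_per_block)
{
    assert(n > 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Total threads launched. This is always >= n, which is why kernels
// launched this way usually guard memory accesses with a bounds check.
long long threads_launched(int n, int threads_per_block)
{
    return static_cast<long long>(blocks_for(n, threads_per_block)) * threads_per_block;
}
```

With N = 1,048,576 this gives 4096 blocks of 256 threads, 2048 blocks of 512 threads, and 8192 blocks of 128 threads, and the product is N in all three cases. So every configuration still covers 1M elements. Only the printed output was cut short.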