An Easy Introduction to CUDA C and C++

Originally published at: https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/

Learn more with these hands-on DLI courses: Fundamentals of Accelerated Computing with CUDA C/C++ Fundamentals of Accelerated Computing with CUDA Python Update (January 2017): Check out a new, even easier introduction to CUDA! This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel…

I think there's an error in the line "In this case we use cudaMemcpyHostToDevice to specify that the first argument is a host pointer and the second argument is a device pointer." Shouldn't it be "In this case we use cudaMemcpyHostToDevice to specify that the first argument is a device pointer and the second argument is a host pointer."? I think you swapped the host pointer and the device pointer.

Thanks Nisarg, good catch. I've fixed this error.
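
The corrected sentence matches the standard C memcpy convention: destination first, source second. cudaMemcpy takes (dst, src, count, kind) in that same order. Below is a minimal host-only C++ sketch of the convention, using std::memcpy as a stand-in because a real host-to-device copy needs the CUDA runtime and a GPU. The helper names copy_to and upload_stand_in are hypothetical and do not come from the post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: the destination is the FIRST argument, mirroring the
// (dst, src, count) order shared by memcpy and cudaMemcpy.
inline void copy_to(std::vector<float>& dst, const std::vector<float>& src) {
    assert(dst.size() == src.size());
    if (src.empty()) return;  // nothing to copy
    std::memcpy(dst.data(), src.data(), src.size() * sizeof(float));
}

// Stand-in for the host-to-device step: device_stand_in plays the role of d_x,
// host plays the role of x. No GPU is involved in this sketch.
inline std::vector<float> upload_stand_in(const std::vector<float>& host) {
    std::vector<float> device_stand_in(host.size(), 0.0f);
    copy_to(device_stand_in, host);  // destination first, source second
    return device_stand_in;
}
```

In real CUDA code the equivalent call would look like cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice), with the device pointer d_x as the destination.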

Hi, thanks for this tutorial!

I have a GeForce 210 card, and when I run this program, I get

Max error: 2.000000

Whereas you see a max error of zero. It seems the y array is never modified on my setup, yet nvcc reports no compiler errors. When I print the result of max(maxError, abs(y[i]-4.0f)), I see the value 2.000000 every time, indicating that nothing happened to the y array on the device.

Do you have any advice on what might be going wrong here?

Thanks,
Charles.
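
The reasoning in this comment can be checked on the host alone. The sketch below assumes the values used in the original post (x[i] = 1.0f, y[i] = 2.0f, a = 2.0f, expected result 4.0f), none of which appear in this thread. The function names max_error and saxpy_host are hypothetical. If the kernel never runs, y keeps its initial value of 2.0f, so the error against 4.0f is exactly 2.0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute deviation of any element of y from the expected value.
inline float max_error(const std::vector<float>& y, float expected) {
    float maxError = 0.0f;
    for (float v : y) maxError = std::max(maxError, std::fabs(v - expected));
    return maxError;
}

// Host-side reference SAXPY: y = a*x + y, element by element.
inline void saxpy_host(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < y.size(); ++i) y[i] = a * x[i] + y[i];
}
```

Under those assumed values, calling max_error(y, 4.0f) before saxpy_host gives 2.0 (the symptom reported here), and calling it afterwards gives 0.0.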

I suspect a CUDA error. Unfortunately, the code example doesn't check for errors, because that is the lesson of the follow-up post. Can you add error checking as shown in the post http://devblogs.nvidia.com/... and report which errors you see?

Ah, I did have some errors. I was using Ubuntu 13.10 and tried some "workarounds" involving third-party PPAs, since this isn't an officially supported platform. It seems I couldn't get the workarounds to work, so instead of tracking down the problem, I installed 12.04 LTS and followed the official instructions. Now everything seems to be working and I get the "Max error: 0.00000" output. Yay!

Thank you for taking the time to help me, Mark.

My pleasure, glad you got it working.

There are typos in the first paragraphs of both the Device Code and Host Code sections: x_d and y_d should be d_x and d_y.

I've fixed these. Thank you!

I got the error "#include expects filename". What is the filename?

Sorry about that, the code got mangled by the site. I've fixed it. (The filename is <stdio.h>).

There are missing '&'s in front of 'd_x' and 'd_y' when calling cudaMalloc in the first code snippets of the Host Code section.

Fixed. Thanks!

Thanks for your post. I'm new to CUDA. When I compile your code, I get the same result, "Max error: 0.00..0n". However, when I debug it, execution doesn't seem to enter the __global__ void saxpy function. Is that normal?

I mean it enters saxpy, but it doesn't step through these lines:

int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
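
To see what these two lines compute, here is a host-only C++ sketch that emulates the indexing with ordinary loops. It is not how the GPU executes the kernel (threads run in parallel there). The function name emulate_saxpy and the parameters numBlocks and threadsPerBlock are hypothetical stand-ins for the launch configuration, with block and thread standing in for blockIdx.x and threadIdx.x.

```cpp
#include <cassert>
#include <vector>

// Each (block, thread) pair computes a global index i = block*threadsPerBlock + thread.
// The guard i < n skips surplus threads when n is not a multiple of the block size.
// Returns the number of skipped (out-of-range) threads.
inline int emulate_saxpy(int numBlocks, int threadsPerBlock, int n, float a,
                         const std::vector<float>& x, std::vector<float>& y) {
    assert(static_cast<int>(x.size()) >= n);
    assert(static_cast<int>(y.size()) >= n);
    int skipped = 0;
    for (int block = 0; block < numBlocks; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            int i = block * threadsPerBlock + thread;  // blockIdx.x*blockDim.x + threadIdx.x
            if (i < n) {
                y[i] = a * x[i] + y[i];
            } else {
                ++skipped;  // this thread has no element to work on
            }
        }
    }
    return skipped;
}
```

For example, 3 blocks of 4 threads cover 12 indices. With n = 10, the last two threads fail the i < n test and do nothing.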

Sorry, one more question!
How does "int N = 1<<20" work?
I have no idea.

1<<20 shifts the value 1 left 20 bits. This is equivalent to 2 raised to the power of 20 == 1048576.
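
The shift can be checked at compile time with a short C++ sketch. The helper name pow2 is hypothetical, and the sketch assumes a 32-bit int, so the shift count must stay below 31.

```cpp
#include <cassert>

// 1 << k shifts the value 1 left by k bits, which equals 2 to the power k
// (valid for 0 <= k < 31 when int is 32 bits wide).
constexpr int pow2(int k) { return 1 << k; }

// Compile-time checks: each extra shift doubles the value.
static_assert(pow2(0) == 1, "2^0 is 1");
static_assert(pow2(10) == 1024, "2^10 is 1024");
static_assert(pow2(20) == 1048576, "2^20 is 1048576");
```

So int N = 1<<20 is just a compact way to write N = 1048576, roughly one million elements.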

What debugger are you using? You need to use a GPU-aware debugger such as cuda-gdb or Nsight (Visual Studio Edition or Eclipse Edition).

Thanks, Mark! Then N should be 1048576/32 = 32768, right?
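
A short sketch may help separate the quantities in this question. N is the number of array elements, so it stays at 1048576. What gets divided is the work, into thread blocks. The sketch assumes 256 threads per block, as in the original post's kernel launch, which is not shown in this thread. The helper name blocks_for is hypothetical.

```cpp
#include <cassert>

// Rounded-up integer division: how many blocks of threadsPerBlock threads
// are needed so that every one of the n elements gets a thread.
constexpr int blocks_for(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// With the assumed 256 threads per block, 2^20 elements need 4096 blocks.
static_assert(blocks_for(1 << 20, 256) == 4096, "1048576 / 256");
// Rounding up matters when n is not a multiple of the block size.
static_assert(blocks_for(1000, 256) == 4, "ceil(1000 / 256)");
```

The figure 1048576/32 = 32768 would be the number of 32-thread warps needed to cover the array, not a new value for N.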

I'm using Nsight, but in this function:

__global__
void saxpy(int n, float a, float *x, float *y)
{ <---- here
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

I cannot step past the opening brace marked above. The breakpoint shows only as an empty red circle.

Is any header file needed besides stdio.h?