An Easy Introduction to CUDA C and C++

jwitsoe · November 11, 2013, 11:41pm

Originally published at: https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/

Learn more with these hands-on DLI courses: Fundamentals of Accelerated Computing with CUDA C/C++ and Fundamentals of Accelerated Computing with CUDA Python.

Update (January 2017): Check out a new, even easier introduction to CUDA!

This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel…

anon50869021 · February 16, 2014, 6:15am

I think there's an error in the line "In this case we use cudaMemcpyHostToDevice to specify that the first argument is a host pointer and the second argument is a device pointer.". Shouldn't it be "In this case we use cudaMemcpyHostToDevice to specify that the first argument is a device pointer and the second argument is a host pointer."? I think you exchanged host pointer with device pointer.

anon95180265 · February 17, 2014, 12:40am

Thanks Nisarg, good catch. I've fixed this error.
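
For readers following this exchange, here is a minimal sketch of the argument order being discussed. The array names (h_x, h_out, d_x) and the tiny size are assumptions chosen for illustration, not taken from the article, and error checking is left out for brevity. cudaMemcpy takes the destination first and the source second, so the direction flag decides which of the two is the device pointer.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
  const int N = 4;                          // tiny size, for illustration only
  float h_x[N]   = {1.0f, 2.0f, 3.0f, 4.0f}; // host input (hypothetical name)
  float h_out[N] = {0.0f};                   // host output (hypothetical name)
  float *d_x = NULL;                         // device buffer

  cudaMalloc(&d_x, N * sizeof(float));

  // cudaMemcpy(dst, src, bytes, kind): the destination always comes first.
  // Host -> device: first argument is the device pointer, second is the host pointer.
  cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);

  // Device -> host: first argument is the host pointer, second is the device pointer.
  cudaMemcpy(h_out, d_x, N * sizeof(float), cudaMemcpyDeviceToHost);

  printf("%f %f\n", h_out[0], h_out[N - 1]);

  cudaFree(d_x);
  return 0;
}
```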

anon14663003 · March 13, 2014, 1:18pm

Hi, thanks for this tutorial!

I have a GeForce 210 card, and when I run this program, I get

Max error: 2.000000

Whereas you see a max error of zero. It seems like the y[i] array is not getting operated on with my computer setup, but I get no compiler errors with nvcc. When I print out the result of max(maxError, abs(y[i]-4.0f)), I see the value 2.000000 every time, indicating that nothing happened to the y array on the device.

Do you have any advice on what might be going wrong here?

Thanks,
Charles.

anon95180265 · March 14, 2014, 2:12am

I suspect a CUDA error. Unfortunately the code example doesn't check for errors, because that is the lesson of the follow-up post. Can you add error checking as shown in the post http://devblogs.nvidia.com/... and tell me which errors you see?
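
A rough sketch of what such error checking might look like, for readers who want to try it before reading the follow-up post. The check helper is hypothetical (it is not the code from the linked post), and the launch configuration of 256 threads per block is an assumption. The runtime calls used are cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

// Hypothetical helper: report a failed CUDA call instead of ignoring it.
static void check(cudaError_t err, const char *what)
{
  if (err != cudaSuccess)
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
}

int main(void)
{
  int N = 1<<20;
  float *d_x = NULL, *d_y = NULL;

  check(cudaMalloc(&d_x, N*sizeof(float)), "cudaMalloc d_x");
  check(cudaMalloc(&d_y, N*sizeof(float)), "cudaMalloc d_y");
  check(cudaMemset(d_x, 0, N*sizeof(float)), "cudaMemset d_x");
  check(cudaMemset(d_y, 0, N*sizeof(float)), "cudaMemset d_y");

  // Assumed launch configuration: 256 threads per block.
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  // A kernel launch returns no status, so query it explicitly.
  check(cudaGetLastError(), "saxpy launch");          // errors detected at launch
  check(cudaDeviceSynchronize(), "saxpy execution");  // errors raised while running

  check(cudaFree(d_x), "cudaFree d_x");
  check(cudaFree(d_y), "cudaFree d_y");
  return 0;
}
```

If the kernel never runs (for example because of a driver or installation problem), the y array is left unchanged, which would be consistent with the constant "Max error: 2.000000" reported above.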

anon14663003 · March 15, 2014, 11:19am

Ah, I did have some errors. I was using Ubuntu 13.10 and tried some "workarounds" involving third-party PPAs, since this isn't an officially supported platform. I couldn't get the workarounds to work, it seems. Instead of tracking down the problem, I installed 12.04 LTS and followed the official instructions. Now everything seems to be working and I get the "Max error: 0.00000" output. Yay!

Thank you for taking the time to help me, Mark.

anon95180265 · March 16, 2014, 11:03pm

My pleasure, glad you got it working.

anon3296546 · March 17, 2014, 3:47pm

There are typos in the first paragraphs of both the Device Code and Host Code sections: x_d and y_d should be d_x and d_y.

anon95180265 · March 18, 2014, 12:50am

I've fixed these. Thank you!

anon78042882 · July 9, 2014, 3:17pm

I got the error "#include expects filename". What is the filename?

anon95180265 · July 10, 2014, 1:00am

Sorry about that, the code got mangled by the site. I've fixed it. (The filename is <stdio.h>).

anon98545019 · July 12, 2015, 8:20am

There are missing '&'s in front of 'd_x' and 'd_y' when calling cudaMalloc in the first code snippets of the Host Code section.

anon95180265 · July 12, 2015, 11:11pm

Fixed. Thanks!

anon28772862 · September 3, 2015, 2:37am

Thanks for your post. I'm a newbie to CUDA. When I compile your code, I get the same result, "Max error: 0.00..0n". However, when I debug it, execution doesn't enter the __global__ void saxpy function. Is that normal?

anon28772862 · September 3, 2015, 2:38am

I mean it gets into saxpy, but it doesn't step through these lines:

int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];

anon28772862 · September 3, 2015, 2:40am

Sorry, one more!
How does "int N = 1<<20" work?
I have no idea.

anon95180265 · September 3, 2015, 2:45am

1<<20 shifts the value 1 left 20 bits. This is equivalent to 2 raised to the power of 20 == 1048576.
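
A small host-only sketch of that equivalence, written as plain C++ with compile-time checks. The pow2 helper is hypothetical and exists only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the article): 2 raised to `bits`, via a left shift.
constexpr int pow2(int bits) { return 1 << bits; }

// Same expression as in the question above.
constexpr int N = 1 << 20;

// Each left shift by one bit doubles the value, so shifting 1 by 20 bits gives 2^20.
static_assert(N == 1048576, "1<<20 equals 2 to the power of 20");
static_assert(pow2(10) == 1024, "1<<10 equals 2 to the power of 10");
static_assert(pow2(10) * pow2(10) == N, "2^10 * 2^10 == 2^20");

// One float array of N elements therefore occupies 4 MiB.
static_assert(N * sizeof(float) == static_cast<std::size_t>(4) * 1024 * 1024,
              "N floats take 4 MiB");
```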

anon95180265 · September 3, 2015, 2:46am

What debugger are you using? You need to use a GPU-aware debugger such as cuda-gdb or Nsight (Visual Studio Edition or Eclipse Edition).

anon28772862 · September 3, 2015, 5:19am

Thanks Mark!! Then N should be 1048576/32 = 32768, right?

And I'm using Nsight, but

__global__
void saxpy(int n, float a, float *x, float *y)
{  <---- here
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

In this function, I cannot step past the first brace, which I pointed to above. I just see an empty red circle.
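
On the N question in this post, a host-only sketch may help. It assumes 256 threads per block, a grid of (N+255)/256 blocks, and inputs x = 1.0, y = 2.0, a = 2.0; none of these values are quoted in this thread. N is the element count and stays at 1048576. The division applies only to the number of blocks. The two loops below are a hypothetical CPU stand-in for the block and thread indices, not real CUDA code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Number of blocks needed so that blocks * threadsPerBlock >= n (rounds up).
int blocks_needed(int n, int threadsPerBlock)
{
  return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical CPU emulation of the saxpy launch: the outer loop plays the
// role of blockIdx.x and the inner loop plays the role of threadIdx.x.
float emulate_saxpy_max_error(int n, int threadsPerBlock)
{
  std::vector<float> x(n, 1.0f), y(n, 2.0f);  // assumed initial values
  const float a = 2.0f;                       // assumed scale factor

  const int blocks = blocks_needed(n, threadsPerBlock);
  for (int block = 0; block < blocks; ++block) {
    for (int thread = 0; thread < threadsPerBlock; ++thread) {
      int i = block*threadsPerBlock + thread;  // same formula as in the kernel
      if (i < n) y[i] = a*x[i] + y[i];         // guard skips indices past the end
    }
  }

  // Every element should now be 2*1 + 2 = 4.
  float maxError = 0.0f;
  for (int i = 0; i < n; ++i)
    maxError = std::max(maxError, std::fabs(y[i] - 4.0f));
  return maxError;
}
```

With these assumptions, blocks_needed(1<<20, 256) gives 4096 blocks, and each of the 1048576 elements is handled by exactly one (block, thread) pair.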

anon28772862 · September 3, 2015, 5:20am

Is any header file needed besides stdio.h?
