An Even Easier Introduction to CUDA

Originally published at: https://developer.nvidia.com/blog/even-easier-introduction-cuda/

Learn more with these hands-on DLI courses:
- Fundamentals of Accelerated Computing with CUDA C/C++
- Fundamentals of Accelerated Computing with CUDA Python

This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA. I wrote a previous “Easy Introduction” to CUDA in 2013 that has been very popular over…

Hi Sir, thanks for the tutorial. I tried this code on a Tesla T4 and found that using 4096 blocks does not improve performance to the microsecond (us) level compared to 1 block of 256 threads. What might be the reason? Thanks.

Thanks for the post Mark. To get accurate profiling, is it still a good idea to put cudaDeviceReset() just prior to exiting? https://devblogs.nvidia.com...

Also, is it possible to get this level of timing via code? cudaEventElapsedTime does not seem to have this same level of precision.

Thanks Mark. I tried this and it did not work as a straight copy and paste. The cudaMallocManaged call did not appear to do anything. This is on a Titan X with up-to-date drivers and NSight. I replaced the cudaMallocManaged functionality with the equivalent cudaMalloc and cudaMemcpy calls, which sorted it. Am I missing something wrt cudaMallocManaged?

It's a really good post btw. Thanks - I have learned much.

What do you mean did not appear to do anything? Did you get an error? Incorrect results? What CUDA version do you have installed? Is it an "NVIDIA Titan X" (Pascal) or "GeForce GTX Titan X" (Maxwell)?

I think the Windows tools are more dependent on cudaDeviceReset(). I kept it out of this post to keep things simple. cudaEventElapsedTime() should have the same level of precision, but in more complex apps you may get things in your timing that you didn't intend.

I believe the most reliable way to accurately time is to run your kernel many times in a loop, followed by cudaDeviceSynchronize() or cudaStreamSynchronize(), and use a high precision CPU timer (like std::chrono) to wrap the whole loop and the sync. Then divide by the number of iterations.
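That pattern might look roughly like this (a sketch, not from the post itself; the kernel name `add` and the grid/block sizes are assumptions based on the example code below, and this needs an NVIDIA GPU and nvcc to run):

```cuda
#include <chrono>
#include <iostream>
#include "cuda_runtime.h"

__global__ void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

void timeKernel(int n, float *x, float *y)
{
  const int iterations = 100;  // run many times to amortize launch overhead

  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < iterations; i++)
    add<<<1, 256>>>(n, x, y);
  cudaDeviceSynchronize();  // include the sync inside the timed region
  auto stop = std::chrono::high_resolution_clock::now();

  double us = std::chrono::duration<double, std::micro>(stop - start).count();
  std::cout << "Average kernel time: " << us / iterations << " us\n";
}
```

Dividing the wall-clock time for the whole loop plus the final sync by the iteration count averages out launch overhead and host-side timer jitter.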

CUDA 8 latest. It's the Pascal card. After executing the cudaMallocManaged function the variable pointed to address 0x0. I'll put error checking into your original code in the morning and try to get more diagnostic information.
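One common way to add that error checking is a macro around each runtime call (a sketch; the `checkCuda` name is made up for illustration, and this needs nvcc and a CUDA-capable GPU):

```cuda
#include <cstdio>
#include <cstdlib>
#include "cuda_runtime.h"

// Fail loudly with the CUDA error string instead of silently
// leaving the pointer at 0x0. The macro name is illustrative.
#define checkCuda(call)                                           \
  do {                                                            \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error at %s:%d: %s\n",                \
              __FILE__, __LINE__, cudaGetErrorString(err));       \
      exit(EXIT_FAILURE);                                         \
    }                                                             \
  } while (0)

int main(void)
{
  int N = 1 << 20;
  float *x = nullptr;

  // If cudaMallocManaged is unsupported on this platform/build target,
  // this prints the reason rather than returning a null pointer silently.
  checkCuda(cudaMallocManaged(&x, N * sizeof(float)));

  checkCuda(cudaFree(x));
  return 0;
}
```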

I cannot access a variable allocated using 'cudaMallocManaged' on the host. It throws an '0xC0000005: Access violation writing location 0x00000000' error. I can access it fine in the device kernel. Am I missing something? I am using MS Visual Studio Community 2015 with CUDA Runtime 8.0 on a GTX 1070.

I think I had the same problem. The variable pointed to '0x0' and was not accessible by the host, though it was accessible by the device. I know it works the old way, using separate variables for host and device. But it would be good if we could make it work without all those cudaMemcpy calls, like this example does. Tell me if you have any luck.

Did you guys change the program at all? If you share your changes I can try to diagnose.

Did you change the code, or are you getting the error in the initialization loop?

I didn't change the code at all, just copied it into Visual Studio. Yes, I got the error in the initialization loop. So I put the initialization in the kernel, and got an error at verification, as expected. When I removed the verification it ran fine. From this I concluded that the memory is not accessible by the host. Furthermore, in the VS debugging watch it shows the same way as a variable allocated by cudaMalloc, with a "can't read memory" message indicating that it is not accessible to the CPU, as I understand it. And the address is 0x0. Sorry for the long response. I really appreciate your help.

I just put the unchanged original on a box with a GT 755M and it fails similarly. What have I done wrong?

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1 << 20;
  float *x, *y;

  // Allocate Unified Memory, accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // Initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i] - 3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}

It is not sufficient to change the target to x64 in the CUDA solution properties and to Active (x64) in the project properties; you must change it in the solution Configuration Manager. It then executes fine. The error is simply a 'not supported' one.
Thanks Mark - suspected it was my bad.

You need to change the solution platform to x64 in the Configuration Manager. Changing it in the CUDA properties and in the Active platform dropdown doesn't get it done.

Thanks, that worked!

any chance you can post a screenshot of this, @mike_agius:disqus? Thanks!

I have used this tech way back and am quite pleased with the results.

The speedup is so addictive that I swear by my program.

I encourage everyone to give CUDA a try for solving problems, even if it requires NVIDIA-specific hardware.

This is my code, developed when I was a grad student:

http://bit.ly/2ko6rNb

Sure https://uploads.disquscdn.c...