An Even Easier Introduction to CUDA

Do you have a similar post for Python?

Hello,
I profiled the add_grid code on a Titan X GPU (the latest one), and only got about 3.5 ms for the add() function. I don't know what is wrong. The nvcc version is release 8.0, V8.0.44, and my code is below. I've tried various block sizes, from 128 to 8*128, but none gave me anything faster than 3 ms.

#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;
  int blockSize = 8*128;
  int numBlocks = (N + blockSize - 1) / blockSize;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}


Not exactly the same, but here are two older posts on "CUDA Python" using the NumbaPro compiler.
https://devblogs.nvidia.com...
https://devblogs.nvidia.com...

and also a few screencasts:
https://devblogs.nvidia.com...
https://devblogs.nvidia.com...
https://devblogs.nvidia.com...

Mark,
I'm a CUDA noobie and felt some intimidation entering the world of GPU processing. This introduction was quite helpful for me. I ran the examples on a p2.xlarge instance (1 Tesla K80 GPU) under CentOS 7 on Amazon's EC2, after installing the CUDA 8 Toolkit. All the examples ran as expected without any issues.

The next day it occurred to me that I didn't fully understand and appreciate the 'Grid Stride Loop' in the final example that takes advantage of multiple SMs. That's because I didn't read the article on the Grid-Stride Loop. I couldn't figure out why there was a 'for' loop, rather than the monolithic kernel (with just the if condition). Your post on the Grid-Stride Loop provided the insight I needed.

To sum up:
I had a lot of fun reading your article and running the examples.
Thank you for this great introduction to CUDA!
Jonathan Joseph


Is there a tutorial for cuDNN similar to this CUDA Introduction?

cuDNN is really intended to be used by developers of Deep Learning Frameworks, so there isn't a beginner's introduction, as such. I suspect what you really want is a beginner's introduction to Caffe, or TensorFlow, or Torch, etc. There are many such introductions available on the web.

...the example needs a continuation: how do you engage the second GPU on the K80 with the same code?

That makes sense. I've started looking at TensorFlow, which I have running on an Amazon p2.xlarge instance that hosts a Tesla K80 GPU. Thanks for your insights.

Not a bad suggestion! I'll add it to my list. :)

We calculated the block size and the number of blocks so that each one of the one million elements is worked on by a single thread. I've seen CUDA code for the same problem, but a for loop wasn't used in any of it. Can you please explain why we are using the for loop, and why we can't just add all the elements with their individual threads?

The post explains that this is known as a grid-stride loop and it even links to a post explaining the benefits: https://devblogs.nvidia.com...

You can do it without a loop, but then you are limited to processing as many elements as the maximum number of threads you can launch (which is a lot, but not as many as the largest array you can allocate).
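For illustration, here is a minimal sketch of that loop-free ("monolithic") version, assuming you launch enough blocks to cover all N elements; the if guard keeps the extra threads in the last block from running past the end of the array:

// Monolithic kernel: one thread per element, no grid-stride loop
__global__
void add(int n, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)              // extra threads in the last block do nothing
    y[i] = x[i] + y[i];
}

// Launch with enough blocks to cover all N elements
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);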

The Udacity course Intro to Parallel Programming mentioned by the author is great!

Hi,

What is the answer to the question "If you have access to a Pascal-based GPU, try running add_grid.cu on it. Is performance better or worse than the K80 results? Why?" I have tried it with a Pascal Titan X GPU, and add_cuda indeed seems to run slower than on my GTX 750, but I didn't really get why; does it have to do with the new Unified Memory behavior?

I'm glad you asked! The short answer is that on pre-Pascal GPUs, which can't page fault, the data that has been touched on the CPU has to be copied (automatically) to the GPU before the kernel launches. But that copy is not included in the timing of the kernel in the profiler. On Pascal, only the data that is touched by the GPU is migrated, and it happens on demand (at the time of a GPU page fault) -- in this case, though, that means all the data. The cost of the page faults and migrations therefore impacts the runtime of the kernel on Pascal. If you run the kernel twice, though, you'll see that the second run is faster (because there are no page faults -- the data is already all on the GPU). There are other options, like initializing the data on the GPU. I'm preparing a follow-up post which goes over this in some detail. See https://devblogs.nvidia.com... and
https://devblogs.nvidia.com...
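As a hint at one of those options, here is a minimal sketch (assuming CUDA 8, a Pascal GPU, and x, y, N, numBlocks, and blockSize as in the post) that prefetches the managed arrays to the GPU before the launch, so the migrations are not counted in the kernel time:

// Prefetch the managed arrays to the GPU before launching the kernel
int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, N*sizeof(float), device, NULL);
cudaMemPrefetchAsync(y, N*sizeof(float), device, NULL);

add<<<numBlocks, blockSize>>>(N, x, y);
cudaDeviceSynchronize();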

I see, thank you!

Trying to get into parallel computing and CUDA. I had a look at the work from Kurt Keutzer (Berkeley) and Tim Mattson (Intel) on a language for parallel design patterns, and it looked amazing to me; see http://parlab.eecs.berkeley.... Is this work considered a reference in the domain, or is it just one attempt among many others? If the latter, what should I look at in terms of design patterns to consolidate my understanding from the long-built experience of others?

Hi, I'm a newbie learning CUDA. I followed the same steps as in this page and got pretty similar results, except for the last experiment. I wrote exactly the same code on my computer, but the last experiment shows the same elapsed time as the second-to-last one, about 3.6 ms for both. My GPU is a GTX 1080, and using the max-threads query I found that it can run 1048576 threads at the same time. Can you give me a hint about this result? I really need someone to help me.

You might look at my answer to the question from Max below. The GTX 1080 is a Pascal GPU. I still plan to write a follow-up post to explain this, but I have been swamped with other work.

Thank you for the fast answer.
So what you are saying is that the 1080 is a Pascal GPU, and when we see the results of nvprof, we see the whole time spent, including memory transfers and such? That makes sense given the small time difference between the last and second-to-last experiments. But your K80 does not?

First, remember that before the kernel runs, all the data in the array was last accessed on the CPU (during initialization). In both cases (K80 and 1080) nvprof is just timing the kernel execution. But on 1080, when the GPU accesses the Unified Memory array it page faults (because it's resident in CPU memory). The threads that fault on each page have to wait for the page to be migrated to the GPU. These migrations get included in the kernel run time measured by nvprof.

But K80 is incapable of page faulting. So when you launch the kernel, first the driver has to migrate all the pages touched by the CPU (whole array in this case) back to the GPU before running the kernel. Since it happens before the kernel runs on the GPU, nvprof doesn't include that in the kernel run time.

On your 1080, if you run the kernel twice, you'll see the minimum kernel run time will be lower than the maximum -- because once the memory is paged in by the first run, the page faults don't happen on the second run.
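A minimal sketch of that experiment, assuming the same add kernel and managed arrays as above (note that y ends up being updated twice, which is fine for a timing comparison but changes the final values):

// First launch: on Pascal this pays the page-fault and migration cost
add<<<numBlocks, blockSize>>>(N, x, y);
cudaDeviceSynchronize();

// Second launch: the data is already resident on the GPU, so no faults
add<<<numBlocks, blockSize>>>(N, x, y);
cudaDeviceSynchronize();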
