An Even Easier Introduction to CUDA

Do you have a similar post for Python?

Hello,
I profiled the add_grid code on a Titan X GPU (the latest one), and only got about 3.5 ms for the add() function. I don't know what is wrong. The nvcc version is release 8.0, V8.0.44, and my code is below. I've tried various block sizes, from 128 to 8*128, but none gave me anything faster than 3 ms.

#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;
  int blockSize = 8*128;
  int numBlocks = (N + blockSize - 1) / blockSize;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}


Not exactly the same, but here are two older posts on "CUDA Python" using the NumbaPro compiler.
https://devblogs.nvidia.com...
https://devblogs.nvidia.com...

and also a few screencasts:
https://devblogs.nvidia.com...
https://devblogs.nvidia.com...
https://devblogs.nvidia.com...

Mark,
I'm a CUDA noobie and felt some intimidation entering the world of GPU processing. This introduction was quite helpful for me. I ran the examples on a p2.xlarge instance (1 Tesla K80 GPU) under CentOS 7 on Amazon's EC2, after installing the CUDA 8 Toolkit. All the examples ran as expected without any issues.

The next day it occurred to me that I didn't fully understand and appreciate the 'Grid Stride Loop' in the final example that takes advantage of multiple SMs. That's because I didn't read the article on the Grid-Stride Loop. I couldn't figure out why there was a 'for' loop, rather than the monolithic kernel (with just the if condition). Your post on the Grid-Stride Loop provided the insight I needed.

To sum up:
I had a lot of fun reading your article and running the examples.
Thank you for this great introduction to CUDA!
Jonathan Joseph


Is there a tutorial for cuDNN similar to this CUDA Introduction?

cuDNN is really intended to be used by developers of Deep Learning Frameworks, so there isn't a beginner's introduction, as such. I suspect what you really want is a beginner's introduction to Caffe, or TensorFlow, or Torch, etc. There are many such introductions available on the web.

...the example needs a continuation: how do you engage the second GPU on the K80 with the same code?

That makes sense. I've started looking at TensorFlow, which I have running on an Amazon p2.xlarge instance that hosts a Tesla K80 GPU. Thanks for your insights.

Not a bad suggestion! I'll add it to my list. :)

We calculated the block size and the number of blocks so that each one of the one million elements is worked on by a single thread. I've seen CUDA code for the same problem, but a for loop wasn't used in any of it. Can you please explain why we are using the for loop, and why we can't just add all the elements with their individual threads?

The post explains that this is known as a grid-stride loop and it even links to a post explaining the benefits: https://devblogs.nvidia.com...

You can do it without a loop, but then you are limited to processing as many elements as the maximum number of threads you can launch (which is a lot, but not as many as the largest array you can allocate).
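For illustration, here is a minimal sketch of that loop-free ("monolithic") version, assuming you launch enough blocks to cover all N elements; the if guard keeps the extra threads in the last block from running past the end of the array:

// Monolithic kernel: one thread per element, no grid-stride loop
__global__
void add(int n, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)              // extra threads in the last block do nothing
    y[i] = x[i] + y[i];
}

// Launch with enough blocks to cover all N elements
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);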

The Udacity course Intro to Parallel Programming mentioned by the author is great!

Hi,

What is the answer to the question "If you have access to a Pascal-based GPU, try running add_grid.cu on it. Is performance better or worse than the K80 results? Why?" I have tried it with a Pascal Titan X GPU, and add_cuda indeed seems to run slower than on my GTX 750, but I didn't really get why; does it have to do with the new Unified Memory behavior?

I'm glad you asked! The short answer is that on pre-Pascal GPUs, which can't page fault, the data that has been touched on the CPU has to be copied (automatically) to the GPU before the kernel launches. But that copy is not included in the timing of the kernel in the profiler. On Pascal, only the data that is touched by the GPU is migrated, and it happens on demand (at the time of a GPU page fault) -- in this case, though, that means all the data. The cost of the page faults and migrations therefore impacts the runtime of the kernel on Pascal. If you run the kernel twice, though, you'll see that the second run is faster (because there are no page faults -- the data is already all on the GPU). There are other options, like initializing the data on the GPU. I'm preparing a follow-up post which goes over this in some detail. See https://devblogs.nvidia.com... and
https://devblogs.nvidia.com...
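As a hint at one of those options, here is a minimal sketch (assuming CUDA 8, a Pascal GPU, and x, y, N, numBlocks, and blockSize as in the post) that prefetches the managed arrays to the GPU before the launch, so the migrations are not counted in the kernel time:

// Prefetch the managed arrays to the GPU before launching the kernel
int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, N*sizeof(float), device, NULL);
cudaMemPrefetchAsync(y, N*sizeof(float), device, NULL);

add<<<numBlocks, blockSize>>>(N, x, y);
cudaDeviceSynchronize();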

I see, thank you!

Trying to get into parallel computing and CUDA. I had a look at the work from Kurt Keutzer (Berkeley) and Tim Mattson (Intel) on a language for parallel design patterns, and it looked amazing to me; see http://parlab.eecs.berkeley.... Is this work considered a reference in the domain, or is it just one attempt among many others? If the latter, what should I look at in terms of design patterns to consolidate my understanding from the long-built experience of others?

Hi, I'm a newbie learning CUDA. I followed the same steps as in this page and got pretty similar results, except for the last experiment. I wrote exactly the same code on my computer, but the last experiment shows the same elapsed time as the second-to-last one, about 3.6 ms for both. My GPU is a GTX 1080, and using the max-threads query I found that it can run 1048576 threads at the same time. Can you give me a hint about this result? I really need someone to help me.

You might look at my answer to the question from Max below. The GTX 1080 is a Pascal GPU. I still plan to write a follow-up post to explain this, but I have been swamped with other work.

Thank you for the fast answer.
So what you are saying is that the 1080 is a Pascal GPU, and when we see the results of nvprof, we see the whole time spent, including memory transfers and such? That makes sense given the small time difference between the last and second-to-last experiments. But your K80 does not?

First, remember that before the kernel runs, all the data in the array was last accessed on the CPU (during initialization). In both cases (K80 and 1080) nvprof is just timing the kernel execution. But on 1080, when the GPU accesses the Unified Memory array it page faults (because it's resident in CPU memory). The threads that fault on each page have to wait for the page to be migrated to the GPU. These migrations get included in the kernel run time measured by nvprof.

But K80 is incapable of page faulting. So when you launch the kernel, first the driver has to migrate all the pages touched by the CPU (whole array in this case) back to the GPU before running the kernel. Since it happens before the kernel runs on the GPU, nvprof doesn't include that in the kernel run time.

On your 1080, if you run the kernel twice, you'll see the minimum kernel run time will be lower than the maximum -- because once the memory is paged in by the first run, the page faults don't happen on the second run.
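A minimal sketch of that experiment, assuming the same add kernel and managed arrays as above (note that y ends up being updated twice, which is fine for a timing comparison but changes the final values):

// First launch: on Pascal this pays the page-fault and migration cost
add<<<numBlocks, blockSize>>>(N, x, y);
cudaDeviceSynchronize();

// Second launch: the data is already resident on the GPU, so no faults
add<<<numBlocks, blockSize>>>(N, x, y);
cudaDeviceSynchronize();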
