Thank you very much for this CUDA introduction page.
On first read I also skipped the grid-stride-loop section and missed the point of the kernel loop... Maybe it could be emphasized a little more ;)
Use Ubuntu for development instead of that Microsoft shit.
Is it the case that in this grid-stride loop you don't actually need the for structure at all, because each thread runs only one step (its own index)?
This, but on Ubuntu 18.04 with GeForce 750 Ti. I tried running the code on the other site but it gave no output at all.
UPDATE: Nevermind! Just had to reboot the system..
Actually, some of the code from the older version of this post is working, but the 750 Ti is based on Maxwell (post-Kepler), so I still have to learn why the add_cuda code isn't working.
Thanks for the neat tutorial.
But you didn't explain (N+blockSize-1)/blockSize, while a line above that you only wrote about rounding up N/blockSize.
The comment "being careful to round up" is referring specifically to `(N+blockSize-1)/blockSize`, because that's exactly what that code does.
Not in Visual Studio. Also, you should mention this in the text!
I assume you mean Visual Studio on Windows, since I don't have any issue with this on Visual Studio Code on Linux. NVCC's behavior regarding CUDA headers should be the same across OSes. If it's not, I suggest you ask on devtalk.nvidia.com
I'm using VS 2017 and CUDA 10.1. I was able to compile your sample only after including that line.
Wow, that is a great gain in time. I appreciate your explanation, which was really easy to follow and understand. Thanks!
You say "a Tesla P100 GPU... has 56 SMs, each capable of supporting up to 2048 active threads" but the table in the linked page says there are only 64 "FP32 CUDA Cores / SM" and 32 "FP64 CUDA Cores / SM" for a total of 96 cores per SM. And it's just 1 thread per core, right? What am I missing?
Active threads are not the same as instruction throughput. A thread is an execution context. So the SM has enough resources to maintain the state of 2048 active threads (64 warps). The SM issues and retires instructions from one or more warps per cycle. So you have floating-point units capable of 2 FP32 warp-instructions per cycle (64 FMA per cycle across two warps), or 1 FP64 warp-instruction per cycle. These may be issued from any of the up to 64 resident warps that have an instruction ready at a given cycle. Of course, there's more to a program than floating point, so the SMs also have units for executing integer math, branches, logic ops, memory loads and stores, etc., each with its own maximum throughput.
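If you want to check those limits on your own GPU, something like this should print them (a rough sketch, error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  int device = 0;
  int smCount = 0, maxThreadsPerSM = 0;
  cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
  cudaDeviceGetAttribute(&maxThreadsPerSM, cudaDevAttrMaxThreadsPerMultiProcessor, device);
  // A P100 reports 56 and 2048 here, i.e. up to 114688 resident threads on the whole chip.
  printf("%d SMs, up to %d resident threads per SM\n", smCount, maxThreadsPerSM);
  return 0;
}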
I see. Thanks for the reply. That's the ambiguity of the word "thread" I suppose. Since you mentioned integers, one more question: Have you known anyone to benefit from GPUs (in terms of calculations-per-dollar) with computation that is heavy on 128-bit integer arithmetic? I understand that the emulation required for such oversized integers comes at a high price.
Hi, Mark!
First of all, thank you for this post, it helped me start with cuda programming.
But I have some questions.
Did you forget some line of code when transitioning from 1 block to multiple blocks?
In my tests, 1 block takes around 3 ms to process the 1 million sums, but changing to multiple blocks had no effect at all.
It's still taking the same time as 1 block.
Maybe it's something on my config?
I'm on Manjaro Linux, with CUDA 10.1.243 and the nvidia-430xx drivers.
GTX 1060 3GB
Another thing I noticed is that the stride loop in the multiple-blocks example is unnecessary, since the total number of threads matches the array size.
Is there a way to create the threads to always match the array size exactly, or is it safer to always use the stride loop? For reference, the non-strided version I have in mind is sketched below.
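A sketch based on the add kernel from the post, with an if guard instead of the loop:

__global__
void add(int n, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // The if guard replaces the for loop; it's needed because the last block
  // may contain threads with i >= n when n isn't a multiple of blockDim.x.
  if (i < n)
    y[i] = x[i] + y[i];
}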
Hi, Fergal,
I also didn't get a speed up when adding more blocks.
Did preloading the memory improve the run time for you?
Thanks.
Nevermind.
I found your other post explaining about synchronizing the memory before executing the kernel and it now runs at ~80 us with multiple blocks.
Thank you for doing these posts!
This is also my observation. If you print the value of 'stride' inside the kernel, it's the same as N (for the provided example). It seems the numBlocks value would have to be reduced in order for the 'for' loop to run more than a single iteration per thread. Can you clarify this?
Hi, I am having the same results (GTX 1060). How did you get over this?
Hello, I forget which post he talks about this in, but the gist of the problem is that (at least with the code presented here) the graphics card only fetches the values from RAM when you run your kernel and it first touches them. So every time a value from the array is needed, the graphics card stalls waiting for it to come from RAM.
So you need to actively synchronize the values to the graphics card before running your kernel.
I did it like this:
int device = -1;
cudaGetDevice( &device );   // ID of the GPU the kernel will run on
// Prefetch the managed array to the GPU (default stream) so the kernel
// doesn't fault the pages in one by one on first access.
cudaMemPrefetchAsync( x.data(), x.length() * sizeof( float ), device, nullptr );
In the code above, 'x' is from a class I defined.
I basically wrapped the code presented in this article inside a class and added some things so as to have an interface similar to std::vector.
So you just need to execute this before running your kernel. (Edit it to fit your code, of course)
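For anyone not wrapping things in a class, the same idea with the raw cudaMallocManaged pointers from the article looks roughly like this (a sketch; N, x, y, numBlocks, and blockSize as in the post):

int device = -1;
cudaGetDevice(&device);
// Move both managed arrays to the GPU up front so the kernel doesn't
// stall on page faults the first time each page is touched.
cudaMemPrefetchAsync(x, N * sizeof(float), device, 0);
cudaMemPrefetchAsync(y, N * sizeof(float), device, 0);
add<<<numBlocks, blockSize>>>(N, x, y);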