Thank you very much for this CUDA introduction page.
On first read I also skipped the grid-stride-loop section and missed the point of the kernel loop... Maybe it could be emphasized a little more ;)
Use Ubuntu for development instead of that Microsoft shit.
Is it the case that in this grid-stride loop you don't actually need the for structure at all, because each thread runs only one step (its own index)?
This, but on Ubuntu 18.04 with GeForce 750 Ti. I tried running the code on the other site but it gave no output at all.
UPDATE: Nevermind! Just had to reboot the system..
Actually, some of the code from the older version of this post is working, but the 750 Ti is based on Maxwell (post-Kepler), so I still have to learn why the add_cuda code isn't working.
Thanks for the neat tutorial.
But you didn't explain (N+blockSize-1)/blockSize, while a line above that you only wrote about rounding up N/blockSize.
The comment "being careful to round up" is referring specifically to `(N+blockSize-1)/blockSize`, because that's exactly what that code does.
Not in Visual Studio. Also, you should mention this in the text!
I assume you mean Visual Studio on Windows, since I don't have any issue with this on Visual Studio Code on Linux. NVCC's behavior regarding CUDA headers should be the same across OSes. If it's not, I suggest you ask on devtalk.nvidia.com
I'm using VS 2017 and CUDA 10.1. I was able to compile your sample only after including that line.
Wow, that is a great gain in time. I appreciate your explanation, which was really easy to follow and understand. Thanks!
You say "a Tesla P100 GPU... has 56 SMs, each capable of supporting up to 2048 active threads" but the table in the linked page says there are only 64 "FP32 CUDA Cores / SM" and 32 "FP64 CUDA Cores / SM" for a total of 96 cores per SM. And it's just 1 thread per core, right? What am I missing?
Active threads are not the same as instruction throughput. A thread is an execution context. So the SM has enough resources to maintain the state of 2048 active threads (64 warps). The SM issues and retires instructions from one or more warps per cycle. So you have floating-point units capable of 2 FP32 warp-instructions per cycle (64 FMA per cycle across two warps), or 1 FP64 warp-instruction per cycle. These may be issued from any of the up to 64 resident warps that have an instruction ready at a given cycle. Of course, there's more to a program than floating point, so the SMs also have units for executing integer math, branches, logic ops, memory loads and stores, etc., each with its own maximum throughput.
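If you want to check those limits on your own GPU, something like this should print them (a rough sketch, error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  int device = 0;
  int smCount = 0, maxThreadsPerSM = 0;
  cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
  cudaDeviceGetAttribute(&maxThreadsPerSM, cudaDevAttrMaxThreadsPerMultiProcessor, device);
  // A P100 reports 56 and 2048 here, i.e. up to 114688 resident threads on the whole chip.
  printf("%d SMs, up to %d resident threads per SM\n", smCount, maxThreadsPerSM);
  return 0;
}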
I see. Thanks for the reply. That's the ambiguity of the word "thread" I suppose. Since you mentioned integers, one more question: Have you known anyone to benefit from GPUs (in terms of calculations-per-dollar) with computation that is heavy on 128-bit integer arithmetic? I understand that the emulation required for such oversized integers comes at a high price.
Hi, Mark!
First of all, thank you for this post, it helped me start with cuda programming.
But I have some questions.
Did you forget some line of code when transitioning from 1 block to multiple blocks?
In my tests, 1 block takes around 3 ms to process the 1 million sums, but changing to multiple blocks had no effect at all.
It's still taking the same time as 1 block.
Maybe it's something on my config?
I'm on Manjaro Linux, with CUDA 10.1.243 and the nvidia-430xx drivers.
GTX 1060 3GB
Another thing I noticed is that the stride loop in the multiple-blocks example is unnecessary, since the total number of threads matches the array size.
Is there a way to create the threads to always match the array size exactly, or is it safer to always use the stride loop? For reference, the non-strided version I have in mind is sketched below.
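A sketch based on the add kernel from the post, with an if guard instead of the loop:

__global__
void add(int n, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // The if guard replaces the for loop; it's needed because the last block
  // may contain threads with i >= n when n isn't a multiple of blockDim.x.
  if (i < n)
    y[i] = x[i] + y[i];
}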
Hi, Fergal,
I also didn't get a speed up when adding more blocks.
Did preloading the memory improve the run time for you?
Thanks.
Nevermind.
I found your other post explaining about synchronizing the memory before executing the kernel and it now runs at ~80 us with multiple blocks.
Thank you for doing these posts!
This is also my observation. If you print the value of 'stride' inside the kernel, it's the same as N (for the provided example). It seems the numBlocks value would have to be reduced in order for the 'for' loop to run more than a single iteration per thread. Can you clarify this?
Hi, I am having the same results (GTX 1060). How did you get over this?
Hello, I forget which post he talks about this in, but the gist of the problem is that (at least with the code presented here) the graphics card only fetches the values from RAM when you run your kernel and it first touches them. So every time a value from the array is needed, the graphics card stalls waiting for it to come from RAM.
So you need to actively synchronize the values to the graphics card before running your kernel.
I did it like this:
int device = -1;
cudaGetDevice( &device );   // ID of the GPU the kernel will run on
// Prefetch the managed array to the GPU (default stream) so the kernel
// doesn't fault the pages in one by one on first access.
cudaMemPrefetchAsync( x.data(), x.length() * sizeof( float ), device, nullptr );
In the code above, 'x' is from a class I defined.
I basically wrapped the code presented in this article inside a class and added some things so as to have an interface similar to std::vector.
So you just need to execute this before running your kernel. (Edit it to fit your code, of course)
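For anyone not wrapping things in a class, the same idea with the raw cudaMallocManaged pointers from the article looks roughly like this (a sketch; N, x, y, numBlocks, and blockSize as in the post):

int device = -1;
cudaGetDevice(&device);
// Move both managed arrays to the GPU up front so the kernel doesn't
// stall on page faults the first time each page is touched.
cudaMemPrefetchAsync(x, N * sizeof(float), device, 0);
cudaMemPrefetchAsync(y, N * sizeof(float), device, 0);
add<<<numBlocks, blockSize>>>(N, x, y);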