An Even Easier Introduction to CUDA

Thanks for the explanation.
Now the call execution time went down to less than 100us

In Windows 10 I get this error:

nvcc fatal : Cannot find compiler 'cl.exe' in PATH

I have Visual Studio Community Edition installed and the NVidia samples work from within VSCE, but this example doesn't work from the command line.

What am I missing?

Finally! The script was full of red error but silly me tried other cuda header files. It seems that this doc need a slight update for VS 2019.

I am not sure if this problem is related to jetson-nano or nvprof for CUDA in general.
But when I try to profile the 1 thread version of the vector add program, system freezes.

sudo /usr/local/cuda/bin/nvprof --metrics gst_efficiency,gld_efficiency,gld_throughput,gst_throughput ./vector_add_cuda1

I have a jetson nano B1 with 4GB with latest L4T (all sw upto date).

But if i remove the --metrics it runs well:

rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof ./vector_add_cuda1
==8390== NVPROF is profiling process 8390, command: ./vector_add_cuda1
==8390== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
Max error: 0
==8390== Profiling application: ./vector_add_cuda1
==8390== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  215.38ms         1  215.38ms  215.38ms  215.38ms  vector_add(int, float*, float*)

Where as the 256 thread version I am able to profile and get the metrics as well:

Anyone has seen this problem before ? What could be the issue ?

Hello Mark,

Correct me if wrong, but the stride and the loop in the last execution configuration (<<<4096,256>>>) would be redundant since we configure the same number of threads as the vector size.

const int blockSize = 256;
const int numBlocks = (N + blockSize - 1) / blockSize;

printf("N = %d. With blockSize = 256, numBlocks = %d. Total Number of threads = %d\n", N, numBlocks, blockSize * numBlocks);

This prints:

N = 1048576. With blockSize = 256, numBlocks = 4096. Total Number of threads = 1048576

So i change the kernel to :

global void vector_add(int n, float *x, float *y)
{
int index = blockDim.x * blockIdx.x + threadIdx.x;
y[index] = x[index] + y[index];
}

And it works fine (maybe around 0.1ms faster).

I just made a kernel to use the maximum number of threads supported on the GPU like this

cudaDeviceProp deviceProp;
int dev = 0;

checkCuda(cudaSetDevice(dev));
checkCuda(cudaGetDeviceProperties(&deviceProp, dev));

const int MaxThreads =  deviceProp.multiProcessorCount * deviceProp.maxThreadsPerMultiProcessor;

const int blockSize = 256;
const int numBlocks = (MaxThreads + (blockSize - 1)) / blockSize;

printf("MaxThreads = %d. With blockSize = 256, numBlocks = %d. Total Number of threads = %d\n", MaxThreads, numBlocks, blockSize * numBlocks);

// Run kernel on 1M elements on the GPU
vector_add<<<numBlocks, blockSize>>>(N, x, y);

And use the same for loop in the kernel and it is around 0.5ms faster now

rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof  ./vector_add_cuda4
..
MaxThreads = 2048. With blockSize = 256, numBlocks = 8. Total Number of threads = 2048
Max error: 0
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.0879ms         1  1.0879ms  1.0879ms  1.0879ms  vector_add(int, float*, float*

Indeed, in this example a grid-stride loop is not required, because the data size is small. But with large data size, you may need a grid stride loop, as mentioned in the post on grid-stride loops linked from the post we are discussing.

BTW, 2048 seems like a very small value for the maximum threads a device would support. I suspect there’s a problem in your code. deviceProp.multiProcessorCount * deviceProp.maxThreadsPerMultiProcessor should be something much larger than 2048.

If you are processing 1048576 elements with 2048 threads and not looping inside the kernel, you are not processing all of your data (which would indeed be faster than processing all of the data).

I am running on the jetson nano, probably the smallest NVIDIA GPU :)! Here is my partial deviceQuery output:

rreddy78@jetson-nano:~/Desktop/Technical/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3956 MBytes (4148293632 bytes)
  ( 1) Multiprocessors , (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048

I am processing all the data with a the same for loop and stride as in example case. I get maxError = 0 output.
I just suspect that its faster because there is no overhead of blocks being retired and new blocks being scheduled on the SM. And also each thread does less work in the example case (handles just one element)

Hello I’m getting the same error as anon12585791:

for (int i = 0; i < N; i++) {
x[i] = 1.0f; <----- here it stops exception thrown

Exception thrown at 0x00007FF6E79D0F79 in a.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.

I am running the following:
Windows 10 1909 , 64 bit
Microsoft Visual Studio Community 2019
Version 16.8.3
CUDA 11.2.0
EVGA GTX 1080 GPU

I have confirmed that my project is indeed set to x64 but I still get an exception on the code listed above. The sample files that came with cuda 11 install compile and run fine so it appears to be installed correctly. Can someone advise why the example doesn’t work on my system?

It’s Sept 2021 and I got the same problem running the Notebook. I am using a Tesla K80. Similarly, it was solved by doing

%%shell

nvcc add.cu --gpu-architecture sm_37 -o add_cuda
./add_cud

nvcc warning : The ā€˜compute_35’, ā€˜compute_37’, ā€˜compute_50’, ā€˜sm_35’, ā€˜sm_37’ and ā€˜sm_50’ architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Max error: 0

Same crash point. I tried your solution, but it had no effect on the crash. I am using a home built box with NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] and Kubuntu 20.04. There is no ā€œarchitectureā€ assigned to this card on its spec page: GeForce GTX 1050 Ti | Specifications | GeForce

Curious, nvcc man pages don’t even have a --gpu-architecture command line option? (% man ā€˜nvcc’)

I think the problem is with the cudaMallocManaged() a few lines earlier.

When the hello-world application doesn’t ā€œjust workā€ right out the box, then I guess we all have to settle in for a long, painful bleeding-edge experience. (CUDA on Linux in 2021. I thought I had waited long enough!)

Peter Leopold
Portland ME and Cambridge MA

Thank you for the reply. I changed the assigned architecture to sm_35 (if I remember it properly) and it ran perfectly. I found it to be strange that it compiled with sm_37 and returned a different result.

It seems that the reason why --gpu_architecture is not in ā€œman nvccā€ is because it has been deprecated, so it is gone from the man pages before it is gone from the technical spec! I think that is a first in my experience!

In any event, I used all possible values of sm_xx and computer_xx and nothing works. There is a nasty run-time bug in the cudaMallocManaged that cannot be ā€œarchitecturedā€ away. You have been able to avoid it, but I have not. Do you think nvidia will release the source code for cudaMallocManaged? ;)

I have a kind of a resolution. I followed Mark’s advice and printed out the error msg from the suspected malloc command. It suggested an incompatabliity between the drivers and the toolchain. I then used base ubuntu to install the latest toolchain, not the deb file downloaded from nvidia, which was blocked from installing for reasons not given in the error msg. The base ubuntu drivers were also blocked, but the error messages revealed the need to upgrade the drivers, which were blocked because of the need to upgrade the xserver and several auxiliary files. I started down that rabbit hole and had some success, although in one a linux image file was deployed, disconcertingly. Finally until one package simply didn’t install at all.

Here is a history:
1981 sudo apt-cache search cuda
1982 sudo apt-get install cuda-ll-4
1983 sudo apt-get install cuda-11-4
1984 sudo apt-get install cuda-11-3
1985 sudo apt-get install cuda-runtime-11-4
1986 sudo apt-get install cuda-drivers
1987 sudo apt-get install cuda-drivers-470
1988 sudo apt-get install nvidia-driver-470
1989 sudo apt-get install xserver-xorg-video-nvidia-470
1990 sudo apt-get install nvidia-driver-470
1991 sudo apt-get install cuda-drivers-470
1992 sudo apt-get install nvidia-settings
1993 sudo apt-get install nvidia-alternative
1994 sudo apt-get install glx-alternative-nvidia ← This is the one that just blocked.

1995 sudo apt-get install nvidia-legacy-check ← This one was recommended. It uninstalled many packages.

Disappointed, I decided to re-check the malloc error statement again for new clues. This time it just worked. So . . . I guess I fixed it.

I’m running the latest kubuntu (as of September 30, 2021) with a home built box and GP107 [GeForce GTX 1050 Ti] video card and layers of installed and uninstalled drivers whose exact configuration is now probably irreproducible. But the random walk through possibility space got me here. Mark’s – add.cu, the ā€œHello worldā€ of cuda programming – actually works.

please make it super clear there is a double underscore for global
i lost a lot of time only using one on each side

Hi @hendersonmotorcycles, and thanks for your comment. Correct, __global__, like many C++ extension keywords, uses double underscores.

Copying and pasting the code from the post rather than typing it is likely to result in fewer typos. Or use an IDE like vscode with a CUDA C++ extension (or a CUDA C++ extension for another editor) and it will recognize the misspelled keyword and highlight it for you. (Here’s another one for VIM, for example: GitHub - bfrg/vim-cuda-syntax: CUDA syntax highlighting for Vim)

Cheers,
Mark

There seems to be a chunk of text missing from the article.
I’m using MSVC to compile the first piece of code without any CUDA material. All works fine.
I modified the code to run CUDA and VS highlighted all the CUDA code as syntax errors as I expected.
I have 20 errors, all relating to the changes required to run CUDA.
Clearly this is a toolchain problem, not made clear in the article.
I’m running VS2022 and have installed NSight extension.
I’ve installed NVIDIA CUDA Development 11.7, drivers, etc.
Has anyone a clue what to do next?

Sorry for the delay, @nvidia1356. I just got a recommendation that the following forum would be a good place for you to ask for help with setting up CUDA on Windows:

[CUDA Setup and Installation - NVIDIA Developer Forums]

Hope it helps!

Thanks for the material!. A small comment

N = 1<<20 (bin) or 1048576(int)
1048576 / 256 = 4096

Therefore if the author aimed to use 4096 blocks as indicated in the figure, the addition of one block decreased by one unit (+ blockSize - 1) is not necessary. Please correct me if I’m wrong.

As I mentioned in the post:

I simply divide N by the block size (being careful to round up in case N is not a multiple of blockSize).

:)

If anyone with a newer GPU runs the profiler command as mentioned in the tutorial like this:

nvprof ./add_cuda

And you get this warning:

======== Warning: nvprof is not supported on devices with compute capability 8.0 and higher.
                  Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
                  Refer https://developer.nvidia.com/tools-overview for more details.

To resolve this, instead use the new profiler tool, like this:

nsys profile --stats=true ./add_cuda