An Even Easier Introduction to CUDA

anon95858191 · February 4, 2020, 3:06pm

Thanks for the explanation.
Now the call execution time went down to less than 100us

anon54008752 · February 6, 2020, 4:39pm

In Windows 10 I get this error:

nvcc fatal : Cannot find compiler 'cl.exe' in PATH

I have Visual Studio Community Edition installed and the NVidia samples work from within VSCE, but this example doesn't work from the command line.

What am I missing?

anon51241695 · March 14, 2020, 5:40pm

Finally! The script was full of red error but silly me tried other cuda header files. It seems that this doc need a slight update for VS 2019.

rreddy78 · November 11, 2020, 2:07pm

I am not sure if this problem is related to jetson-nano or nvprof for CUDA in general.
But when I try to profile the 1 thread version of the vector add program, system freezes.

sudo /usr/local/cuda/bin/nvprof --metrics gst_efficiency,gld_efficiency,gld_throughput,gst_throughput ./vector_add_cuda1

I have a jetson nano B1 with 4GB with latest L4T (all sw upto date).

But if i remove the --metrics it runs well:

rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof ./vector_add_cuda1
==8390== NVPROF is profiling process 8390, command: ./vector_add_cuda1
==8390== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
Max error: 0
==8390== Profiling application: ./vector_add_cuda1
==8390== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  215.38ms         1  215.38ms  215.38ms  215.38ms  vector_add(int, float*, float*)

Where as the 256 thread version I am able to profile and get the metrics as well:

Anyone has seen this problem before ? What could be the issue ?

rreddy78 · November 11, 2020, 3:22pm

Hello Mark,

Correct me if wrong, but the stride and the loop in the last execution configuration (<<<4096,256>>>) would be redundant since we configure the same number of threads as the vector size.

const int blockSize = 256;
const int numBlocks = (N + blockSize - 1) / blockSize;

printf("N = %d. With blockSize = 256, numBlocks = %d. Total Number of threads = %d\n", N, numBlocks, blockSize * numBlocks);

This prints:

N = 1048576. With blockSize = 256, numBlocks = 4096. Total Number of threads = 1048576

So i change the kernel to :

global void vector_add(int n, float *x, float *y)
{
int index = blockDim.x * blockIdx.x + threadIdx.x;
y[index] = x[index] + y[index];
}

And it works fine (maybe around 0.1ms faster).

I just made a kernel to use the maximum number of threads supported on the GPU like this

cudaDeviceProp deviceProp;
int dev = 0;

checkCuda(cudaSetDevice(dev));
checkCuda(cudaGetDeviceProperties(&deviceProp, dev));

const int MaxThreads =  deviceProp.multiProcessorCount * deviceProp.maxThreadsPerMultiProcessor;

const int blockSize = 256;
const int numBlocks = (MaxThreads + (blockSize - 1)) / blockSize;

printf("MaxThreads = %d. With blockSize = 256, numBlocks = %d. Total Number of threads = %d\n", MaxThreads, numBlocks, blockSize * numBlocks);

// Run kernel on 1M elements on the GPU
vector_add<<<numBlocks, blockSize>>>(N, x, y);

And use the same for loop in the kernel and it is around 0.5ms faster now

rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof  ./vector_add_cuda4
..
MaxThreads = 2048. With blockSize = 256, numBlocks = 8. Total Number of threads = 2048
Max error: 0
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.0879ms         1  1.0879ms  1.0879ms  1.0879ms  vector_add(int, float*, float*

Mark_Harris · November 12, 2020, 3:36am

Indeed, in this example a grid-stride loop is not required, because the data size is small. But with large data size, you may need a grid stride loop, as mentioned in the post on grid-stride loops linked from the post we are discussing.

BTW, 2048 seems like a very small value for the maximum threads a device would support. I suspect there’s a problem in your code. deviceProp.multiProcessorCount * deviceProp.maxThreadsPerMultiProcessor should be something much larger than 2048.

If you are processing 1048576 elements with 2048 threads and not looping inside the kernel, you are not processing all of your data (which would indeed be faster than processing all of the data).

rreddy78 · November 12, 2020, 7:05am

I am running on the jetson nano, probably the smallest NVIDIA GPU :)! Here is my partial deviceQuery output:

rreddy78@jetson-nano:~/Desktop/Technical/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3956 MBytes (4148293632 bytes)
  ( 1) Multiprocessors , (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048

I am processing all the data with a the same for loop and stride as in example case. I get maxError = 0 output.
I just suspect that its faster because there is no overhead of blocks being retired and new blocks being scheduled on the SM. And also each thread does less work in the example case (handles just one element)

mcowley · December 30, 2020, 9:03am

Hello I’m getting the same error as anon12585791:

for (int i = 0; i < N; i++) {
x[i] = 1.0f; <----- here it stops exception thrown

Exception thrown at 0x00007FF6E79D0F79 in a.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.

I am running the following:
Windows 10 1909 , 64 bit
Microsoft Visual Studio Community 2019
Version 16.8.3
CUDA 11.2.0
EVGA GTX 1080 GPU

I have confirmed that my project is indeed set to x64 but I still get an exception on the code listed above. The sample files that came with cuda 11 install compile and run fine so it appears to be installed correctly. Can someone advise why the example doesn’t work on my system?

eduardo4jesus · September 23, 2021, 5:46pm

It’s Sept 2021 and I got the same problem running the Notebook. I am using a Tesla K80. Similarly, it was solved by doing

%%shell

nvcc add.cu --gpu-architecture sm_37 -o add_cuda
./add_cud

nvcc warning : The ‘compute_35’, ‘compute_37’, ‘compute_50’, ‘sm_35’, ‘sm_37’ and ‘sm_50’ architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Max error: 0

peterleopold23 · September 29, 2021, 10:19pm

Same crash point. I tried your solution, but it had no effect on the crash. I am using a home built box with NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] and Kubuntu 20.04. There is no “architecture” assigned to this card on its spec page: GeForce GTX 1050 Ti | Specifications | GeForce

Curious, nvcc man pages don’t even have a --gpu-architecture command line option? (% man ‘nvcc’)

I think the problem is with the cudaMallocManaged() a few lines earlier.

When the hello-world application doesn’t “just work” right out the box, then I guess we all have to settle in for a long, painful bleeding-edge experience. (CUDA on Linux in 2021. I thought I had waited long enough!)

Peter Leopold
Portland ME and Cambridge MA

eduardo4jesus · September 29, 2021, 10:56pm

Thank you for the reply. I changed the assigned architecture to sm_35 (if I remember it properly) and it ran perfectly. I found it to be strange that it compiled with sm_37 and returned a different result.

peterleopold23 · September 29, 2021, 11:17pm

It seems that the reason why --gpu_architecture is not in “man nvcc” is because it has been deprecated, so it is gone from the man pages before it is gone from the technical spec! I think that is a first in my experience!

In any event, I used all possible values of sm_xx and computer_xx and nothing works. There is a nasty run-time bug in the cudaMallocManaged that cannot be “architectured” away. You have been able to avoid it, but I have not. Do you think nvidia will release the source code for cudaMallocManaged? ;)

peterleopold23 · September 30, 2021, 12:35pm

I have a kind of a resolution. I followed Mark’s advice and printed out the error msg from the suspected malloc command. It suggested an incompatabliity between the drivers and the toolchain. I then used base ubuntu to install the latest toolchain, not the deb file downloaded from nvidia, which was blocked from installing for reasons not given in the error msg. The base ubuntu drivers were also blocked, but the error messages revealed the need to upgrade the drivers, which were blocked because of the need to upgrade the xserver and several auxiliary files. I started down that rabbit hole and had some success, although in one a linux image file was deployed, disconcertingly. Finally until one package simply didn’t install at all.

Here is a history:
1981 sudo apt-cache search cuda
1982 sudo apt-get install cuda-ll-4
1983 sudo apt-get install cuda-11-4
1984 sudo apt-get install cuda-11-3
1985 sudo apt-get install cuda-runtime-11-4
1986 sudo apt-get install cuda-drivers
1987 sudo apt-get install cuda-drivers-470
1988 sudo apt-get install nvidia-driver-470
1989 sudo apt-get install xserver-xorg-video-nvidia-470
1990 sudo apt-get install nvidia-driver-470
1991 sudo apt-get install cuda-drivers-470
1992 sudo apt-get install nvidia-settings
1993 sudo apt-get install nvidia-alternative
1994 sudo apt-get install glx-alternative-nvidia ← This is the one that just blocked.

1995 sudo apt-get install nvidia-legacy-check ← This one was recommended. It uninstalled many packages.

Disappointed, I decided to re-check the malloc error statement again for new clues. This time it just worked. So . . . I guess I fixed it.

I’m running the latest kubuntu (as of September 30, 2021) with a home built box and GP107 [GeForce GTX 1050 Ti] video card and layers of installed and uninstalled drivers whose exact configuration is now probably irreproducible. But the random walk through possibility space got me here. Mark’s – add.cu, the “Hello world” of cuda programming – actually works.

hendersonmotorcycles · December 23, 2021, 9:39pm

please make it super clear there is a double underscore for global
i lost a lot of time only using one on each side

Mark_Harris · January 11, 2022, 4:22am

Hi @hendersonmotorcycles, and thanks for your comment. Correct, __global__, like many C++ extension keywords, uses double underscores.

Copying and pasting the code from the post rather than typing it is likely to result in fewer typos. Or use an IDE like vscode with a CUDA C++ extension (or a CUDA C++ extension for another editor) and it will recognize the misspelled keyword and highlight it for you. (Here’s another one for VIM, for example: GitHub - bfrg/vim-cuda-syntax: CUDA syntax highlighting for Vim)

Cheers,
Mark

nvidia1356 · June 8, 2022, 3:29pm

There seems to be a chunk of text missing from the article.
I’m using MSVC to compile the first piece of code without any CUDA material. All works fine.
I modified the code to run CUDA and VS highlighted all the CUDA code as syntax errors as I expected.
I have 20 errors, all relating to the changes required to run CUDA.
Clearly this is a toolchain problem, not made clear in the article.
I’m running VS2022 and have installed NSight extension.
I’ve installed NVIDIA CUDA Development 11.7, drivers, etc.
Has anyone a clue what to do next?

jwitsoe · July 13, 2022, 5:15pm

Sorry for the delay, @nvidia1356. I just got a recommendation that the following forum would be a good place for you to ask for help with setting up CUDA on Windows:

[CUDA Setup and Installation - NVIDIA Developer Forums]

Hope it helps!

Camilo.chiang · July 29, 2022, 9:10am

Thanks for the material!. A small comment

N = 1<<20 (bin) or 1048576(int)
1048576 / 256 = 4096

Therefore if the author aimed to use 4096 blocks as indicated in the figure, the addition of one block decreased by one unit (+ blockSize - 1) is not necessary. Please correct me if I’m wrong.

Mark_Harris · August 22, 2022, 9:38pm

As I mentioned in the post:

I simply divide N by the block size (being careful to round up in case N is not a multiple of blockSize).

:)

cdahms123 · July 23, 2023, 5:39pm

If anyone with a newer GPU runs the profiler command as mentioned in the tutorial like this:

nvprof ./add_cuda

And you get this warning:

======== Warning: nvprof is not supported on devices with compute capability 8.0 and higher.
                  Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
                  Refer https://developer.nvidia.com/tools-overview for more details.

To resolve this, instead use the new profiler tool, like this:

nsys profile --stats=true ./add_cuda

Topic		Replies	Views
더욱 쉬워진 CUDA 입문 Technical Blog - South Korea	0	127	September 6, 2024
Cuda program taking more time. CUDA Programming and Performance	15	7236	November 21, 2010
Unified Memory for CUDA Beginners Technical Blog	46	3302	December 1, 2023
Parallelism On Multiple Blocks Seems Broken CUDA Programming and Performance	6	931	March 14, 2021
How to choose how many threads/blocks to have? CUDA Programming and Performance	43	54470	June 7, 2022
An Easy Introduction to CUDA C and C++ Technical Blog	48	1855	July 19, 2018
Low or normal performance? CUDA Programming and Performance cuda	20	1510	November 13, 2020
What can't you do in CUDA that you'd like? Requests for the future CUDA Programming and Performance	407	137086	May 26, 2010
Wrong output when adding blocks what am I doing wrong? CUDA Programming and Performance	13	12171	December 4, 2007
Cuda code performance CUDA Programming and Performance	14	3367	December 16, 2014

An Even Easier Introduction to CUDA

Related topics