Thanks for the explanation.
Now the call execution time went down to less than 100us
In Windows 10 I get this error:
nvcc fatal : Cannot find compiler 'cl.exe' in PATH
I have Visual Studio Community Edition installed and the NVidia samples work from within VSCE, but this example doesn't work from the command line.
What am I missing?
Finally! The script was full of red error but silly me tried other cuda header files. It seems that this doc need a slight update for VS 2019.
I am not sure if this problem is related to jetson-nano or nvprof for CUDA in general.
But when I try to profile the 1 thread version of the vector add program, system freezes.
sudo /usr/local/cuda/bin/nvprof --metrics gst_efficiency,gld_efficiency,gld_throughput,gst_throughput ./vector_add_cuda1
I have a jetson nano B1 with 4GB with latest L4T (all sw upto date).
But if i remove the --metrics it runs well:
rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof ./vector_add_cuda1
==8390== NVPROF is profiling process 8390, command: ./vector_add_cuda1
==8390== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
Max error: 0
==8390== Profiling application: ./vector_add_cuda1
==8390== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 215.38ms 1 215.38ms 215.38ms 215.38ms vector_add(int, float*, float*)
Where as the 256 thread version I am able to profile and get the metrics as well:
Anyone has seen this problem before ? What could be the issue ?
Hello Mark,
Correct me if wrong, but the stride and the loop in the last execution configuration (<<<4096,256>>>) would be redundant since we configure the same number of threads as the vector size.
const int blockSize = 256;
const int numBlocks = (N + blockSize - 1) / blockSize;
printf("N = %d. With blockSize = 256, numBlocks = %d. Total Number of threads = %d\n", N, numBlocks, blockSize * numBlocks);
This prints:
N = 1048576. With blockSize = 256, numBlocks = 4096. Total Number of threads = 1048576
So i change the kernel to :
global void vector_add(int n, float *x, float *y)
{
int index = blockDim.x * blockIdx.x + threadIdx.x;
y[index] = x[index] + y[index];
}
And it works fine (maybe around 0.1ms faster).
I just made a kernel to use the maximum number of threads supported on the GPU like this
cudaDeviceProp deviceProp;
int dev = 0;
checkCuda(cudaSetDevice(dev));
checkCuda(cudaGetDeviceProperties(&deviceProp, dev));
const int MaxThreads = deviceProp.multiProcessorCount * deviceProp.maxThreadsPerMultiProcessor;
const int blockSize = 256;
const int numBlocks = (MaxThreads + (blockSize - 1)) / blockSize;
printf("MaxThreads = %d. With blockSize = 256, numBlocks = %d. Total Number of threads = %d\n", MaxThreads, numBlocks, blockSize * numBlocks);
// Run kernel on 1M elements on the GPU
vector_add<<<numBlocks, blockSize>>>(N, x, y);
And use the same for loop in the kernel and it is around 0.5ms faster now
rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof ./vector_add_cuda4
..
MaxThreads = 2048. With blockSize = 256, numBlocks = 8. Total Number of threads = 2048
Max error: 0
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 1.0879ms 1 1.0879ms 1.0879ms 1.0879ms vector_add(int, float*, float*
Indeed, in this example a grid-stride loop is not required, because the data size is small. But with large data size, you may need a grid stride loop, as mentioned in the post on grid-stride loops linked from the post we are discussing.
BTW, 2048 seems like a very small value for the maximum threads a device would support. I suspect thereās a problem in your code. deviceProp.multiProcessorCount * deviceProp.maxThreadsPerMultiProcessor should be something much larger than 2048.
If you are processing 1048576 elements with 2048 threads and not looping inside the kernel, you are not processing all of your data (which would indeed be faster than processing all of the data).
I am running on the jetson nano, probably the smallest NVIDIA GPU :)! Here is my partial deviceQuery output:
rreddy78@jetson-nano:~/Desktop/Technical/deviceQuery$ ./deviceQuery ./deviceQuery Starting... CUDA Device Query (Runtime API) version (CUDART static linking) Detected 1 CUDA Capable device(s) Device 0: "NVIDIA Tegra X1" CUDA Driver Version / Runtime Version 10.2 / 10.2 CUDA Capability Major/Minor version number: 5.3 Total amount of global memory: 3956 MBytes (4148293632 bytes) ( 1) Multiprocessors , (128) CUDA Cores/MP: 128 CUDA Cores GPU Max Clock rate: 922 MHz (0.92 GHz) Memory Clock rate: 13 Mhz Memory Bus Width: 64-bit L2 Cache Size: 262144 bytes Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 32768 Warp size: 32 Maximum number of threads per multiprocessor: 2048
I am processing all the data with a the same for loop and stride as in example case. I get maxError = 0 output.
I just suspect that its faster because there is no overhead of blocks being retired and new blocks being scheduled on the SM. And also each thread does less work in the example case (handles just one element)
Hello Iām getting the same error as anon12585791:
for (int i = 0; i < N; i++) {
x[i] = 1.0f; <----- here it stops exception thrown
Exception thrown at 0x00007FF6E79D0F79 in a.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.
I am running the following:
Windows 10 1909 , 64 bit
Microsoft Visual Studio Community 2019
Version 16.8.3
CUDA 11.2.0
EVGA GTX 1080 GPU
I have confirmed that my project is indeed set to x64 but I still get an exception on the code listed above. The sample files that came with cuda 11 install compile and run fine so it appears to be installed correctly. Can someone advise why the example doesnāt work on my system?
Itās Sept 2021 and I got the same problem running the Notebook. I am using a Tesla K80. Similarly, it was solved by doing
%%shell
nvcc add.cu --gpu-architecture sm_37 -o add_cuda
./add_cud
nvcc warning : The ācompute_35ā, ācompute_37ā, ācompute_50ā, āsm_35ā, āsm_37ā and āsm_50ā architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Max error: 0
Same crash point. I tried your solution, but it had no effect on the crash. I am using a home built box with NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] and Kubuntu 20.04. There is no āarchitectureā assigned to this card on its spec page: GeForce GTX 1050 Ti | Specifications | GeForce
Curious, nvcc man pages donāt even have a --gpu-architecture command line option? (% man ānvccā)
I think the problem is with the cudaMallocManaged() a few lines earlier.
When the hello-world application doesnāt ājust workā right out the box, then I guess we all have to settle in for a long, painful bleeding-edge experience. (CUDA on Linux in 2021. I thought I had waited long enough!)
Peter Leopold
Portland ME and Cambridge MA
Thank you for the reply. I changed the assigned architecture to sm_35 (if I remember it properly) and it ran perfectly. I found it to be strange that it compiled with sm_37 and returned a different result.
It seems that the reason why --gpu_architecture is not in āman nvccā is because it has been deprecated, so it is gone from the man pages before it is gone from the technical spec! I think that is a first in my experience!
In any event, I used all possible values of sm_xx and computer_xx and nothing works. There is a nasty run-time bug in the cudaMallocManaged that cannot be āarchitecturedā away. You have been able to avoid it, but I have not. Do you think nvidia will release the source code for cudaMallocManaged? ;)
I have a kind of a resolution. I followed Markās advice and printed out the error msg from the suspected malloc command. It suggested an incompatabliity between the drivers and the toolchain. I then used base ubuntu to install the latest toolchain, not the deb file downloaded from nvidia, which was blocked from installing for reasons not given in the error msg. The base ubuntu drivers were also blocked, but the error messages revealed the need to upgrade the drivers, which were blocked because of the need to upgrade the xserver and several auxiliary files. I started down that rabbit hole and had some success, although in one a linux image file was deployed, disconcertingly. Finally until one package simply didnāt install at all.
Here is a history:
1981 sudo apt-cache search cuda
1982 sudo apt-get install cuda-ll-4
1983 sudo apt-get install cuda-11-4
1984 sudo apt-get install cuda-11-3
1985 sudo apt-get install cuda-runtime-11-4
1986 sudo apt-get install cuda-drivers
1987 sudo apt-get install cuda-drivers-470
1988 sudo apt-get install nvidia-driver-470
1989 sudo apt-get install xserver-xorg-video-nvidia-470
1990 sudo apt-get install nvidia-driver-470
1991 sudo apt-get install cuda-drivers-470
1992 sudo apt-get install nvidia-settings
1993 sudo apt-get install nvidia-alternative
1994 sudo apt-get install glx-alternative-nvidia ā This is the one that just blocked.
1995 sudo apt-get install nvidia-legacy-check ā This one was recommended. It uninstalled many packages.
Disappointed, I decided to re-check the malloc error statement again for new clues. This time it just worked. So . . . I guess I fixed it.
Iām running the latest kubuntu (as of September 30, 2021) with a home built box and GP107 [GeForce GTX 1050 Ti] video card and layers of installed and uninstalled drivers whose exact configuration is now probably irreproducible. But the random walk through possibility space got me here. Markās ā add.cu, the āHello worldā of cuda programming ā actually works.
please make it super clear there is a double underscore for global
i lost a lot of time only using one on each side
Hi @hendersonmotorcycles, and thanks for your comment. Correct, __global__, like many C++ extension keywords, uses double underscores.
Copying and pasting the code from the post rather than typing it is likely to result in fewer typos. Or use an IDE like vscode with a CUDA C++ extension (or a CUDA C++ extension for another editor) and it will recognize the misspelled keyword and highlight it for you. (Hereās another one for VIM, for example: GitHub - bfrg/vim-cuda-syntax: CUDA syntax highlighting for Vim)
Cheers,
Mark
There seems to be a chunk of text missing from the article.
Iām using MSVC to compile the first piece of code without any CUDA material. All works fine.
I modified the code to run CUDA and VS highlighted all the CUDA code as syntax errors as I expected.
I have 20 errors, all relating to the changes required to run CUDA.
Clearly this is a toolchain problem, not made clear in the article.
Iām running VS2022 and have installed NSight extension.
Iāve installed NVIDIA CUDA Development 11.7, drivers, etc.
Has anyone a clue what to do next?
Sorry for the delay, @nvidia1356. I just got a recommendation that the following forum would be a good place for you to ask for help with setting up CUDA on Windows:
[CUDA Setup and Installation - NVIDIA Developer Forums]
Hope it helps!
Thanks for the material!. A small comment
N = 1<<20 (bin) or 1048576(int)
1048576 / 256 = 4096
Therefore if the author aimed to use 4096 blocks as indicated in the figure, the addition of one block decreased by one unit (+ blockSize - 1) is not necessary. Please correct me if Iām wrong.
As I mentioned in the post:
I simply divide
Nby the block size (being careful to round up in caseNis not a multiple ofblockSize).
:)
If anyone with a newer GPU runs the profiler command as mentioned in the tutorial like this:
nvprof ./add_cuda
And you get this warning:
======== Warning: nvprof is not supported on devices with compute capability 8.0 and higher.
Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
Refer https://developer.nvidia.com/tools-overview for more details.
To resolve this, instead use the new profiler tool, like this:
nsys profile --stats=true ./add_cuda