The code does appear to be working, but only when I switch from the default OS X drivers to the NVIDIA Web Drivers, and I have to execute the code as root or it returns a "no CUDA-compatible device detected" error. When I'm using the NVIDIA web drivers, the CUDA driver says it is up to date, but when I'm using the OS X driver, the CUDA driver says it needs to be updated. It's very weird behavior, I know. To add to that, when the NVIDIA web drivers are enabled, my screen sometimes goes blank during more graphically intensive computing; I think this happens when it switches from the integrated graphics card to the NVIDIA card. Thank you for the help, and hopefully you can shed some light on this.
Very gratifying post until I got to the last example. My numbers are 198ms, 4.03ms, and 3.05ms, respectively. I have a "GeForce GTX 1050 Ti (0)" - are these numbers sane for the difference in hardware? And my run on a Pascal "Quadro P5000" is not much better: 188ms, 2.9ms, 2.3ms.
Both of those are Pascal GPUs, so you are effectively taking on exercise #4 in the post. :) This has been discussed in these comments elsewhere, but as Exercise #4 suggests, you should read my follow up post "Unified Memory for CUDA Beginners" to understand and improve the performance you are seeing. https://devblogs.nvidia.com...
Here's the followup: "Unified Memory for CUDA Beginners": https://devblogs.nvidia.com...
You're a god. That CUDA-based file indexer+searcher+browser+parser has just made Windows Explorer search obsolete for me, and I have long longed for this moment. Thank you.
Thanks for the comment.
Please keep in mind to refresh the index every so often, as the program searches on the basis of a snapshot taken of the hard drive.
In Fig. 1, gridDim.x should be 16.
I have an NVIDIA Shield Tablet (NVIDIA Tegra K1), and I want to program CUDA on Android. Could you help me?
Actually 4096 as used in the figure is correct, since blockIdx.x ranges from 0 to 4095. Thanks!
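For readers following the exchange above, the relationship being discussed falls out of the index arithmetic in the post's kernel (a sketch using the post's launch parameters, N = 1&lt;&lt;20 and blockSize = 256):

```cuda
// With the post's launch add<<<4096, 256>>>(N, x, y) for N = 1<<20:
// gridDim.x == 4096 and blockIdx.x runs from 0 through 4095, so the last
// thread of the last block gets index 4095*256 + 255 == N - 1.
int index = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. 1048575
```

So gridDim.x is the total number of blocks (4096), while blockIdx.x is the zero-based index of the current block (0 to 4095).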
Hi Mark,
Great blog post! I had a bit of trouble at 2 stages of your post:
Running with a GTX 1080:
* Tried profiling, got "no kernels were profiled"
* Searched nvidia forums for what's causing this, found that adding the '--unified-memory-profiling off' flag to nvprof fixes this.
* Adding many cuda blocks gives me no speedup??
* Found your blog post on unified memory, added code to preload memory into the GPU.
* Removed '--unified-memory-profiling off' flag
If this is meant to happen, I would suggest adding a little comment about newer GPU models earlier in this blog post and mentioning the '--unified-memory-profiling' flag. It looks like you wanted us to find this out by trying it, which now I'm glad I have, but I spent perhaps a little too long with some of these issues.
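In case it helps other readers who hit the same wall: the "preload memory into the GPU" fix from the unified-memory follow-up post can be sketched as below, assuming the `x`, `y`, `N`, `numBlocks`, and `blockSize` names from this post. On Pascal-class cards like the GTX 1080, prefetching the managed pages before the launch avoids paying page-fault costs inside the kernel.

```cuda
// Prefetch the unified-memory arrays to the current GPU before launching,
// so the kernel does not fault pages in on demand (Pascal and newer).
int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, N * sizeof(float), device, NULL);
cudaMemPrefetchAsync(y, N * sizeof(float), device, NULL);

add<<<numBlocks, blockSize>>>(N, x, y);  // kernel from the post
cudaDeviceSynchronize();
```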
Hi Mark,
Thank you very much for this tutorial. I followed it and got the same results on a K80. It would be great to learn how the same code could take advantage of the second of the two GPUs on the K80, as you allude to at the end of the article. Running a direct comparison of your K80 example against the same (nonparallel) clang++ compilation on a MacBook Pro 2.8 GHz Intel Core i7 (mid 2015 model), a single processor is able to do the same in about 700 us, so "only" about 7x slower. It might be about even if we compared against the 8 logical processors available on the laptop with an MPI version. So it seems to me that the GPU, even with both engaged on the K80, is only about 2x faster than a single MacBook? How can that be?
Thanks,
Adam S.
Thanks for the great intro article! I tried the steps in the article, but when I try to run the compiled add_cuda app, I get "segmentation fault". I'm running on a MacBook Pro (Mid 2014) with an NVIDIA GeForce GT 750M. What am I doing wrong? How can I debug this?
EDIT: When I run the code at https://devblogs.nvidia.com... , it seems there are zero CUDA-capable devices, even though the GT 750M is supposed to support it (according to https://www.geforce.com/har... ). Fortunately, I happen to have a Jetson Tx2 lying around. Can I just follow this same tutorial on the Jetson directly, or do I need to cross-compile somehow?
It'd be clearer if you mentioned how many times the for loop gets called per thread.
Can someone tell me how to change the values of gridDim and blockDim from 1 to any other value, and from 0 in the case of threadIdx.y or .z and blockIdx.y or .z?
Also, in my case there is no change in the execution time after modifying the code for multiple blocks. Can somebody tell me how to implement it?
Thanks.
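For readers with the same question, a sketch (using the kernel and variable names from the post): gridDim and blockDim are read-only inside a kernel and are set by the execution configuration of the launch, not assigned directly.

```cuda
// One-dimensional launch: gridDim.x == 4096, blockDim.x == 256.
add<<<4096, 256>>>(N, x, y);

// To give the .y and .z components a range beyond 0, launch with dim3 values.
dim3 numBlocks(32, 32, 1);   // gridDim  = (32, 32, 1); blockIdx.y in 0..31
dim3 blockSize(16, 16, 1);   // blockDim = (16, 16, 1); threadIdx.y in 0..15
add<<<numBlocks, blockSize>>>(N, x, y);
```

As for seeing no speedup from multiple blocks: the kernel only benefits if its index arithmetic actually uses blockIdx.x (for example, the `blockIdx.x * blockDim.x + threadIdx.x` grid-stride pattern from the post); and on Pascal and newer GPUs, the unified-memory page-faulting discussed elsewhere in these comments can also mask the speedup.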
This developer blog is interesting and informative; however, this introductory blog does not explain that, when using unified memory, GPU performance may be improved significantly by prefetching the unified-memory data to the GPU, as explained in the "Unified Memory for CUDA Beginners" developer blog (https://devblogs.nvidia.com...) that is mentioned in association with Exercise 4 of this developer blog.
I have measured GPU performance via a test case that executes the first step of a "medians of medians" algorithm. The host generates an array of 1 billion 32-bit floating-point numbers. The device treats the array as 200 million contiguous 5-element sub-arrays and sorts the sub-arrays via insertion sort. There are two variants to the test: one that doesn't use unified memory and one that uses unified memory. The hardware configuration is four Tesla V100 GPUs, each with 16 GB memory. Only one GPU performs the insertion sort.
For the variant that doesn't use unified memory, the host creates a 1-billion-element array in host memory, copies that array to device memory, executes a kernel to perform insertion sort (in parallel) of all contiguous 5-element sub-arrays, and copies the result back to host memory. Below are the execution times in milliseconds (all time measurements are obtained via the cudaEventRecord function, with the exception of the initialization of the 1-billion-element array where the time is measured via the clock_gettime function):
Initialize array 3364 ms
Host to device copy 472 ms
Insertion sort kernel 36 ms
Device to host copy 330 ms
Total device time 838 ms (sum of host-to-device-copy, insertion-sort-kernel, and device-to-host-copy times)
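For readers unfamiliar with the measurement method, the cudaEventRecord pattern referenced above looks roughly like this (`insertionSortKernel` and its arguments are illustrative names, not the commenter's actual code):

```cuda
// Time a kernel launch with CUDA events; elapsed time is in milliseconds.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
insertionSortKernel<<<numBlocks, blockSize>>>(data, n);  // hypothetical kernel
cudaEventRecord(stop);

cudaEventSynchronize(stop);               // wait until the stop event completes
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // GPU-side elapsed time in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);
```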
For the variant that uses unified memory, the host creates a 1-billion-element array in unified memory and then executes a kernel to perform insertion sort (in parallel) on all contiguous 5-element sub-arrays. Below are the execution times in milliseconds (array initialization time is measured via the clock_gettime function and the kernel execution time is measured via the cudaEventRecord function):
Initialize array 4478 ms
Insertion sort kernel 1121 ms
From the above data, it is apparent that kernel execution is 1121/36=31 times slower for unified memory.
Note also that the "insertion sort kernel" time for the unified-memory variant (1121 ms) is 1.34x the "total device time" for the non-unified-memory variant (838 ms). However, this comparison neglects the additional time (4478-3364=1114 ms) required to initialize the unified-memory array relative to initializing the array in host memory. So, including this additional time demonstrates that the execution time for the unified-memory variant (1114+1121=2235 ms) is 2.7x the "total device time" (838 ms) that is achieved by the non-unified-memory variant.
And, for completeness, I modified the non-unified-memory variant so that it copies the 1-billion-element array into the memory of GPU 0 and then executes the insertion-sort kernel on GPU 1, in order to compare the cost of accessing memory on another GPU to the cost of accessing unified memory. In this case, the "insertion-sort-kernel" time increases from 36 ms to 471 ms, or by a factor of 13x, compared to the 31x increase in "insertion-sort-kernel" time for the unified-memory case.
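A cross-GPU setup like the one described can be sketched as follows (the names are illustrative, not the commenter's code): the data is allocated and filled on GPU 0, and the kernel is launched on GPU 1 with peer access enabled so it can read GPU 0's memory over the interconnect.

```cuda
// Allocate and populate the array on GPU 0.
float *d_data = NULL;
cudaSetDevice(0);
cudaMalloc(&d_data, bytes);
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

// Launch the kernel on GPU 1, reading GPU 0's memory via peer access.
cudaSetDevice(1);
cudaDeviceEnablePeerAccess(0, 0);  // grant GPU 1 access to GPU 0's memory
insertionSortKernel<<<numBlocks, blockSize>>>(d_data, n);  // hypothetical kernel
cudaDeviceSynchronize();
```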
So, in order to improve GPU performance, I modified the unified-memory variant so that it prefetches the unified-memory pages to the GPU prior to execution of the insertion-sort kernel. Below are the execution times in milliseconds:
Initialize array 4477 ms
Prefetch pages 377 ms
Insertion sort kernel 36 ms
Total device time 413 ms (sum of prefetch-pages and insertion-sort-kernel times)
The above data demonstrate that when prefetching is used, the 'total device time' for the unified-memory variant (413 ms) is significantly less than (i.e., about half of) the 'total device time' for the non-unified-memory variant (838 ms). This analysis neglects the additional time required to initialize the unified-memory array relative to initializing the array in host memory (4477-3364=1113 ms). However, perhaps the ease of using unified memory to avoid explicitly copying data to and from the GPU justifies this additional initialization cost.
This is very helpful. Thanks
Mark, I'm pleased with the results; however, how are the maximum block size and number of blocks related to my graphics card's specifics? For example, if I have a GTX 970 with 1664 CUDA cores, what do those numbers mean for it? I found out that if blockSize > 1024, the output is not correct. If numBlocks is too large, the program crashes. I want to avoid using numBlocks = (N + blockSize - 1) / blockSize, as N is sometimes very large (that's why I'm learning this).
I can't find a simple and clear explanation of numBlocks and blockSize. The bigger the better? Ideal values? etc.
Are blockIdx.y and blockIdx.z meant for nested loops (for example, triple integrals)?
Will you do a follow-up on reductions? What if I want to sum something (as in an integral)?
I got this to work on a K80, but for some reason my laptop running an NVIDIA 940MX returns a segmentation fault.
This code crashes on a 920M (GK208), and the system needs to be restarted.
Although cudaMalloc followed by cudaMemcpy() works.
* windows 10
* msvc 2015 / Release 64-bit
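For anyone else hitting this on older GPUs, the explicit-memory path the commenter says works can be sketched like this, using the `add` kernel and names from the post (the kernel then receives the device pointers instead of managed ones):

```cuda
// Explicit allocation and copies instead of cudaMallocManaged: sidesteps
// unified-memory issues on older mobile parts like the 920M.
float *d_x = NULL, *d_y = NULL;
cudaMalloc(&d_x, N * sizeof(float));
cudaMalloc(&d_y, N * sizeof(float));
cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);

add<<<numBlocks, blockSize>>>(N, d_x, d_y);  // kernel from the post
cudaDeviceSynchronize();

cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);
```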