That was a misunderstanding on my part. I must not work on Sunday evenings :)
Sorry for the inconvenience.
I don’t think that web article makes any claims that the code presented should run faster on the GPU. I don’t see any CPU performance measurements.
To discuss performance-related questions or comparisons, it’s usually important to provide:
- the operating system you are using
- how the code was compiled (what was the compile command line or project configuration e.g. release vs. debug)
- which CUDA version you are using
Nevertheless, some explanation can be offered:
The original article uses Unified Memory in a pre-Pascal regime (read the unified memory section of the programming guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd ).
This means that the data transfer is performed en masse at kernel launch time, and there are no GPU page faults to slow the kernel down. In your case, you are (I would guess) running on Linux in a Pascal-type unified memory regime, which uses page faulting for data transfer. This is slowing your kernel down dramatically.
You can mitigate these effects with prefetching, as discussed here:
This activity should bring your measured kernel performance more closely in line with the original web article. (The original article reported a time of about 0.5s for the kernel with <<<1,1>>> configuration as you indicate in your post.)
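As a rough sketch of what that prefetching looks like (assuming the usual managed-memory add kernel from that style of article; the names and sizes here are hypothetical, not the exact code from your post):

```cuda
#include <cuda_runtime.h>

// Simple grid-stride-free kernel, launched <<<1,1>>> as in the article.
__global__ void add(int n, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main() {
  int N = 1 << 20;
  float *x, *y;
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  // Prefetch the managed allocations to the GPU before the kernel launch,
  // so the kernel does not incur demand page faults on a Pascal+/Linux setup.
  int device = 0;
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(x, N * sizeof(float), device, 0);
  cudaMemPrefetchAsync(y, N * sizeof(float), device, 0);

  add<<<1, 1>>>(N, x, y);
  cudaDeviceSynchronize();

  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

With the prefetch in place, the timed kernel should no longer include page-fault servicing, which is what makes the measurement comparable to the pre-Pascal behavior in the article.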
After that, if you proceed with the suggestion in the comment below, you should be able to follow along with the subsequent code modifications and comparisons.
Read the introduction through to the end - so far you are only halfway through.
Hi Robert_Crovella & tera
Thanks for your replies.
@tera : I got it.
Thanks a lot