Parallelism On Multiple Blocks Seems Broken

There is an issue with that tutorial in connection with Unified Memory. The two GPUs for which performance is quoted in that tutorial (GT 740, Kepler K80) are both pre-Pascal GPUs and operate in the pre-Pascal UM regime. You can read more about it in the UM section of the programming guide. Specifically, this means that UM allocations are transferred en masse to the GPU at the point of kernel launch, so the kernel code exhibits no page-faulting activity.

On your Turing GPU, however, the UM regime is the post-Pascal regime, which allows demand-paged transfer of data to the GPU. This is great for flexibility, but it can have a negative performance impact on this code. You can "rectify" this issue by inserting the following lines of code immediately prior to the kernel launch:

```
cudaMemPrefetchAsync(x, N*sizeof(float), 0);  // prefetch x to device 0
cudaMemPrefetchAsync(y, N*sizeof(float), 0);  // prefetch y to device 0
```

This will transfer the data to the GPU prior to the kernel launch, so no page-faulting activity takes place during kernel execution. You should then witness execution times in the low tens of microseconds on your GPU in nvprof. You will also see differences in nvprof's reporting of data transfer and page-faulting activity.
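For context, here is a minimal sketch of how the prefetch calls fit into a complete program. This assumes the grid-stride `add` kernel and the `N = 1<<20` managed allocations from the tutorial being discussed; the `cudaGetDevice` call is one way to avoid hard-coding device 0:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Grid-stride vector add over managed memory (as in the tutorial).
__global__ void add(int n, float *x, float *y) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main() {
  int N = 1 << 20;
  float *x, *y;
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  // Prefetch both arrays to the GPU before launch, so the kernel
  // does not take demand-paging faults under the post-Pascal UM regime.
  int device = 0;
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(x, N * sizeof(float), device);
  cudaMemPrefetchAsync(y, N * sizeof(float), device);

  add<<<(N + 255) / 256, 256>>>(N, x, y);
  cudaDeviceSynchronize();

  printf("y[0] = %f\n", y[0]);  // expect 3.0
  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

Note that `cudaMemPrefetchAsync` is asynchronous with respect to the host, but it is issued into the same (default) stream as the kernel here, so the transfers complete before the kernel begins executing.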

You can read additional commentary here.
