Accelerating Python on GPUs with nvc++ and Cython

Originally published at: https://developer.nvidia.com/blog/accelerating-python-on-gpus-with-nvc-and-cython/

The C++ standard library contains a rich collection of containers, iterators, and algorithms that can be composed to produce elegant solutions to complex problems. Most importantly, they are fast, making C++ an attractive choice for writing highly performant code. NVIDIA recently introduced stdpar: a way to automatically accelerate the execution of C++ standard library algorithms…

Nice post – it’s great to see access to the GPU from Python/Cython.

I’m trying to use the jacobi_solver example in the post as a starting point for writing a GPU-accelerated function that performs convolution using the for_each algorithm from the C++ standard library.

I don’t want to reinvent the wheel and was wondering if basic convolution of a kernel with a 2D array (e.g., as two 1D convolutions) might already be implemented on the GPU in a way similar to the jacobi_solver example.

Thanks!

Thanks for your message!

What you’re proposing sounds very similar to the Jacobi example. You would need to write a functor (similar to avg) that encapsulates your kernel (see the sketch after the stencil below).

In fact, if I’m understanding correctly, couldn’t the Jacobi solver be thought of as repeatedly applying the following kernel?

0    1/4  0
1/4  0    1/4
0    1/4  0
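
For concreteness, here is a rough sketch of what such a functor could look like, following the same std::for_each-over-indices pattern as the post’s avg functor. The names (conv3x3, apply_kernel), the flat-index layout, and the boundary handling are illustrative assumptions, not code from the post:

```cpp
// Illustrative sketch: a functor that applies a 3x3 kernel centered on each
// interior point of a 2D grid, launched with std::for_each over flat indices.
// Compiled with nvc++ -stdpar, the parallel algorithm is offloaded to the GPU.
#include <algorithm>
#include <array>
#include <execution>
#include <numeric>
#include <vector>

struct conv3x3 {
    const float* in;   // input grid (row-major, ny rows x nx columns)
    float*       out;  // output grid
    int nx, ny;
    const float* k;    // 3x3 kernel, row-major

    void operator()(int idx) const {
        const int i = idx / nx, j = idx % nx;
        if (i == 0 || j == 0 || i == ny - 1 || j == nx - 1) {
            out[idx] = in[idx];  // leave the boundary unchanged
            return;
        }
        float acc = 0.0f;
        for (int di = -1; di <= 1; ++di)
            for (int dj = -1; dj <= 1; ++dj)
                acc += k[(di + 1) * 3 + (dj + 1)] * in[(i + di) * nx + (j + dj)];
        out[idx] = acc;
    }
};

void apply_kernel(const std::vector<float>& in, std::vector<float>& out,
                  int nx, int ny, const std::array<float, 9>& kernel) {
    std::vector<int> idx(nx * ny);
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., nx*ny - 1
    std::for_each(std::execution::par_unseq, idx.begin(), idx.end(),
                  conv3x3{in.data(), out.data(), nx, ny, kernel.data()});
}
```

With the stencil above, the kernel array would be {0, 0.25, 0, 0.25, 0, 0.25, 0, 0.25, 0}, and the Jacobi solver amounts to applying it repeatedly (swapping in and out between iterations) until convergence. A separable kernel could instead be applied as two 1D passes with one functor per direction.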

@ashwint - Thanks for the reply – that clarifies it a lot.

In trying to write the code, I’ve noticed the NVIDIA HPC SDK doesn’t appear to be available for Windows yet. Any idea when it will be available on Windows?

Thanks!

Hi! Thanks again for your interest. We plan to have Windows support for the HPC SDK later this year.

Hi @ashwint, thanks a lot for this great post! I have a quick question regarding the figure showing the speedup over NumPy sort: why does serial CPU processing do better at smaller sample sizes? I understand that the GPU’s parallelism can’t be exploited fully at small sample sizes, but what creates the overhead?

Thanks, @boehmvanessa. Likely, it’s the cost of transferring data from the host (CPU) to the device (GPU) and back.
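
If you want to see where that overhead stops dominating, a quick timing loop over a few array sizes makes the crossover visible. This is a minimal sketch, assuming the Cython-wrapped GPU sort from the post is importable as cppsort.cppsort; the exact module and function names may differ in your build.

```python
# Rough timing sketch; cppsort is assumed to be the Cython extension built in
# the post. Adjust the import (and the call) to match your own wrapper.
import numpy as np
from timeit import timeit
from cppsort import cppsort  # hypothetical module/function names

for n in (10**3, 10**5, 10**7):
    x = np.random.rand(n)
    t_cpu = timeit(lambda: np.sort(x), number=10) / 10
    t_gpu = timeit(lambda: cppsort(x), number=10) / 10
    print(f"n={n:>9}  numpy: {t_cpu:.6f}s  gpu: {t_gpu:.6f}s  ratio: {t_cpu / t_gpu:.2f}x")
```

At small n the fixed per-call cost (host-device copies, kernel launch) dominates, so NumPy’s in-memory sort wins; only at larger sizes does the parallel sort amortize that overhead.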