High-Performance GPU Computing in the Julia Programming Language

Originally published at: High-Performance GPU Computing in the Julia Programming Language | NVIDIA Technical Blog

Julia is a high-level programming language for mathematical computing that is as easy to use as Python, but as fast as C. The language has been created with performance in mind, and combines careful language design with a sophisticated LLVM-based compiler [Bezanson et al. 2017]. Julia is already well regarded for programming multicore CPUs and large parallel…

What about host-to-device memory transfers?

Just want to point out that there are a bunch of comments and some discussion about this on Hacker News.


Constructing the CuArray performs a host-to-device memory transfer, whereas converting it back to a regular Array fetches the memory back.
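To make that round trip concrete, here is a minimal sketch. It assumes the CUDA.jl package (the later successor to the CUDAdrv and CUDAnative packages discussed in this thread) and a CUDA-capable GPU; the variable names are only illustrative.

```julia
using CUDA  # assumption: CUDA.jl is installed and a CUDA-capable GPU is present

a = rand(Float32, 1024)   # regular Array living in host (CPU) memory

d_a = CuArray(a)          # allocates device memory and copies `a` into it (host-to-device)
d_b = d_a .+ 1f0          # the broadcast runs on the GPU; the result stays on the device

b = Array(d_b)            # copies the result back into host memory (device-to-host)

@assert b ≈ a .+ 1f0      # same values as the equivalent CPU computation
```

Both copies happen implicitly in the constructors, so it pays to keep data on the device between kernels and only convert back to an Array when the result is needed on the host.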

Is there support for asynchronous transfers? Multiple streams? Concurrent kernel execution and memory transfers?

Partially: streams are supported and can be used for kernel execution, but asynchronous transfers are not wrapped right now. They wouldn't be much work to add, though, and I'm currently redesigning the memory buffer interface, so I'll see about adding them: https://github.com/JuliaGPU...

If there are other missing features you'd like to use, don't hesitate to file an issue at CUDAdrv or CUDAnative.