I use PGI’s CUDA Fortran in my work and I can share a bit. First, I’ll readily admit I’m not much of a CUDA C programmer. I’ve done a bit with it (the usual matmul examples and all) and can read it, but I have been and always shall be a Fortran programmer at heart. That’s mainly because my graduate work, postdoctoral work, and current work all deal with codes ranging from FORTRAN 77 to Fortran 2003.
But, that said, I find programming in CUDA Fortran quite nice. If you come from a Fortran background and are used to arrays rather than pointers, and to the use of MODULEs, I think you’ll like it too. If nothing else, it reads easily. For example, say you have a device array dev_work(1024) and a host array host_work(1024): instead of calling cudaMemcpy(dev_work,host_work,1024), you can just say dev_work=host_work. The same goes for moving values to constant memory. (Note, though, that you can still use the cudaMemcpy-style API calls if you want. Most of them are supported, like, say, cudaMemcpy2DAsync.)
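To make that concrete, here’s a minimal sketch of what those assignment-style copies look like (the array names here are just illustrative, and you’d compile this with pgfortran):

```fortran
program copy_demo
  use cudafor
  implicit none
  real, dimension(1024)         :: host_work
  real, device, dimension(1024) :: dev_work

  host_work = 1.0
  dev_work  = host_work   ! host -> device copy, no cudaMemcpy call needed
  host_work = dev_work    ! device -> host copy, same assignment syntax
end program copy_demo
```

The compiler sees the device attribute on dev_work and generates the underlying memory transfer for you.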
Likewise, allocating memory on the device is as simple as using allocate() instead of cudaMalloc and deallocate() rather than cudaFree. And depending on how you declared your host variable, when you allocate on the host, it’ll allocate pinned/page-locked memory for you as well:
real, pinned, allocatable, dimension(:) :: host_work
is allocated as pinned. (Again, though, if you need/want to use cudaMallocPitch, you can!)
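Here’s a small sketch putting the allocation pieces together (again, the names are just illustrative):

```fortran
program alloc_demo
  use cudafor
  implicit none
  real, device, allocatable, dimension(:) :: dev_work
  real, pinned, allocatable, dimension(:) :: host_work

  allocate(dev_work(1024))    ! device memory, takes the place of cudaMalloc
  allocate(host_work(1024))   ! page-locked host memory, thanks to the pinned attribute

  host_work = 0.0
  dev_work  = host_work       ! fast transfer from pinned host memory

  deallocate(dev_work)        ! takes the place of cudaFree
  deallocate(host_work)
end program alloc_demo
```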
Indeed, most of what you want to do in CUDA C, you can do in CUDA Fortran. A few things like Textures haven’t been implemented yet, but most of the language is there, and if you find something that isn’t, PGI is usually pretty good at getting it implemented in a future release.
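For a sense of what writing a kernel looks like, here’s a sketch of a simple saxpy in CUDA Fortran (my own example, not from any particular code): the kernel is a module subroutine marked attributes(global), and you launch it with the same chevron syntax CUDA C uses.

```fortran
module kernels
contains
  attributes(global) subroutine saxpy(n, a, x, y)
    integer, value :: n
    real,    value :: a
    real :: x(n), y(n)
    integer :: i
    ! One thread per element; Fortran arrays are 1-based, hence the -1/+1
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) y(i) = a * x(i) + y(i)
  end subroutine saxpy
end module kernels

program launch_demo
  use cudafor
  use kernels
  implicit none
  integer, parameter :: n = 4096
  real, device :: x_d(n), y_d(n)
  real :: x(n), y(n)

  x = 1.0; y = 2.0
  x_d = x; y_d = y
  call saxpy<<<(n+255)/256, 256>>>(n, 2.0, x_d, y_d)
  y = y_d   ! the copy back waits for the kernel to finish
end program launch_demo
```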
I’ll also say that I’ve done some programming with PGI’s Accelerator pragmas (think OpenMP-style pragmas, but for GPUs), and have found them to be quite good. In recent releases, the Accelerator model can identify things like reductions and generate correct code for them, operations which are a bit tricky to program by hand in CUDA C/Fortran.
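As a rough sketch of what I mean, a sum reduction under the Accelerator model looks like an ordinary loop inside a directive region, and the compiler works out the parallel reduction itself (example mine):

```fortran
program reduce_demo
  implicit none
  integer, parameter :: n = 100000
  real :: a(n), total
  integer :: i

  a = 1.0
  total = 0.0
!$acc region
  ! The compiler recognizes this as a sum reduction on total and
  ! generates the GPU reduction code, no hand-written kernel needed
  do i = 1, n
     total = total + a(i)
  end do
!$acc end region
  print *, total
end program reduce_demo
```

Compare that with writing the equivalent shared-memory tree reduction by hand in a CUDA kernel, and the appeal is obvious.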
For more on what programming with PGI compilers and GPUs can look like, I recommend reading some of the PGI Insider articles. For example, this article on porting WRF shows that the pragma model can essentially match hand-tuned CUDA C code, with what I’m sure was much less effort, and that was with a compiler release almost a year old.