Suggestion for CUDA


I’ve been using CUDA a little bit now, and I think it is really great and very impressive. However, there are many small things that one needs to know to implement things correctly, and there are many “gotchas” with shared memory, latency, the correct number of threads and grids, etc. I was wondering if the CUDA team had considered a high-level API along the lines of the OpenMP syntax, where the details of the threading are hidden from the user? Coming from the world of OpenMP, I found it orders of magnitude easier to implement than CUDA. Mastering OpenMP took a couple of days, and I feel like I could spend months on CUDA and not have it truly mastered. I realize that an OpenMP-like syntax would probably not be as efficient as doing it by hand, but the amount of coding time saved would really spread its usage, IMO. I would love a 100x speedup in my app, but I would also be very happy with a 10x speedup if it saved me a month of learning to master CUDA. Just a suggestion that might make things a bit easier. Thanks for reading.

Are you speaking of OpenMP code directly on the GPU, or even code that would be executed from the host, thus dealing with data movement as well?

There is some research on such topics, but we are not really there yet. If you find it hard to write code by hand, imagine that it’s even harder to generate it automatically :) Work needs to be done at both the runtime and compiler level.

If you really have trouble with the gory details, perhaps you should have a look at higher-level languages such as RapidMind, for instance. There should be other similar solutions, but I guess this all depends on the kind of application you are targeting.

However, understanding low-level behaviour makes it easier to design applications that exhibit decent performance. With HPF, for instance, changing a single directive could make the entire program’s performance collapse; without understanding the actual lower-level parallel programming and the communication schemes involved, such a program would be virtually impossible to debug and maintain. So my personal point of view is that it would really be worth spending time learning CUDA (or anything similar) rather than just being the slave of some higher-level solution.

just my 2 cents,


I have to say that I agree somewhat with both of you.

I think that there is a point in not wanting to be concerned with all the little details of thread and grid sizes, latency, warps, and so on. Just make the code do whatever it is supposed to.
On the other hand, in the quest for milliseconds and performance, I think it is necessary to know those things if you want to tweak the application right.

I think that this could be achieved with good debugging tools (coming soon, right?) and a better Occupancy Calculator (the one offered is OK but could be much better). This could suggest settings to start with and maybe save some coding time, while still offering the ability to tweak.

The challenge with providing an OpenMP-like interface to the GPU is that the parallelism models are actually very different. OpenMP is fundamentally a task-oriented parallelism construct. This works well on SMP systems, since each CPU has an entirely independent program counter, but a coherent view of system memory. Task parallelism is also a natural extension for procedural languages, which also helps OpenMP to be minimally invasive.

CUDA is designed for data-parallelism, as that is the sort of parallelism GPU hardware is optimized for. There are not independent program counters for each thread, and the memory system is built for fast, long sequential reads. Designing an effective data-parallel algorithm sometimes requires thinking about the problem and structuring the data differently. (Certainly in simple cases, there are direct mappings from task parallelism to efficient data parallel algorithms.)

That said, automated extraction of data-parallelism from code is definitely possible. SSE is a data-parallel instruction set, and Intel has had auto-vectorization code in their compiler for a while. GCC got preliminary auto-vectorization in version 4.1, and I think it is now a standard optimization in 4.3. GPU analysis would be a little harder, since there is a high cost for moving data to and from the GPU over the PCI-Express bus. Minimizing these transfers requires higher-level code analysis, which becomes more challenging to automate. This is where OpenMP-style annotations could be valuable.

It would certainly be a fascinating research project, for any of you CS grad students out there. :)

I didn’t say it would be easy ;D

My main motivation for suggesting this is that I just don’t have time to learn all of this. I wish I did, but I have deadlines (sadly). Even if an OpenMP implementation wouldn’t let you squeeze out every last drop of performance, many projects can benefit from any speedup. For instance, say I have a loop:

#pragma omp parallel
#pragma omp for
for (int i = 0; i < smallnum; i++) {
    #pragma omp for
    for (int j = 0; j < reallybignum; j++) {
        // do stuff
    }
}
This structure is disallowed in OpenMP because a work-sharing `for` cannot be nested directly inside another one in the same parallel region, but it would obviously benefit from CUDA. Also, everything within an OpenMP pragma can be scoped as private (or, in the case of CUDA, shared). No playing with pointers at the programmer’s level, no splitting off into kernels. In reality, even OpenMP hasn’t quite nailed down all of the performance considerations: you still have to specify a schedule, which can have an impact (in my case) of up to 20%. However, this would be a smoother transition from the programmer’s view. I can only hope that NVIDIA would consider something like this.

@Seibert - the Intel compiler also makes a huge difference in scientific code. I got a speedup of 2x just by switching from MSVC. I think this problem would be easier than automatic vectorization: there, the compiler has to both figure out that a loop can be vectorized and that it’s beneficial. With this type of syntax you’re already letting it know that you’re parallelizing it, and if it’s not beneficial, well, that’s beyond the scope of the API.

Is Brook what you are thinking of?

diehard2, has your company considered using RapidMind?

RapidMind looks promising; however, their website is a bit bare. Thanks.