I’ve been using CUDA for a little while now, and I think it is really great and very impressive. However, there are many small things one needs to know to implement things correctly, and many gotchas around shared memory, latency, and choosing the right number of threads per block and blocks per grid. I was wondering if the CUDA team had considered a high-level API along the lines of the OpenMP syntax, where the details of the threading are hidden from the user?

Coming from the world of OpenMP, I found it orders of magnitude easier to pick up than CUDA. Mastering OpenMP took a couple of days, whereas I feel like I could spend months on CUDA and still not have it truly mastered. I realize that an OpenMP-like syntax would probably not be as efficient as hand-tuned code, but the time saved coding would really spread its usage, IMO. I would love a 100x speedup in my app, but I would also be very happy with a 10x speedup if it saved me a month of learning to master CUDA.

Just a suggestion that might make things a bit easier. Thanks for reading.
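For concreteness, this is the kind of one-pragma parallelism I mean (a minimal sketch; the function and names are just illustrative). In OpenMP, a single directive parallelizes the loop, with the runtime choosing the thread count and splitting the iteration space; the equivalent CUDA version would need a `__global__` kernel, `cudaMalloc`/`cudaMemcpy` calls, and an explicit grid/block launch configuration:

```c
#include <stddef.h>

/* Element-wise vector add: the pragma turns the serial loop into
   a multithreaded one. Without OpenMP enabled, the pragma is
   ignored and the loop simply runs serially. */
void vector_add(size_t n, const float *a, const float *b, float *c)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        c[i] = a[i] + b[i];
}
```

Compiling with `-fopenmp` (gcc) enables the directive; the same source builds and runs correctly either way, which is part of what makes OpenMP so quick to adopt.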