What can't you do in CUDA that you'd like? Requests for the future

I recently closed the old feature request thread because it wasn’t really useful to us as feedback. So, here’s a new one, but let’s do things differently:

Instead of saying “I want this feature” or “I want that feature,” don’t tell us the feature you want. Instead, tell us the end result you want to accomplish (faster X, ability to do Y within a single kernel, etc.), and let us figure out how to implement it. There are a lot of tricks we can do, so the exact feature you want may not be the best thing. However, we’re not (all) psychic, so unless we know what you’re trying to accomplish, we may not know if another feature that we could implement would solve your problem.

So, examples:

BAD: “I want to arbitrarily pin memory allocated by something else!”
GOOD: “I want a memory region where I can DMA both to my CUDA device and my InfiniBand card!”

I realize this is probably different than you’re used to, but bear with us. Hopefully, this will be more useful going forward.

I want malloc/free in kernel code. It would make linked list and other dynamic data structures possible.

I have been trying to implement a lock-free memory allocator for this purpose. But it really takes time to make it work. Moreover, during the implementation, I have encountered several bugs in NVCC. The compiler seem to dislike complex programs like this.

Please try to get rid of the requirement that host memory must be allocated by cudaMallocHost() in order to use streaming, async. transfers, fast transfers.

Imagine you have 100k+ lines of Fortran code and just to get decent PCIe rates, you have to rewrite the entire storage backend in C linking against CUDA.

I have doubts this is doable, though.

What about arbitrary block size? i.e, the ability to define something like this:
dim6 dimGrid( a, b, c, d, e, f );
kernel<< dimGrid, 256 >>>(. … )

Currently I have to do all sorts of calculation in the host and kernel to figure out what the block indexes are, like:
dim3 dimGrid( a * b * c, d * e * f );
kernel<<< dimGrid, 256 >>>( … )

and in the kernel all sorts of div and mod operations which are both expensive and error prone.

BTW- what about hardware requests ??? :) :"> I’d kill for more shared memory and a few more registers … ;)


Yeah, it’s not. If you have the ability to arbitrarily pin memory, you also give users an easy way to dramatically impact system stability. It’s never going to happen.

I’d like to be able to efficiently implement IIR filter structures on Cuda devices.

How many filter taps? Real or complex numbers? How many in parallel? And what’s currently preventing you from implementing an efficient IIR filter kernel?

Are you asking for some kind of library provided by nVidia implementing arbitrary IIR filters, similarly to what CUFFT does for Fourier transforms?


How about if registers spilled to shared memory first, and only if there wasn’t enough would they fall back to local memory? Possibly this should be left as an option to the programmer, but there are kernels that use many registers and little shared memory, there’s room for automated use of smem as extra registers (something I’ve seen done by hand).

You guys provided so many things I was looking for in CUDA 2.2, that I’m not sure what I would want next.

Here are a few:

  1. Faster semi-random memory reads (specifically, faster than tex1Dfetch is currently).
  2. Ability to perform a floating point sum reduction in a single kernel
  3. Ability to recover from CUDA flagged errors and continue program execution without shutting down and restarting the process.

Personally I am interested in fairly short real numbered IIRs (often biquads but perhaps longer sometimes) as I concentrate on audio, but there should be no need to be too specific really.

I believe what is stopping me implementing this efficiently on Cuda is that I think I can only write a kernel that is appropriate for use by a single block at a time (on a given data array) due to the recursive/feedback elements in the filter. So I say it is inefficient since I am unable to take advantage of much of the parallelism on offer, so as devices get more and more processors they would not be used in areas of sequential code.

The solution to this might be smarter algorithm design on my part (which others could help with) or perhaps further support for sequential code from the architecture (maybe by identifying a region as serial and providing a MHz boost to an individual multiprocessor if others are known to be inactive or even adding in hardware support for some common types of algorithms (like happens with graphics) if sufficient demand from market segments was there). The idea of using Cuda as a replacement for traditional DSP cards appeals to me, but I might be wandering too far off topic, or I could just be talking garbage here…

I think the most important thing for nVidia to release ASAP is an updated CUFFT and CUBLAS library + source. We’ve been promised updated versions with vvolkov’s kernels for a long time now, and they’ve been held up for whatever reason. I think that these libraries provide a quick way for a novice CUDA developer to add some GPU functionality to their programs without a lot of work, which may get them ‘hooked’ on it, and thus be more likely to start doing real kernel coding.

I would also like to see you guys add performance monitors and event logging to the driver, so that we can view the current state of the GPU in the Resource Monitor / Performance Monitor on Windows. I don’t think it would be terribly difficult to do…I’ve done a little with this API before, and the newest version for Vista/Server 2008/Windows 7 is supposedly much simpler to use. See this recent post of mine for more info: http://forums.nvidia.com/index.php?s=&…st&p=529104

Also, this has come up a million times before, but you guys need to add some kind of GUI into the nVidia control panel to deal with the monitor situation on Vista. Even if it’s just one/a few simple registry settings, it’s not something people are going to figure out without coming to this forum and searching around to find the answer. It’s much faster and easier when someone can just pop open the control panel, check a couple of boxes (or whatever) and be done with it.

Two high level requests:

  1. I would like CUDA programming to be easier. I know, that’s a little too high level. :) I recently posted a message on the Wishlist thread describing some specific changes that would make it easier. See http://forums.nvidia.com/index.php?s=&…st&p=526791. But generally, it currently requires the user to do a lot of things by hand that a smart compiler really ought to be able to do automatically.

  2. I want my programs to run faster. :) But to give a very specific case, I have an algorithm that requires a large number of kernel calls, since it periodically requires synchronization across all threads in all blocks. As a result, the latency of kernel execution is destroying my performance. The CPU time is almost 3x the GPU time. This could be addressed in a number of ways, such as 1) a version of __synchthreads() that synchronizes across all blocks, not just within one block, 2) decreasing the latency of kernel invocation, or 3) allowing me to submit several kernel calls at once, so that they get queued up on the GPU and executed with very little delay between them.


Peter, you can do number 3 now. Kernel launches are asynchronous, and there’s no need for explicit synchronization between them.

Also, everyone, please pay attention to how I want feedback in the first post.

How? I’m not doing any explicit synchronization between kernel launches, but still when I run cudaprof I see 5 to 15 us CPU time overhead for every kernel invocation.


As MisterAnderson said, “faster semi-random memory reads (specifically, faster than tex1Dfetch is currently)” is the thing I’d like to have. It is not always possible to simplify the algorithm enough for making kernels that may utilize coalesced reads and writes. Texture fetches do speed things up, but not dramatically.

Non-2.2 cudaprof blocks on kernel execution, I think, so that might be artificially hurting you in the profiler.

Full control over batching is probably something we should investigate, though…

Another thing…I think an awesome, awesome tool would be to incorporate the visual debugger into Visual Studio (as a plugin or something…some other 3rd-party development tools I have do this). Also, adding debugging in device mode to Visual Studio would be cool too, since there are plenty of bugs that the emulator won’t replicate; it would be important to be able to step through the code as well.

Microsoft has got some new tools for debugging multithreaded programs…I don’t know if they are just free standalone programs or if they are going to be integrated with Visual Studio 2010, but if you could make something similar (or make your debugging compatible with their methods)…that would be a huge help.

I would like performance/profiling tools that ‘clearly’ identify performance bottlenecks, and card resources used.

The visual profiler is nice for knowing if you have bank conflicts or aren’t coalescing memory transactions like you thought you were - but it’s not always (and in some cases never) clear what exactly is slowing down performance on your card - and even worse, you can’t really profile real time systems reliably - or efficiently.

Something like task manager for CUDA would also be fantastic for gauging why your application is so slow on Card X, but not Card Y rather quickly - instead of having to run the visual profiler on 50+ kernels per card (I’ve literally had to spend days profiling kernels on 3-4 different cards before). All it really has to do is log over time (graph would be nice) the current kernel for each MP (if any), and memory transaction/bandwidth info - to quickly gauge if you’re memory or bandwidth limited, and which kernels are most limited by which - and which kernels are worse than others. We also have a requirement of making sure our CUDA kernels use as little GPU as possible (as we expect these kernels to run along side GPU intensive applications like games, thus our impact has to be minimal).

A method for reducing values across blocks (this is pretty major imo) - I realise ‘generic’ inter-block synchronization is hackish, and over-all a bad paradigm to follow in regards to CUDA (blocks are meant to run independently) - but reduction of values that come from multiple blocks ‘is’ something you want to do - and the fact it can’t be done right now means you have to launch a batch of kernels, sync to CPU, copy data to CPU, reduce on CPU, upload back to GPU, and run next batch of kernels that needed the reduced values - it’s excessive overhead - and it’s a very common case (especially in relation to sorting, and searching large data sets). The problem here however is the fact that all (let’s say 1000) blocks will never run at the same time, so how does one store the results of all the blocks, before the ones that will do the reduction finish? (without screwing up the scheduler by making tons of blocks ‘wait’ on blocks that may never get scheduled in)

Zero-overhead access to OpenGL resources - let’s be honest, all we need is a memory address, and an understanding of how to index the data at the address. I’m still not even remotely sure why I have to ‘register’ a texture to CUDA - so long as I can guarantee the texture/buffer is resident before executing the kernel - why do I have to ‘register’ a texture? Also, I can’t imagine locking a buffer/texture should take more than 50us in the case that there are no currently active opengl commands being processed… (the latter case where there’s a write command being processed on that resource - I can understand taking a while though). I’m not over-exaggerating when I say 700us-1.4ms out of ~10ms of my real time application are spent registering/locking OpenGL buffers - which is just unacceptable…

I want code reflection, in the runtime API. That way I can have the user input portions of the code for the kernals themselves. The program is a fractal renderer, so it’s important to be able to explore and try new equations ( = slightly different kernal code)

Umm…I actually might have something available for that within 2-4 weeks (if not sooner). I’ll keep you posted.