Wishlist: Place your considered suggestions here

This topic is intended to be used for suggestions for improvements to CUDA. Before posting, please sit on your suggestion for a couple of weeks and read this article. Perhaps Nvidia engineers should also read it while they are in the business of defining the API. Before criticising someone else’s suggestion, sit on that for a day or two to avoid foot on keyboard and making yourself look like a goose as they have probably spent a lot longer than you thinking about it.

My first wishlist item is to change the type of global functions to functions returning a pointer-sized integer (currently long on 32 and 64 bit) and to provide a throw(n) that can be called from any thread in the grid to immediately abort the running kernel sequence and return n. Unfortunately Nvidia defined the PTX TRAP pseudo code not to take a value, and they don’t mention whether it aborts one kernel or a chain. Nvidia could reserve the first 100 or so numbers for more detailed launch failure codes. This data should be returned to the application through the device file (or a parallel device file) that controls the card, so that we can poll/select or wait on an event for completion. When there is no throw(), the code returned could be defined to be any one of the returns from any thread, just like writing the same global memory location, with zero the usual return for OK. If there is a call queued and the previous return was zero, chain the next kernel and suppress that zero return (though I see now this has a race condition: either something would have to be set in the original call to say to expect a chained kernel, or chains would always be submitted as 1 API call, which might be better all round and would improve performance).
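The chaining rule proposed above can be modelled on the host. This is only a sketch in plain C++, not a CUDA API: `KernelModel`, `run_chain` and `kFirstUserCode` are hypothetical names, and each "kernel" is an ordinary callable whose return value stands in for the value passed to the proposed throw(n).

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical host-side model of the proposed semantics; not a CUDA API.
// Each "kernel" returns a pointer-sized integer: 0 means OK, anything else
// is the value some thread passed to the proposed throw(n).
using KernelModel = std::function<long()>;

// Assumption for this sketch: codes below 100 are reserved for launch
// failures, following the post's "first 100 or so numbers".
constexpr long kFirstUserCode = 100;

// Run the kernels in order as one submitted chain. The first non-zero
// return aborts the chain and is handed back to the caller; intermediate
// zero returns are suppressed.
long run_chain(const std::vector<KernelModel>& chain) {
    for (const KernelModel& kernel : chain) {
        assert(kernel);  // an empty std::function would be a caller error
        const long code = kernel();
        if (code != 0) {
            return code;  // the remaining kernels are never launched
        }
    }
    return 0;  // every kernel in the chain completed normally
}
```

Submitting the whole chain in one call, as in `run_chain`, sidesteps the race mentioned above because the runtime knows up front whether another kernel follows.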

I implemented such a facility in my emulator and found it very useful for assertions and for signalling running out of memory on the device within a device allocator (shared and global). I expect something like this will be required as more complex kernels and applications are written. While there are so many bugs in the toolchain and it is changing quickly, an effective assertion facility on the device is mandatory (IMHO). When publishing something like the MT library, one would like to assert during initialisation that it has not been called with the wrong thread/block configuration.

The other one is that they forgot to put the number of multiprocessors in the device parameters calls…

(This was written months ago and still seems relevant)
ed: extend 1st para

My wishes ( most won’t fit in HW, will be hard to implement or won’t have any sense… but… brainstorm! ):

  • Add a function stack so we can do recursive calls easily.

  • Increase limits ( instruction limit, const/shared memory, etc )

  • Better C++ integration. Make it more “OOP”.

  • Include CUDA in the Forceware drivers so we don’t need the CUDA Toolkit installer anymore.

  • More powerful synchronization mechanism. Add a “lock” critical section instruction to block writes into global memory.

  • Open-source some CUDA parts.

  • Improve Direct3D9 and OpenGL mechanism. Add DX10 support.

  • x64 and Vista support.

  • Write Java, .NET, python and D wrappers.

  • Drop prices on the Tesla units so we can play with them!

  • Better MultiGPU approach. Just do a function to enumerate all the CUDA devices. Do a CreateDevice() like Direct3D + GPU inter-communication features ( like shared buffers, synchronization, etc )

  • Add more examples to the SDK: raytracing, spatial structures iteration, etc…

FYI, that seems to be there already in the PTX VM.

Actually, especially Windows users might be more pleased with a driver that does not install as a graphics driver at all to run CUDA on machines with ATI display card or chip set integrated VGA.


Then this http://forums.nvidia.com/index.php?showtopic=39073&hl=inline ?

Perhaps NVIDIA planned it, but I’m almost sure all the functions are inlined and stackless. Ofc you can do your own stack in global memory and use stackless approaches, but it is a pain.

That would be perfect, but I suspect it is not possible. The GeForce are not real CUDA devices. CUDA functionality is “emulated” using Forceware drivers + shading units.

For that probably you need a real Tesla unit… ofc I can be completely wrong tho!

Thanks for starting this thread - we’re always interested in suggestions for improving CUDA.

It’s important to distinguish here between features that will require new hardware (e.g. traps, critical sections, more shared memory) and software changes to the CUDA programming model (e.g. better C++ support) that are possible today.

Santyhammer - GeForce products certainly are “real” CUDA devices - CUDA was designed to expose the hardware compute features in the G8x series. CUDA functionality is no more “emulated” than shaders are.

It would be possible to implement a function call stack using local memory, but we don’t do this today for performance reasons.

So is it possible then to decouple CUDA from the Forceware drivers? I tried that some time ago and couldn’t… I was using an ATI 2400 as primary adapter and a GF8500 as secondary. If I don’t install Forceware the CUDA examples don’t work (they show a message saying they need FW 162 or above)… but installing FW+Cat starts a driver war and finally my WinXP says “sorry, I cannot boot” :P

So I’m not sure if we want a CUDA runtime embedded inside FW or simply a CUDA installer completely independent of FW. Perhaps the second option is better… and after all you will need that for Tesla units.

To Eric:
Seems throw could be implemented as a library/compiler feature. Just declare a global variable and set it. In case multiple threads throw, the result is undefined anyway.
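A minimal sketch of that workaround, written as a serial C++ model rather than device code. `GridStatus`, `throw_code`, `kernel_body` and `launch` are hypothetical names; the status word stands in for a variable in device global memory, and the loop stands in for the grid of threads.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: one status word visible to every thread of the grid,
// standing in for a variable in device global memory.
struct GridStatus {
    long code = 0;  // 0 means no thread has thrown
};

// Model of a library-level throw(n): just store n. If several threads
// throw, whichever store lands last wins, so the result is one of the
// thrown values but which one is unspecified.
inline void throw_code(GridStatus& status, long n) {
    assert(n != 0);  // 0 is reserved for "OK" in this sketch
    status.code = n;
}

// Toy kernel body: thread tid "throws" 100 + tid if its input is negative.
inline void kernel_body(GridStatus& status, const int* data, std::size_t tid) {
    if (data[tid] < 0) {
        throw_code(status, 100 + static_cast<long>(tid));
    }
}

// Serial stand-in for a launch: run every thread, then read the status word.
// Unlike the throw(n) proposed earlier in the thread, nothing aborts early:
// every thread still runs to completion.
long launch(const int* data, std::size_t thread_count) {
    GridStatus status;
    for (std::size_t tid = 0; tid < thread_count; ++tid) {
        kernel_body(status, data, tid);
    }
    return status.code;
}
```

On real hardware the host would still have to read the status word back after the kernel finishes, so this gives the return code but not the immediate abort.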

My wishes:
(Software) Make ptxas throw something more meaningful than "Internal error"s so writing compiler backends to generate ptx is more feasible.
(Software) Full control over load/store instructions at high level. Enable the programmer to choose between ld.shared/local/global. Could use forced pointer cast or something
(Software) VC compatible name mangling for namespaces and stuff so we can watch globals defined in a namespace in VC.
(Software) #pragma to temporarily disable CSE, *+ to mad
(Software) Inlined ptx assembly, mad intrinsic
(Document) Detailed warp divergence/convergence spec.
(Hardware) Reduce warp size to eventually make true SPMD

CUDA is stackless - but the PTX virtual machine supports it. See the PTX manual for a description of call-by-value vs. call-by-reference and subroutine calls. This has not yet been brought to the (high level) CUDA spec though.

GeForce, Quadro and Tesla are all the same chips at the core. They differ in some on-die rendering features (antialiased lines on Quadro etc) and are of course mounted on different boards (better VRAM on Quadro etc). But the processors are really the same. So there is no reason why Tesla should behave differently to Quadro or GeForce. So if NVIDIA comes up with a non-graphics driver to talk to the card, it should work with all of them.

Don’t confuse the card driver with the rendering API (OpenGL/DX) or CUDA on top of it. For efficiency reasons, however, the API on top is usually tightly coupled to the driver functions, and for ease of installation both are packaged together. This is the package you download. So it should be perfectly possible to strip OpenGL and DX from what you perceive as “the graphics driver” and run CUDA without any graphics connection.

Of course, the marketing department might want to disallow certain features … :unsure:


  1. A low-latency method to return a single integer value from a kernel without having to do separate transfers:

I agree with Eric’s suggestion that there needs to be a way to immediately return some small amount of data from a kernel without having to do a separate transfer. Beyond what Eric says, there are non-exception, non-debugging uses for this. A typical CUDA use case for many people might be as follows:

  • transfer data to card

  • launch kernel to process big pile of data, resulting in small, variable sized result (possibly, or even frequently, zero-sized)

  • transfer variable-sized result back to host if size of result is non-zero

Perhaps we would use a prefix sum to work out the aggregate result size as well as pack the output data, then exit from all threads aside from the first thread in the first block, then use this mechanism to return this single number. This would be particularly helpful where having zero bytes to transfer back to the host is the common case.
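The prefix-sum step can be sketched as follows. This is a serial C++ model, not a parallel scan, and `exclusive_scan`, `pack_nonzero` and the "keep non-zero elements" predicate are assumptions made up for the illustration. What it shows is that one scan produces both the write offsets for packing and the single total the host needs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: offsets[i] = counts[0] + ... + counts[i-1].
// Returns the grand total, which is the single number the kernel would
// hand back to the host.
std::size_t exclusive_scan(const std::vector<std::size_t>& counts,
                           std::vector<std::size_t>& offsets) {
    offsets.resize(counts.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        offsets[i] = running;
        running += counts[i];
    }
    return running;
}

// Serial model of the kernel: each "thread" i keeps input[i] only if it is
// non-zero (a hypothetical predicate), and the scan tells it where to write.
std::size_t pack_nonzero(const std::vector<int>& input, std::vector<int>& packed) {
    std::vector<std::size_t> counts(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        counts[i] = (input[i] != 0) ? 1 : 0;
    }
    std::vector<std::size_t> offsets;
    const std::size_t total = exclusive_scan(counts, offsets);
    packed.assign(total, 0);
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (counts[i] != 0) {
            assert(offsets[i] < total);  // every kept element has a slot
            packed[offsets[i]] = input[i];
        }
    }
    return total;  // zero means the host can skip the result transfer
}
```

If `pack_nonzero` returns zero, there is nothing to copy back, which is the common case the wishlist item wants to make cheap.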

Related possibilities:

The ability to return one integer value per block in the same way…
The ability to return one bit per block, similarly
The ability to return a value (int, single bit, …) from each block that has some accumulator function applied to it (sum, and, or)
Or (most minimally) even the ability to return a single bit from the entire kernel invocation. Just one bit would be an improvement, and catches the most important case above (e.g. the ability to avoid the gpu->host transfer if there’s nothing to transfer).

  2. The ability to transfer CUDA contexts between host threads.

If you want to write a library that uses CUDA, and you’ve got to live with the library client’s threading model (i.e. you can’t just have a ‘CUDA worker thread’), and you want to support multi-threaded clients, you need to be able to transfer your CUDA context from thread to thread. Obviously synchronization is required so that nothing bad happens.

If there are clever work-arounds for these that avoid the drawbacks of the obvious approach (e.g. an extra transfer for the first case, or the sometimes unavailable option of launching a worker thread), I’d love to hear them.


Thanks Geoff for filling out on the kernel return options. That answers asadafag’s reply and also there is no doco to say if there is a different CUDA error code for a kernel that has TRAPed, not that we can access TRAP from C anyway.

Well how about helping us and being more forthcoming with engineering information?

  1. This is the BIG one: just consolidating many references through my posts for the last 4 months: fix bar.sync. The spec is wrong in the hardware. I found this out quite early on when trying to sort out convergence and warp size in my emulator. Nothing like writing a device emulator to get your head around the architecture quickly. Might seem an odd thing to do but when I first downloaded 0.8 there was no support for my dev platform (x86_64) and the emulator appeared brain damaged. Three days after seeing GPGPU for the first time I wrote the emulator in one Saturday and straight off it ran nearly 100 times as fast as the (unsupported) Nvidia emulator for the job I was doing - the MT library, being my very first piece of CUDA. Later I found being a single pass of gcc, that it was nearly 100 times as fast building a test as nvcc building a complete device application, for kernels around 100Kb of bincode. I started trying to get details of convergence not long after and those posts still remain unanswered.

Then stumbling upon the problem of warp size discrepancy between traditional CPUs and the GPU I used my convergence algorithm to solve this problem for emulating arbitrary warp sizes on a CPU with a warp size of one. Since I was using an unmodified gcc the algorithm had to work in the presence of real subroutine calls in divergent code as gcc won’t inline things that are huge. My appreciation of the problem clarified that convergence and emulating disparate warp size are almost identical and that is what we are doing when writing CUDA - having to manually insert syncs to convert a warp size of 32 into a warp size of the block size, only we don’t have the right sync to be able to do it in all circumstances. Now inserting syncs can be done automatically by the compiler if it has the right bar.sync that can be used in divergent code (not all threads have to visit) AND that does not need to be told in advance which threads to wait for. Once this is fixed it will make this model much more compelling as the preferred parallel programming architecture at the chip level (IMHO). More difficult to make mistakes.

I got here in a few months part time and see no evidence that Nvidia have realised this in the several years they have been working on it. Then there is a lot they are not telling us. Just clarifying that the programming model is SIMD at the block level and SPMD at the grid level helps: it is a vector processor with a configurable vector length up to 512x32 bit words. IMO shared memory and divergence are the two most important things in the architecture.

With the above analysis a function call stack can be shown to only require 1 return entry per block so can be implemented in per block memory that Nvidia left out of the PTX spec and the hardware, or it could be in shared memory. If coalescing hardware is fixed to do misalignments properly then per block memory becomes useful as a stack or we could go with shared as is. Of course you cannot pass parameters or store autos on a per block return stack. Call and return require implicit syncs as above. The PTX spec does not even say where it is storing current execution information for subsequent return.

Eric :)

Why do you nvidia guys keep insisting on using local memory?

What you need to do is improve error reporting and then allow shared memory to be used. So what if it’s a finite resource? It’s not that finite. And much of the time it’s just sitting empty.

Of course, give the option to use local memory too (preferably on a per-use basis, such as for register spilling, call stack, or indexed arrays). The ultimate thing, though, would be automatic caching of local memory in shared. Either through hardware, or a micro os. Preferably, the mechanism could be told which parts of local memory to cache.

And my big but simple wish:


I know there ain’t many caches. Just texture and constant (tho there could also be one for local memory). But, it would really help if we could use prefetching to max out bandwidth without needing a very large number of threads.

I’d bet this doesn’t even have to be a hardware feature, since there’s quite a bit of programmability in the fetching units from what I can tell.

Regarding reductions: That’s what the atomics in compute model 1.1 are for. E.g., call atomic sum on the same piece of global memory, and you’ll get a sum of every thread in the grid or even across multiple kernel executions. Then, just transfer that one int. You can also set bits. See Programming Guide section 4.4.3 and appendix C.

Regarding transferring contexts: I haven’t tried this myself, but isn’t that what cudaSetDevice() and cudaChooseDevice() are for? If not, there’s always the old-school driver API. It never lets you down. See section E.3, Context Management.

geoff, i’d say your ideas are very good seeing that they’d already been implemented ;-)

Seeing Eric’s response, now I want the emulator.
Does it emulate ptx, the cubin, or the C-like thing?
Whatever it does, it would be really useful for us in the dark.

Forgot one big wish:
(Driver/Hardware?) Further reduce API overhead.

Currently, almost every kernel launch of my algorithm spends 50~70us in the driver (texture binding, launching, etc). And I’m already using the driver API. If this could be reduced to cudaMemcpy’s 2us, my FPS might go up two or three times for small data sets.

My main wish would be to simplify the coalescing to load from device to shared memory and the write back.

First possibility :

Let’s say I have a struct A of size n

if all the threads call


shared_A[sid] = device_A[tid];

could the compiler automagically coalesce the transfer?

Second possibility :

Add a sharedMemcpy(dest, src, size) instruction that has to be called by all the threads and does the coalesced download.
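One way such a sharedMemcpy might behave is sketched below as a serial C++ model; this is not an existing CUDA call. `shared_memcpy_model`, `load_block` and the 3-word layout of `A` are assumptions for the illustration. The idea is to ignore struct boundaries and have the block copy 32-bit words, with thread t taking word `s * block_size + t` in step s, so that neighbouring threads always touch neighbouring words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical struct of n = 3 words, standing in for "struct A of size n".
struct A {
    std::uint32_t x, y, z;
};

// Serial model of a block-wide copy. In step s, thread t moves word
// s * block_size + t, so neighbouring threads touch neighbouring 32-bit
// words no matter how large the struct is.
// Returns the number of steps each thread executes.
std::size_t shared_memcpy_model(std::uint32_t* dest, const std::uint32_t* src,
                                std::size_t word_count, std::size_t block_size) {
    assert(block_size > 0);
    std::size_t steps = 0;
    for (std::size_t base = 0; base < word_count; base += block_size) {
        for (std::size_t t = 0; t < block_size; ++t) {  // all threads of the block
            const std::size_t word = base + t;
            if (word < word_count) {
                dest[word] = src[word];
            }
        }
        ++steps;
    }
    return steps;
}

// Copy an array of structs through the word-wise model.
std::vector<A> load_block(const std::vector<A>& device_A, std::size_t block_size) {
    static_assert(sizeof(A) == 3 * sizeof(std::uint32_t), "A must be 3 packed words");
    assert(!device_A.empty());
    const std::size_t words = device_A.size() * 3;
    std::vector<std::uint32_t> src(words), dst(words);
    std::memcpy(src.data(), device_A.data(), words * sizeof(std::uint32_t));
    shared_memcpy_model(dst.data(), src.data(), words, block_size);
    std::vector<A> shared_A(device_A.size());
    std::memcpy(shared_A.data(), dst.data(), words * sizeof(std::uint32_t));
    return shared_A;
}
```

By contrast, with the per-struct assignment `shared_A[sid] = device_A[tid]` each thread reads n consecutive words of its own, so neighbouring threads are n words apart on every access.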


Alex: please control your tendency to give smug, flippant answers when discussing topics that you don’t understand.

If you read nothing more than the first line of my first wishlist item, you’d have seen this:

“1 A low-latency method to return a single integer value from a kernel without having to do separate transfers”

Note the “without having to do separate transfers” bit. Your helpful suggestion that I use atomic sum to get the single value is irrelevant to the issue; thanks, but I wrote my first parallel prefix algorithm 12 years ago. I’m aware of the atomic operators. I’m well aware how to get a reduction to a single integer. What I don’t know how to do is return a single value to the host without a separate transfer after the kernel execution. I don’t know how many single-integer transfers you’ve timed on CUDA, but even a single integer transfer with either API has substantial latency.

“2. The ability to transfer CUDA contexts between host threads.”

Always a danger sign in any forum response:

“I haven’t tried this myself”…

I can tell. The calls you suggest choose among multiple graphics cards on a system, not contexts. The documentation (E.3.1, E.3.2, E.3.3) makes it look like this transfer might be possible with the driver API, but only if you read carelessly. I made this mistake when reading the docs initially, and had to be set straight by Mark Harris. If you think there’s a way to do this, go write some pthreads code that uses the same set of CUDA resources from two different threads and tell us about it.

That’s another good one. I think the launch latency of kernels and the latency of small transfers is a big problem for those of us who don’t have gigantic data sets and little concern for latency. CUDA’s breadth of applicability will only be improved if it can better scale downwards towards lower-latency use with smaller data sets (obviously you’ll still need to extract parallelism to make the use of CUDA appropriate).


I’d REALLY like to have both Mac OS X support and Solaris support for CUDA. Both of those OS’s support nVidia cards and drivers to some extent, and both are Unix based. It would make my case for using nVidia cards in some new products much easier.

Yet another wish:
(Driver/Hardware?) An option to disable some multiprocessors.

This would allow testing an algorithm’s scalability. Though CUDA kernel execution should scale linearly, API overhead, the CPU portion and calls on small data sets don’t. Testing on different cards can be problematic since other cards differ in more than the number of multiprocessors. Also, other cards are typically installed on other people’s machines, and they may not want to install a CUDA driver :(