Wishlist: place your considered suggestions here

We are always working on reducing driver overhead, but I think the biggest benefit forthcoming will be CUDA v1.1’s improved support for CPU/GPU overlap. The CPU will be able to continue executing code (including the driver) while the GPU is memcpy’ing or processing data, so driver overhead at least will be hidden as long as the GPU is busy.
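For app writers wondering what that overlap might look like in practice, here is a rough sketch. The API names are my guesses at the announced 1.1 interface, not confirmed, and async copies will presumably need page-locked host memory (e.g. from cudaMallocHost):

```cuda
// Hypothetical 1.1-style overlap (API names are assumptions):
// queue work on a stream, keep the CPU busy, synchronize later.
cudaStream_t stream;
cudaStreamCreate(&stream);

// The async copy returns immediately; h_in would need to be
// page-locked for a true asynchronous transfer.
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<grid, block, 0, stream>>>(d_in, d_out);

doCpuWork();                     // overlaps with the GPU work above

cudaStreamSynchronize(stream);   // the app-side sync the post mentions
```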

Apps will need some updating to add the needed synchronization.

Alright, I’ll throw in what I’d like to see in CUDA (all software-related):

(1) expose mul24.hi, mad24 in CUDA
(2) expose {mul|mad}.wide.* in CUDA (though I haven’t checked what something like { u32 a, b; u64 prod = a*b; } looks like after translation to PTX; I guess that should be the way)
(3) inline assembly
(4) a way to call (or inline…) a function (.func) from a home-brewed PTX file. Getting the PTX code into the cubin is easy, but AFAIK there’s no way to declare a device extern and stop nvcc from complaining about not inlining. (OK, CUDA will most likely need to support calls for that one.)


Is there a release date for CUDA v1.1? I’m very interested in this functionality. (Also, 64-bit Windows drivers would be great :magic: )

It does not use one byte of NVIDIA technology, so it works from .cu files. It is great as a toolchain-independent verification tool (which is how I managed to post correct code in the RNG thread well before running on any hardware). I am not sure what to do with it at present, as I think it is quite valuable but I am not using it.

AFAIK it does everything correctly per the hardware, except that it gets a different answer for the order of execution of divergent segments, since I am in the dark about NVIDIA’s divergence/convergence algorithm; mine is definitely different to the G80’s. Of course, none of the NVIDIA libraries are supported (just not tested).

It does handle concurrent writes to shared and global memory, guarantee atomicity of 64- and 128-bit writes, give an accurate diagnostic for a __syncthreads() unreachable from all threads, properly emulate warp size in divergent code (though there are grey areas here which are undocumented), and, best of all, detect race conditions in shared and global memory. I had to give the history above to convince them to take me seriously, as a sync that all threads don’t have to reach sounds impossible, but I assure you it works: it can be successfully emulated in surprisingly few lines of C.

I can see I am going to get burned when the G92 comes out, as it sounds like warp size is going down to 16, and if sync is not fixed (too late now) I will have to redesign: the dimensionality of my problem fitted nicely into a warp, and my code is divergent and syncless.

Sorry to sound like I am complaining all the time, but since I posted a criticism of the device driver and suggested they should open-source it (which was deleted from this forum by NVIDIA on 4th May), they have not answered a single significant question.


That’s good to hear, but I’ve already reduced the number of passes to the point that I have a cudaMemcpy after each kernel. Would 1.1 perform better in such cases?

To Eric:

Just out of curiosity, is your emulator based on macro processing, or an entire compiler + emulator tool chain?

A small wish:
(Driver?) Device memory defragmentation. I just tried to allocate 200 MB and got a failure when there was still 300 MB of free memory. I only allocate that once, and I’m OK with spending seconds on defragmentation.

Thanks for the feedback; it sounds like the driver is coming together well. We wait in anticipation… Don’t suppose you can say whether 1.1 is coming out before the G92 or at the same time?

Wish 4, to NVIDIA management: allow engineers to acknowledge bugs and wishlist items posted here, saying perhaps 1) in the pipe, 2) on the todo list, 3) under consideration, or no answer for a reject. I am sure you will get more bugs posted and ideas aired that could be worth quite a lot of real money, without giving too much away to the competition. It does help relieve the pain of having one’s time wasted by a toolchain bug.

OK, it is really very simple: just a header file and library that allow standard .cu files to be compiled directly with gcc in one pass. The fact that this could be done quite easily gave me lots of confidence initially that CUDA was not too proprietary and that any investment in coding at this level would likely remain useful in the longer term. I think this is a really good marketing point.

The base-level emulator, like NVIDIA’s (warp size of 1), can be implemented in a short header and not much more than one page of clever code, which is why it is so fast. I leave the rest, as they say, as an exercise for the reader. Don’t ask any more questions, as I think it is an ideal uni assignment.

The full-function hardware emulator with all the features above only required 2 KB of object code in the emulator library. One should be able to link with the NVIDIA emulation math library (I have not tried) if you need closer-to-bitwise correctness of functions; however, that is not required for most applications and the standard library can be used.

In the spirit of the article linked at the top here, I am sure that fixing sync, and fixing the compiler to do it automatically, will significantly improve the usability of CUDA and leave warp size as a hardware parameter that most programmers can just ignore. There has to be a better API.

Cheers, Eric

ed: This topic is broken out to a new thread: Fast DIY device emulation

Eric’s exercise is a bit hard for me…
e.g. I haven’t figured out how to handle <<< >>> yet.

Another important wish:
(Driver?) Efficient read-back of GL render results.
Currently… glReadPixels + a PBO lock = two memcpys :(
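For what it’s worth, the single-copy path being wished for looks roughly like this with the GL interop calls, a sketch, untested (assumes the pixel-pack-buffer extension is available and that the mapped pointer is directly usable by a kernel):

```cuda
// Sketch (untested): pack pixels into a PBO on the GPU, then map
// the PBO into CUDA's address space instead of copying to the host.
GLuint pbo;                        // an existing pixel buffer object
cudaGLRegisterBufferObject(pbo);   // one-time registration

glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);
glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, 0); // device-side pack

void *d_ptr;
cudaGLMapBufferObject(&d_ptr, pbo);  // no host round trip
// ... run a kernel on d_ptr ...
cudaGLUnmapBufferObject(pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
```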

BTW: Would 1.1 come out this week?

When will 1.1 be released? There are quite a few bugs in the current release that I reported, which turned out to have been known and fixed for a while already in the internal development version, so IMO it’s really time for a new release.

Agreed! Unfortunately… as of mid-August the answer was November. I think 1.0 really wasn’t 1.0, and it’s too bad NVIDIA isn’t eager to rectify it.

http://forums.nvidia.com/index.php?showtopic=35855 (post 16)

NVIDIA guys! Would you please just release a bug-fixed ptxas soon? I seem to have stumbled on the bug again today :(
OK, it turns out to be my own bug. Though I still want it…

Windows Server 2008 and Linux support on IA64 would be great. Those machines seem to have far more I/O capability than what I can find in x86 servers. Take the Altix 450, for instance. By this time next year the I/O advantage might be even greater due to Tukwila.

What I’d like:

  • be able to work on several GPUs from a single thread (BTW, I’d be very interested to know why the current one-thread-one-GPU restriction is required).
  • have a passive equivalent to cudaThreadSynchronize() [i.e. one that does not eat 100% CPU; I guess the calling thread could sleep in the OS kernel and be woken by an interrupt].
  • maybe have a simple function that returns a boolean telling whether a GPU has finished its processing or not.
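The third item could be built on an event-query call, if 1.1 ships one; a sketch (the function names here are assumptions, not a confirmed API):

```cuda
// Hypothetical non-blocking completion check (names assumed):
cudaEvent_t done;
cudaEventCreate(&done);

myKernel<<<grid, block>>>(args);
cudaEventRecord(done, 0);          // marker queued after the kernel

// Poll instead of spinning inside cudaThreadSynchronize():
if (cudaEventQuery(done) == cudaSuccess) {
    // GPU has finished everything up to the marker
}
```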

haha, boy is that a niche market. What do you mean by I/O?

I’d like to see in-place definitions of kernels, like

/* cpu code */
<<<griddim, blockdim>>>( args ){
    /* gpu code */
}
/* cpu code */

(how to make indentation in this board?)

I wish cublasSscal (int n, float alpha, float *x, int incx) were cublasSscal (int n, float *alpha, float *x, int incx).

In that case I could keep alpha in the GPU memory and reduce communication overhead.
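To illustrate the overhead: today an alpha produced on the device has to round-trip through the host before cublasSscal can consume it. A sketch (d_alpha and d_x are illustrative names for device buffers):

```cuda
// Current situation: alpha computed by a previous kernel lives in
// d_alpha, but cublasSscal wants it by value on the host.
float h_alpha;
cudaMemcpy(&h_alpha, d_alpha, sizeof(float),
           cudaMemcpyDeviceToHost);     // forces a GPU->CPU sync
cublasSscal(n, h_alpha, d_x, 1);

// With a pointer-taking variant, both the copy and the implied
// synchronization would disappear:
// cublasSscal(n, d_alpha, d_x, 1);
```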

I don’t see the point. Transferring a pointer to a float is at least as expensive as transferring the float itself.

EDIT: ah. Or do you mean that in your case alpha is the result of a previous CUDA kernel?

I/O as in PCIe lanes. It is hard to find any computer with more than 40 PCIe lanes these days, and I’d like to build a GPU-based “super computer”. Using only one computer, as opposed to a cluster, really simplifies things a lot for us. It means we could connect directly to the SAN and would have no need for a special cluster network or a slow GbE network. Actually, the computer itself with its PCIe links sort of becomes the cluster network, with each GPU being a “node”. Cheap and fast, I assume.

At the SC06 conference, an 8-socket Opteron machine was shown that would fit my use very well (page 13): http://www.gpgpu.org/sc2006/slides/13.houston-cluster.pdf

However, it only got 2 x 16 + 1 x 4 PCIe lanes, which is really not much considering the amount of data that box can push.

As for IA64 support, I’d like that because I could run the more complex algorithms that don’t play well with a GPU very fast on the CPU instead. Of course, I could do that on an x64 server too. Getting Windows Server 2008 support would definitely be very helpful though, and I’m sure that will be a rather large market too, as opposed to the current size of the IA64 market. Is the work related to porting to another ISA small, maybe less than porting to another OS?

In any case, I don’t really like to use the Intel 5000 or 7300 chipsets because of limited memory and I/O bandwidth (yeah, that is still PCIe I’m talking about). AMD seems to be rather late in delivering the good K10 systems with 4x HT 3.0, but IA64 already has machines with almost unlimited PCIe bandwidth.

Almost left out an important one:
(driver/hardware) A *specified* launch failure, i.e. error reporting more informative than the current “unspecified launch failure”.

[hardware/software] Already suggested, and confirmed to be a possibility by Mark, but I’ll put it here too: the ability to pipeline data transfer to/from the card with the execution of a kernel.