Wishlist: place your considered suggestions here

We are always working on reducing driver overhead, but I think the biggest forthcoming benefit will be CUDA v1.1’s improved support for CPU/GPU overlap. The CPU will be able to continue executing code (including the driver) while the GPU is memcpy’ing or processing data, so driver overhead at least will be hidden as long as the GPU is busy.

Apps will need some updating to add the needed synchronization.
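For illustration, here is a minimal sketch of the kind of updating meant here, assuming the stream and asynchronous copy calls (cudaStreamCreate, cudaMemcpyAsync, cudaStreamSynchronize) that 1.1 is expected to expose; the kernel and buffer names are placeholders, not anything from the actual release:

// Hedged sketch: queue an async copy and a kernel on a stream so the CPU
// can keep working. Pinned host memory is needed for the copy to be
// truly asynchronous.
#include <cstring>
#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void run(const float *h_src, int n)
{
    float *h_pinned, *d_buf;
    cudaMallocHost((void **)&h_pinned, n * sizeof(float));
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    memcpy(h_pinned, h_src, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Both calls return immediately; the GPU copies and computes in the background.
    cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);

    // ... CPU keeps executing other code (including driver calls) here ...

    // The extra synchronization the post mentions: wait before touching the buffers again.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
}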

Alright, I’ll throw in what I’d like to see in CUDA (all of it software related):

(1) expose mul24.hi and mad24 in CUDA
(2) expose {mul|mad}.wide.* in CUDA (I haven’t checked what something like { u32 a, b; u64 prod = a*b; } looks like after translation to PTX, but I guess that should be the way; see the sketch after this list)
(3) inline assembly
(4) a way to call (or inline…) a function (.func) from a home-brewed PTX file. Getting the PTX code into the cubin is easy, but AFAIK there’s no way to declare an extern __device__ function and stop nvcc from complaining about not inlining. (OK, CUDA will most likely need to support calls for that one.)
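For item (2), a quick way to check is to compile a widening multiply with nvcc -ptx and inspect the output. A minimal test kernel, using only the existing __umul24 intrinsic for comparison (the kernel name and buffer layout are just placeholders for the experiment):

// Hedged sketch for wish (2): does a 32x32 -> 64-bit product map to
// mul.wide.u32 in the generated PTX? Compile with "nvcc -ptx" and look.
#include <cuda_runtime.h>

__global__ void widening_mul(const unsigned int *a, const unsigned int *b,
                             unsigned long long *prod, unsigned int *lo24)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // The pattern from the wish: u64 prod = (u64)u32 * u32.
    prod[i] = (unsigned long long)a[i] * b[i];

    // For comparison, the 24-bit multiply that is already exposed;
    // mul24.hi and mad24 have no equivalent intrinsic, which is wish (1).
    lo24[i] = __umul24(a[i], b[i]);
}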

Robert.

Is there a release date for CUDA v1.1? I’m very interested in this functionality. (Also, 64-bit Windows drivers would be great :magic: )

It does not use one byte of Nvidia technology, so it works from .cu files. It is great as a toolchain-independent verification tool (which is how I managed to post correct code in the RNG thread well before running any hardware). I am not sure what to do with it at present, as I think it is quite valuable but I am not using it.

AFAIK it does everything correctly per the hardware, except that it gets a different answer for the order of execution of divergent segments, since I am in the dark about Nvidia’s divergence/convergence algorithm too, and mine is definitely different to the G80’s. Of course none of the Nvidia libraries are supported (just not tested). It does handle concurrent writes to shared and global memory, guarantee atomicity of 64- and 128-bit writes, give an accurate diagnostic for a __syncthreads() that is unreachable from all threads, properly emulate warp size in divergent code (though there are grey areas here which are undocumented), and, best of all, detect race conditions in shared and global memory.

I had to give the history above to convince them to take me seriously, as a sync that all threads don’t have to reach sounds impossible, but I assure you it works, as it can be successfully emulated in surprisingly few lines of C.
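For readers wondering what detecting race conditions in shared memory means in practice, here is a minimal example of the kind of bug such a tool would flag (this kernel is my illustration, not Eric’s code):

// Hedged illustration: a shared-memory race. Each thread writes its own
// slot, then reads its neighbour's slot with no __syncthreads() ordering
// the write and the read. Launch with, e.g., 2 threads per block.
__global__ void racy(int *out)
{
    __shared__ int s[2];
    s[threadIdx.x] = threadIdx.x;            // every thread writes its own slot

    // Missing __syncthreads() here: the read below races with the write
    // performed by the neighbouring thread.
    out[threadIdx.x] = s[(threadIdx.x + 1) % 2];
}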

I can see I am going to get burned when the G92 comes out, as it sounds like warp size is going down to 16, and if sync is not fixed (too late now) I will have to redesign, since the dimensionality of my problem fitted nicely into a warp and my code is divergent and syncless.

Sorry to sound like I am complaining all the time, but since I posted a criticism of the device driver and suggested they should open source it (which was deleted from this forum by Nvidia on 4th May), they have not answered a single significant question.

Eric

That’s good to hear, but I’ve already reduced # of passes to the point that I have a cudaMemcpy after each kernel. Would 1.1 perform better in such cases?

To Eric:

Just out of curiosity, is your emulator based on macro processing, or an entire compiler + emulator toolchain?

A small wish:
(Driver?) Device memory defragmentation. I just tried to allocate 200 MB and got a failure when there was still 300 MB of free memory. I only allocate that once, and I’m OK with spending seconds on defragmentation.

Thanks for the feedback, sounds like the driver is coming together well. We wait in anticipation… Don’t suppose you can say if 1.1 is coming out before G92 or at the same time?

Wish 4, to Nvidia management: allow engineers to acknowledge the bugs and wishlist items posted here, saying perhaps 1) in the pipeline, 2) on the todo list, 3) under consideration, or no answer for a reject. I am sure you will get more bugs posted and ideas aired that could be worth quite a lot of real money, without giving too much away to the competition. It does help relieve the pain of having one’s time wasted by a toolchain bug.

OK, it is really very simple: just a header file and library that allow standard .cu files to be compiled directly with gcc in one pass. The fact that this could be done quite easily gave me a lot of confidence initially that CUDA was not too proprietary and that any investment in coding at this level would likely remain useful in the longer term. I think this is a really good marketing point.

The base-level emulator, like Nvidia’s (warp size of 1), can be implemented in a short header and not much more than one page of clever code, which is why it is so fast. I leave the rest, as they say, as an exercise for the reader; don’t ask any more questions, as I think it is an ideal Uni assignment. The full-function hardware emulator with all the features above only required 2 KB of object code in the emulator library.

One should be able to link with the Nvidia emulation math library (I have not tried) if you need closer-to-bitwise correctness of functions; however, that is not required for most applications and the standard library can be used.
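To give a flavour of what such a short-header, warp-size-1 emulator might look like, here is a rough sketch (my own, not Eric’s code). It dodges the <<< >>> launch syntax by substituting a LAUNCH macro, and its __shared__ trick only works because blocks and threads run one at a time:

// Hedged sketch of a "one header" CPU emulation layer for warp size 1.
// dim3, threadIdx, blockIdx mimic the CUDA built-ins as plain host globals.
#include <cstdio>

struct dim3 {
    unsigned x, y, z;
    dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1) : x(x_), y(y_), z(z_) {}
};

static dim3 gridDim, blockDim, blockIdx, threadIdx;

// CUDA qualifiers compile away on the host; __shared__ as static is only
// valid because blocks are executed one after another here.
#define __global__
#define __device__
#define __host__
#define __shared__ static

// A launch macro standing in for <<<grid, block>>>: run every thread of
// every block sequentially on the CPU.
#define LAUNCH(kernel, grid, block, ...)                                   \
    do {                                                                   \
        gridDim = (grid); blockDim = (block);                              \
        for (blockIdx.x = 0; blockIdx.x < gridDim.x; ++blockIdx.x)         \
            for (threadIdx.x = 0; threadIdx.x < blockDim.x; ++threadIdx.x) \
                kernel(__VA_ARGS__);                                       \
    } while (0)

// An ordinary CUDA-style kernel, now compilable by gcc/g++.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    float buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    LAUNCH(scale, dim3(2), dim3(4), buf, 2.0f, 8);
    for (int i = 0; i < 8; ++i) printf("%g ", buf[i]);
    printf("\n");
    return 0;
}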

In the spirit of the article linked at the top here, I am sure that fixing sync, and fixing the compiler to do it automatically, will significantly improve the usability of CUDA and leave warp size as a hardware parameter that most programmers can just ignore. There has to be a better API.

Cheers, Eric

ed: This topic is broken out to a new thread: Fast DIY device emulation

Eric’s exercise is a bit hard for me…
e.g. I haven’t figured out how to handle <<< >>> yet.

Another important wish:
(Driver?) Efficient reading back of GL render results.
Currently… glReadPixels + PBO lock = two memcpys :(
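For context, the current two-copy path looks roughly like this, assuming GLEW for the PBO entry points and the CUDA 1.x GL interop calls (cudaGLRegisterBufferObject and friends); error checking omitted, and pbo/width/height are assumed to already exist:

// Hedged sketch of the readback path being complained about.
#include <GL/glew.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

void readback_into_cuda(GLuint pbo, int width, int height)
{
    // First copy: the GL driver packs the framebuffer into the PBO.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);

    // Map the PBO for CUDA; depending on the driver this can imply a
    // second copy, which is the overhead the wish is about.
    void *devPtr = 0;
    cudaGLRegisterBufferObject(pbo);
    cudaGLMapBufferObject(&devPtr, pbo);

    // ... run kernels on devPtr ...

    cudaGLUnmapBufferObject(pbo);
    cudaGLUnregisterBufferObject(pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}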

BTW: Would 1.1 come out this week?

When will 1.1 be released? There are quite a few bugs in the current release that I reported, which turned out to be known and fixed for a while already in the internal development version, so IMO it’s really time for a new release.

Agreed! Unfortunately… as of mid-August the answer was November. I think 1.0 really wasn’t 1.0, and it’s too bad Nvidia isn’t eager to rectify it.

http://forums.nvidia.com/index.php?showtopic=35855 (post 16)

Thanks…
NVIDIA guys! Would you please just release a bug-fixed ptxas soon? I seem to have stumbled on the bug again today :(
OK, it turns out to be my own bug. Though I still want it…

Windows Server 2008 and Linux support on IA64 would be great. Those machines seem to have far more I/O capability than what I can find in x86 servers. Take the Altix 450 for instance. By this time next year the I/O advantage might be even greater due to Tukwila.

What I’d like:

  • be able to work on several GPUs from a single thread. (BTW I’d be very interested to know why the current one thread - one GPU restriction is required.)
  • have a passive equivalent to cudaThreadSynchronize() [i.e. one that does not eat 100% CPU. I guess the calling thread could sleep in the OS kernel and be woken by an interrupt].
  • maybe have a simple function that returns a boolean telling whether a GPU has finished its processing or not (a polling sketch for these last two items follows below).
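The last two items can at least be approximated by polling an event and sleeping between polls, assuming the event API (cudaEventRecord / cudaEventQuery); it is still polling, not a true interrupt-driven wait:

// Hedged sketch: poll the GPU and yield the CPU between polls so the
// waiting thread does not spin at 100%. usleep() is POSIX; on Windows
// Sleep() would be the stand-in.
#include <cuda_runtime.h>
#include <unistd.h>

// Nonzero if all GPU work issued before the event has finished.
int gpu_done(cudaEvent_t ev)
{
    return cudaEventQuery(ev) == cudaSuccess;
}

void wait_politely(cudaEvent_t ev)
{
    while (!gpu_done(ev))
        usleep(1000);          // sleep ~1 ms between polls
}

/* usage:
       cudaEvent_t ev;
       cudaEventCreate(&ev);
       kernel<<<grid, block>>>(...);
       cudaEventRecord(ev, 0);
       wait_politely(ev);      // instead of cudaThreadSynchronize()
       cudaEventDestroy(ev);
*/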

haha, boy is that a niche market. What do you mean by I/O?

I’d like to see in-place definitions of kernels, like

/* cpu code */
<<<griddim, blockdim>>>( args ){
    /* gpu code */
}
/* cpu code */

(how to make indentation in this board?)

I wish cublasSscal (int n, float alpha, float *x, int incx) was cublasSscal (int n, float *alpha, float *x, int incx).

In that case I could keep alpha in the GPU memory and reduce communication overhead.
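As an interim workaround, a hand-written scal kernel can already take alpha as a device pointer; a minimal sketch (the name is made up, this is not part of CUBLAS):

// Hedged sketch of a scal that reads alpha from device memory, so a scale
// factor produced by an earlier kernel never has to round-trip through the host.
__global__ void sscal_devalpha(int n, const float *alpha, float *x, int incx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i * incx] *= *alpha;
}

// launch, e.g.: sscal_devalpha<<<(n + 255) / 256, 256>>>(n, d_alpha, d_x, 1);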

I don’t see the point. Transferring a pointer to a float is at least as expensive as transferring the float itself.

EDIT: ah. Or do you mean that in your case alpha is the result of a previous cuda kernel ?

I/O as in PCIe lanes. It is hard to find any computer with more than 40 PCIe lanes these days, and I’d like to build a GPU-based “super computer”. Using only one computer as opposed to a cluster really simplifies things a lot for us. It means we could connect directly to the SAN and would have no need for a special cluster network or a slow GbE network. Actually, the computer itself with its PCIe links sort of becomes the cluster network, with each GPU being a “node”. Cheap and fast, I assume.

At the SC06 conference an 8-socket Opteron machine was shown that would fit my use very well (page 13): http://www.gpgpu.org/sc2006/slides/13.houston-cluster.pdf

However it only got 2 x 16 + 1 x 4 PCIe lanes, which is really not much considering the amount of data that box can push.

As for IA64 support, I’d like that because I could run the more complex algorithms that don’t play well with a GPU very fast on the CPU instead. Of course I could do that on an x64 server too. Getting Windows Server 2008 support would definitely be very helpful though, and I’m sure that will be a rather large market too, as opposed to the current size of the IA64 market. Is the work involved in porting to another ISA small, maybe less than porting to another OS?

In any case I don’t really like to use the Intel 5000 or 7300 chipsets because of limited memory and I/O bandwidth (yeah, that is still PCIe I’m talking about). AMD seems to be rather late in delivering the good K10 systems with 4x HTT 3.0, but IA64 already has machines with almost unlimited PCIe bandwidth.

Almost left out an important one:
(driver/hardware) A specified launch failure, i.e. an error message more informative than the current “unspecified launch failure”.

[hardware/software] - Already suggested and confirmed to be a possibility by Mark, but I’ll put it here too: the ability to pipeline data transfers to/from the card with the execution of a kernel.
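For reference, the way such pipelining is usually expressed with streams looks roughly like the sketch below; whether the copy and the kernel actually overlap depends on hardware and driver support, and the names here are placeholders:

// Hedged sketch of pipelined transfer + execution with two streams:
// while chunk k is being processed, chunk k+1 can be copied in.
// h_pinned must be pinned host memory; d_buf holds two device buffers.
#include <cuda_runtime.h>

__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void pipeline(float *h_pinned, float *d_buf[2], int chunk, int nchunks)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int k = 0; k < nchunks; ++k) {
        int b = k & 1;  // ping-pong between the two buffers/streams
        cudaMemcpyAsync(d_buf[b], h_pinned + k * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, s[b]);
        work<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d_buf[b], chunk);
    }

    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}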

Owen