Wishlist: place your considered suggestions here

64-bit Vista support is at the top of my list.

Second would be double precision support.

Third would be a more general architecture that would make it easy to implement non-vectorizable algorithms efficiently, such as k-d tree traversal and full ray tracing.

I realize that NVIDIA is working on all of these things, probably in that order, so for now I am a happy camper patiently waiting for my 64-bit Vista CUDA.

This introduces another reason why CUDA contexts should be able to communicate with one another: if you allocate with cudaMallocHost, that memory is locked, but another context cannot see it as locked, so sharing data between multiple GPUs requires at least one extra CPU-side memcpy before each can DMA it agnostically. Additionally, it presumably wouldn’t be too difficult to support device-to-device copies if contexts were aware of one another in the programming model.

A MUCH better solution, IMHO, is to have special flags that can be passed to the memcpy functions saying whether or not the memory is pinned, or, in the cases where it is necessary, requiring the memory to be pinned. Yes, this means the user has to be smart or risk crashing his operating system. The extra-nifty silver lining is that it should be even faster, because CUDA doesn’t have to check whether the memory is locked.

This would allow seamless use with things like mlock and mmap(MAP_LOCKED) (and the Windows equivalents), which would have been extremely helpful to me these last few months.
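To illustrate, a minimal sketch under that proposal; mlock/munlock are standard POSIX, while cudaMemcpyPinned below is a hypothetical call invented for this example, not a real CUDA API:

```c
#include <stdlib.h>
#include <sys/mman.h>   /* mlock / munlock */

int main(void)
{
    size_t bytes = 1 << 20;
    float *buf = (float *)malloc(bytes);

    /* Page-lock the allocation ourselves; today CUDA has no way to know
       this happened and falls back to a staging copy. */
    mlock(buf, bytes);

    /* Hypothetical call from the wish above: a variant (or flag) telling
       the driver "trust me, this is pinned, DMA directly from it". */
    /* cudaMemcpyPinned(d_buf, buf, bytes, cudaMemcpyHostToDevice); */

    munlock(buf, bytes);
    free(buf);
    return 0;
}
```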

A few other requests:

  • smarter register allocation. A lot of my kernels are limited to 64 threads/block (and 33% or less occupancy) because of register limitations

  • along the same lines, more registers; INDEXABLE registers would be even better, but I suppose more shared memory could accomplish that

  • streams are nice, but another paradigm that could be very useful (and is supported by CTM) is streaming computation, where the data isn’t copied up front via cudaMemcpy; instead, the CUDA threads themselves read the data in while performing computation (see the sketch below). This might help hide the latency of the copy, and it would allow, for example, keeping a larger dataset in GPU memory by not keeping temporary data there (read from CPU memory, write to CPU memory).
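For what it’s worth, a sketch of what that could look like, using the mapped pinned-memory (“zero-copy”) calls that later CUDA releases added (cudaHostAlloc with cudaHostAllocMapped plus cudaHostGetDevicePointer); N, blocks, and threads are placeholders, and process stands in for an arbitrary kernel:

```cuda
float *h_in, *h_out, *d_in, *d_out;

// Must be set before the context is created, and the device must
// support mapping host memory.
cudaSetDeviceFlags(cudaDeviceMapHost);

// Pinned host memory mapped into the device address space.
cudaHostAlloc((void **)&h_in,  N * sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void **)&h_out, N * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);
cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

// The CUDA threads read from and write to CPU memory themselves: no
// up-front cudaMemcpy, the transfers overlap with computation, and no
// temporary copy of the data occupies GPU memory.
process<<<blocks, threads>>>(d_in, d_out, N);
cudaThreadSynchronize();
```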

Thanks for the consideration.

Brian

I don’t think it’s that simple. What cudaMallocHost does is not only allocate pinned memory but also register that specific area with the GPU driver. The driver can then ‘track’ the memory, which is what allows the GPU to DMA into it.

So it can’t register the memory that is passed in? It seems that such a registration shouldn’t be all that expensive.
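For what it’s worth, registering caller-allocated memory after the fact is exactly what later CUDA releases ended up exposing as cudaHostRegister (CUDA 4.0 and up); a minimal sketch, with d_buf and stream assumed to exist already:

```cuda
float *buf = (float *)malloc(N * sizeof(float));

// Page-lock and register an existing allocation with the driver,
// making it usable for DMA just like cudaMallocHost memory.
cudaHostRegister(buf, N * sizeof(float), cudaHostRegisterPortable);

cudaMemcpyAsync(d_buf, buf, N * sizeof(float),
                cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);

cudaHostUnregister(buf);
free(buf);
```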

Yes, of course!!! It always is with floats. I already get different results in the vector-collapse code. They’re not wrong results, just different.

My wishes are as follows.

Software:

  1. cudaIDCT() and cuIDCT() APIs that would expose the VPU’s hardware IDCT (and possibly dequantization) capability in a manner suitable for JPEG image decoding (there are a lot of 4000x3000 JPEGs out there these days)

  2. CUDA libraries with more performance primitives, not only FFT and BLAS.

NVIDIA Management:

  1. Searchable list of known CUDA, Cg, OpenGL and DirectX driver issues

ForceWare drivers:

  1. ForceWare Driver support for SSE4.1 instructions (DPPS, MOVNTDQA)

  2. Import/export functionality for driver settings including resolution, color depth, refresh rate, and image quality in 3D applications

  3. Working anti-aliasing in games

  4. Disabling trilinear and anisotropic sample optimization should also disable it for OpenGL applications

  5. The person who designed the control panel should be taken to the back alley and shot; this deserves a longer explanation:

  • Nested scrollbars are the most horrid UI element in the known universe. Try doing something in the control panel with just a keyboard, or with a lousy touchpad.

  • That UI abomination should be redesigned so that everything fits in a 1024x768 resolution without the need for two levels of scrollbars, a tree, and those ugly tabbed dialogs.

  • Dropdown lists are impractical because you don’t see the other choices until you expand them.

  6. Reduce the number of things loading at startup. As far as I can tell, the service you install does nothing useful, at least for me. Ditto for nWiz and the various nThis and nThat processes, obscured by launching through rundll32 so that you might think you’ve got a trojan.

  7. Do not install the taskbar notification-area icon. That area was meant for quick access to important features and application feedback, but it got abused for marketing purposes. Ask yourself how often a regular user changes resolution, color depth, etc. That’s right: once, when they install the driver, and even that could be avoided if there were an import/export settings feature.

Hardware:

  1. Video cards with 1GB and 2GB of RAM that don’t cost an arm and a leg

  2. Mainstream video cards with 16-bit DACs per RGB channel

  3. Cooler-running cards (the 8800 GTX is a furnace)

  4. Smarter cooling solutions (whoever decided to let half of the hot air return into the case on the 8800 GTX should start looking for another job)

  5. External video cards with their own power supply in a single box, which can be daisy-chained and connected to the computer via an optical link (yes, whoever thought of those VHDCI connectors should also start looking for another job)

Please limit the discussion to CUDA features, there are appropriate forums for other requests.

Paulius

I’d like to second this one; it would help with the classic array-of-structs => struct-of-arrays problem.

And, of course, caching based on this for global memory: I just spent days reordering my data types just to get coalescing for my global memory accesses, and of course it also upped my register pressure. :(

SW / Compiler feature:

The ability to tell the compiler where a pointer points in device code for writes (shared vs. global) and for reads (shared vs. global vs. constant), so that I can start using pointers and pointer arithmetic.
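A minimal example of the ambiguity being described; nvcc cannot tell where p points, emits the “assuming global memory space” advisory, and guesses:

```cuda
__global__ void scatter(float *g_out)
{
    __shared__ float s_buf[64];

    // p may point to shared or to global memory depending on the thread.
    // The compiler cannot tell which, warns, and assumes global space;
    // a space qualifier on the pointer declaration would remove the guess.
    float *p = (threadIdx.x < 32) ? s_buf : g_out;
    p[threadIdx.x] = 1.0f;
}
```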

I think I just added Crysis to my games list.

I guess you posted to the wrong forum :P

I would like to see memory from cudaMallocHost shared between contexts. Specifically, I would like to be able to have multiple GPUs perform DMA copies to or from the same host buffer.

I tried sharing cudaMallocHost buffers in my multi-GPU code, but kept running up against segmentation faults and memory leaks. To fix the problem, I figured I would need to allocate locked buffers for each thread. That quadruples my physical memory requirements, requires the extra memcpy calls Brian mentioned, and eats up CPU time at each cudaMallocHost call. I guessed that the extra CPU work would probably offset any performance improvement a locked buffer would provide.

It would be wonderful if multiple devices could DMA data from the host concurrently, but I would get a performance boost even if the devices had to perform their DMAs sequentially.
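For concreteness, a sketch of the per-thread workaround being described, assuming one host thread drives each GPU (the names are placeholders):

```cuda
#include <string.h>

// Called from the host thread that owns `device`. Because pinned memory
// is per-context, each thread pins its own staging buffer, so the shared
// host buffer pays an extra CPU-side copy on the way to every GPU.
void upload(int device, const float *shared_buf, float **d_buf, size_t n)
{
    float *staging;
    cudaSetDevice(device);
    cudaMallocHost((void **)&staging, n * sizeof(float)); // pinned, this context only
    cudaMalloc((void **)d_buf, n * sizeof(float));

    memcpy(staging, shared_buf, n * sizeof(float));       // the extra copy
    cudaMemcpy(*d_buf, staging, n * sizeof(float),
               cudaMemcpyHostToDevice);                   // fast pinned DMA

    cudaFreeHost(staging);
}
```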

I have one very simple request: a command-line switch for nvcc to format error and warning messages in the Visual Studio style, so that we can be lazy and double-click on errors to be taken to them in the IDE.

I’m replying to all the posts.

I’m not sure the VPU has a dedicated hardware IDCT/DCT unit rather than a software (CUDA) routine used by the driver; that would be an unreasonable waste of silicon.

You can do an 8x8 DCT/iDCT in CUDA with great efficiency and full FP32 precision on your own…
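For anyone who wants to try, a sketch of such a separable 8x8 DCT-II (one 8x8 thread block per pixel block; width and height assumed to be multiples of 8; untuned, with coefficients computed on the fly instead of from a lookup table):

```cuda
#define PI 3.14159265358979f

// Separable 2D DCT-II on 8x8 blocks: out = C * in * C^T.
__global__ void dct8x8(const float *in, float *out, int width)
{
    __shared__ float blk[8][8];
    __shared__ float tmp[8][8];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int gx = blockIdx.x * 8 + tx;
    const int gy = blockIdx.y * 8 + ty;

    blk[ty][tx] = in[gy * width + gx];
    __syncthreads();

    // Column pass: thread (tx, ty) computes coefficient ty of column tx.
    float s = (ty == 0) ? sqrtf(0.125f) : 0.5f;
    float acc = 0.0f;
    for (int n = 0; n < 8; ++n)
        acc += blk[n][tx] * cosf((2 * n + 1) * ty * PI / 16.0f);
    tmp[ty][tx] = s * acc;
    __syncthreads();

    // Row pass: the same 1-D transform along the rows.
    s = (tx == 0) ? sqrtf(0.125f) : 0.5f;
    acc = 0.0f;
    for (int n = 0; n < 8; ++n)
        acc += tmp[ty][n] * cosf((2 * n + 1) * tx * PI / 16.0f);
    out[gy * width + gx] = s * acc;
}
```

Launch as dct8x8<<<dim3(width / 8, height / 8), dim3(8, 8)>>>(d_in, d_out, width); the iDCT has the same structure with the normalization factor applied to each summand instead of to the result.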

I also have some wishes:

  • a way to tell the compiler NOT to use local memory. I have a kernel that doesn’t use that many registers, but the compiler still insists on using local memory.
  • the ability to mark a pointer as pointing to shared or global memory, so I don’t get the advisory warnings.

I’d like to see MSVS2008 support in the next release.

MSVS2008 support in the next release is the most desired feature for me.

I would like to have sinf, cosf, etc. that do not use local memory!
Also, I think a warning is in order in the Programming Guide that these can use local memory. I have been pulling my hair out trying to find where my local memory usage came from…

sinf/cosf only use local memory if the compiler cannot find the couple of registers they need, e.g. if you are compiling with a maximum register count (nvcc’s -maxrregcount option).
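When reduced accuracy is acceptable, the hardware intrinsics sidestep the issue entirely; a minimal sketch (the kernel name and setup are illustrative):

```cuda
__global__ void wave(float *out, const float *phase, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // __sinf maps directly to the special-function hardware:
        // no extra registers, no local memory, but lower accuracy than sinf.
        out[i] = __sinf(phase[i]);
}
```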