Wishlist: place your considered suggestions here

64-bit Vista support is at the top of my list.

Second would be double precision support.

Third would be a more general architecture that would make it easy to implement non-vectorizable algorithms efficiently, such as k-d tree traversal and full ray tracing.

I realize that NVIDIA is working on all of these things, probably in that order, so for now I am a happy camper patiently waiting for my 64-bit Vista CUDA.

This introduces another reason why CUDA contexts should be able to communicate with one another: if you allocate with cudaMallocHost, that memory is locked, but another context cannot see it as locked, so sharing data between multiple GPUs requires at least one extra CPU-side memcpy before each can DMA it agnostically. Additionally, it presumably wouldn’t be too difficult to support device-to-device copies if contexts were aware of one another in the programming model.

A MUCH better solution, IMHO, is to have special flags that can be passed to the memcpy functions saying whether or not the memory is pinned, or, in the cases where it is necessary, requiring the memory to be pinned. Yes, this means the user has to be smart or risk crashing his operating system. The extra-nifty silver lining is that it should be even faster, because CUDA doesn’t have to check whether the memory is locked.

This would allow seamless use with things like mlock and mmap(MAP_LOCKED) (and the Windows equivalents), which would have been extremely helpful to me these last few months.
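To illustrate, a minimal sketch under that proposal; mlock/munlock are standard POSIX, while cudaMemcpyPinned below is a hypothetical call invented for this example, not a real CUDA API:

```c
#include <stdlib.h>
#include <sys/mman.h>   /* mlock / munlock */

int main(void)
{
    size_t bytes = 1 << 20;
    float *buf = (float *)malloc(bytes);

    /* Page-lock the allocation ourselves; today CUDA has no way to know
       this happened and falls back to a staging copy. */
    mlock(buf, bytes);

    /* Hypothetical call from the wish above: a variant (or flag) telling
       the driver "trust me, this is pinned, DMA directly from it". */
    /* cudaMemcpyPinned(d_buf, buf, bytes, cudaMemcpyHostToDevice); */

    munlock(buf, bytes);
    free(buf);
    return 0;
}
```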

A few other requests:

  • smarter register allocation. A lot of my kernels are limited to 64 threads/block (and 33% or less occupancy) because of register limitations

  • along the same lines, more registers; INDEXABLE registers would be even better, but I suppose more shared memory could accomplish that

  • streams are nice, but another paradigm that could be very useful (and is supported by CTM) is streaming computation, where the data isn’t copied up front via cudaMemcpy; instead, the CUDA threads themselves read the data in while performing computation (see the sketch below). This might help hide the latency of the copy, and it would allow, for example, keeping a larger dataset in GPU memory by not keeping temporary data there (read from CPU memory, write to CPU memory).
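For what it’s worth, a sketch of what that could look like, using the mapped pinned-memory (“zero-copy”) calls that later CUDA releases added (cudaHostAlloc with cudaHostAllocMapped plus cudaHostGetDevicePointer); N, blocks, and threads are placeholders, and process stands in for an arbitrary kernel:

```cuda
float *h_in, *h_out, *d_in, *d_out;

// Must be set before the context is created, and the device must
// support mapping host memory.
cudaSetDeviceFlags(cudaDeviceMapHost);

// Pinned host memory mapped into the device address space.
cudaHostAlloc((void **)&h_in,  N * sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void **)&h_out, N * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);
cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

// The CUDA threads read from and write to CPU memory themselves: no
// up-front cudaMemcpy, the transfers overlap with computation, and no
// temporary copy of the data occupies GPU memory.
process<<<blocks, threads>>>(d_in, d_out, N);
cudaThreadSynchronize();
```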

Thanks for the consideration.

Brian

I don’t think it’s that simple. What cudaMallocHost does is not only allocate pinned memory but also register that specific area with the GPU driver. The driver can then ‘track’ the memory, which is what allows the GPU to DMA into it.

So it can’t register the memory that is passed in? It seems that such a registration shouldn’t be all that expensive.
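For what it’s worth, registering caller-allocated memory after the fact is exactly what later CUDA releases ended up exposing as cudaHostRegister (CUDA 4.0 and up); a minimal sketch, with d_buf and stream assumed to exist already:

```cuda
float *buf = (float *)malloc(N * sizeof(float));

// Page-lock and register an existing allocation with the driver,
// making it usable for DMA just like cudaMallocHost memory.
cudaHostRegister(buf, N * sizeof(float), cudaHostRegisterPortable);

cudaMemcpyAsync(d_buf, buf, N * sizeof(float),
                cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);

cudaHostUnregister(buf);
free(buf);
```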

Yes, of course!!! It always is with floats. I already get different results in the vector-collapse code. They’re not wrong results, just different.

My wishes are as follows.

Software:

  1. cudaIDCT() and cuIDCT() APIs that would expose the VPU’s hardware IDCT (and possibly dequantization) capability in a manner suitable for JPEG image decoding (there are a lot of 4000x3000 JPEGs out there these days)

  2. CUDA libraries with more performance primitives, not only FFT and BLAS.

NVIDIA Management:

  1. Searchable list of known CUDA, Cg, OpenGL and DirectX driver issues

ForceWare drivers:

  1. ForceWare Driver support for SSE4.1 instructions (DPPS, MOVNTDQA)

  2. Import/export functionality for driver settings including resolution, color depth, refresh rate, and image quality in 3D applications

  3. Working anti-aliasing in games

  4. Disabling trilinear and anisotropic sample optimization should also disable it for OpenGL applications

  5. The person who designed the control panel should be taken to the back alley and shot; this deserves a longer explanation:

  • Nested scrollbars are the most horrid UI element in the known universe. Try doing something in the control panel with just a keyboard, or with a lousy touchpad.

  • That UI abomination should be redesigned so that everything fits in a 1024x768 resolution without the need for two levels of scrollbars, a tree, and those ugly tabbed dialogs.

  • Dropdown lists are impractical because you don’t see the other choices until you expand them.

  6. Reduce the number of things loading at startup. As far as I can tell, the service you install does nothing useful, at least for me. Ditto for nWiz and the various nThis and nThat processes, obscured by launching through rundll32 so that you might think you’ve got a trojan.

  7. Do not install the taskbar notification-area icon. That area was meant for quick access to important features and application feedback, but it got abused for marketing purposes. Ask yourself how often a regular user changes resolution, color depth, etc. That’s right: once, when they install the driver, and even that could be avoided if there were an import/export settings feature.

Hardware:

  1. Video cards with 1GB and 2GB of RAM that don’t cost an arm and a leg

  2. Mainstream video cards with 16-bit DACs per RGB channel

  3. Cooler-running cards (the 8800 GTX is a furnace)

  4. Smarter cooling solutions (whoever decided to let half of the hot air return into the case on the 8800 GTX should start looking for another job)

  5. External video cards with their own power supply in a single box, which can be daisy-chained and connected to the computer via an optical link (yes, whoever thought of those VHDCI connectors should also start looking for another job)

Please limit the discussion to CUDA features, there are appropriate forums for other requests.

Paulius

I’d like to second this one; it would help with the classic array-of-structs => struct-of-arrays problem.

And, of course, caching based on this for global memory: I just spent days reordering my data types just to get coalescing for my global memory accesses, and of course it also upped my register pressure. :(

SW / Compiler feature:

The ability to tell the compiler where a pointer points in device code for writes (shared vs. global) and for reads (shared vs. global vs. constant), so that I can start using pointers and pointer arithmetic.
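A minimal example of the ambiguity being described; nvcc cannot tell where p points, emits the “assuming global memory space” advisory, and guesses:

```cuda
__global__ void scatter(float *g_out)
{
    __shared__ float s_buf[64];

    // p may point to shared or to global memory depending on the thread.
    // The compiler cannot tell which, warns, and assumes global space;
    // a space qualifier on the pointer declaration would remove the guess.
    float *p = (threadIdx.x < 32) ? s_buf : g_out;
    p[threadIdx.x] = 1.0f;
}
```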

I think I just added Crysis to my games list.

I guess you posted to the wrong forum :P

I would like to see memory from cudaMallocHost shared between contexts. Specifically, I would like to be able to have multiple GPUs perform DMA copies to or from the same host buffer.

I tried sharing cudaMallocHost buffers in my multi-GPU code, but kept running up against segmentation faults and memory leaks. To fix the problem, I figured I would need to allocate locked buffers for each thread. That quadruples my physical memory requirements, requires the extra memcpy calls Brian mentioned, and eats up CPU time at each cudaMallocHost call. I guessed that the extra CPU work would probably offset any performance improvement a locked buffer would provide.

It would be wonderful if multiple devices could DMA data from the host concurrently, but I would get a performance boost even if the devices had to perform their DMAs sequentially.
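For concreteness, a sketch of the per-thread workaround being described, assuming one host thread drives each GPU (the names are placeholders):

```cuda
#include <string.h>

// Called from the host thread that owns `device`. Because pinned memory
// is per-context, each thread pins its own staging buffer, so the shared
// host buffer pays an extra CPU-side copy on the way to every GPU.
void upload(int device, const float *shared_buf, float **d_buf, size_t n)
{
    float *staging;
    cudaSetDevice(device);
    cudaMallocHost((void **)&staging, n * sizeof(float)); // pinned, this context only
    cudaMalloc((void **)d_buf, n * sizeof(float));

    memcpy(staging, shared_buf, n * sizeof(float));       // the extra copy
    cudaMemcpy(*d_buf, staging, n * sizeof(float),
               cudaMemcpyHostToDevice);                   // fast pinned DMA

    cudaFreeHost(staging);
}
```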

I have one very simple request: a command-line switch for nvcc to format error and warning messages in the Visual Studio style, so that we can be lazy and double-click on errors to be taken to them in the IDE.

I’m replying to all the posts.

I’m not sure the VPU has a dedicated hardware IDCT/DCT unit rather than a software (CUDA) routine used by the driver; that would be an unreasonable waste of silicon.

You can do an 8x8 DCT/iDCT in CUDA with great efficiency and full FP32 precision on your own…
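For anyone who wants to try, a sketch of such a separable 8x8 DCT-II (one 8x8 thread block per pixel block; width and height assumed to be multiples of 8; untuned, with coefficients computed on the fly instead of from a lookup table):

```cuda
#define PI 3.14159265358979f

// Separable 2D DCT-II on 8x8 blocks: out = C * in * C^T.
__global__ void dct8x8(const float *in, float *out, int width)
{
    __shared__ float blk[8][8];
    __shared__ float tmp[8][8];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int gx = blockIdx.x * 8 + tx;
    const int gy = blockIdx.y * 8 + ty;

    blk[ty][tx] = in[gy * width + gx];
    __syncthreads();

    // Column pass: thread (tx, ty) computes coefficient ty of column tx.
    float s = (ty == 0) ? sqrtf(0.125f) : 0.5f;
    float acc = 0.0f;
    for (int n = 0; n < 8; ++n)
        acc += blk[n][tx] * cosf((2 * n + 1) * ty * PI / 16.0f);
    tmp[ty][tx] = s * acc;
    __syncthreads();

    // Row pass: the same 1-D transform along the rows.
    s = (tx == 0) ? sqrtf(0.125f) : 0.5f;
    acc = 0.0f;
    for (int n = 0; n < 8; ++n)
        acc += tmp[ty][n] * cosf((2 * n + 1) * tx * PI / 16.0f);
    out[gy * width + gx] = s * acc;
}
```

Launch as dct8x8<<<dim3(width / 8, height / 8), dim3(8, 8)>>>(d_in, d_out, width); the iDCT has the same structure with the normalization factor applied to each summand instead of to the result.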

I also have some wishes:

  • a way to tell the compiler NOT to use local memory. I have a kernel that doesn’t use that many registers, but the compiler still insists on using local memory.
  • the ability to mark a pointer as pointing to shared or global memory, so I don’t get the advisory warnings.

I’d like to see MSVS2008 support in the next release.

MSVS2008 support in the next release is the most desired feature for me.

I would like to have sinf, cosf, etc. that do not use local memory!
Also, I think a warning is in order in the Programming Guide that these can use local memory. I have been pulling my hair out trying to find where my local memory usage came from…

sinf/cosf only use local memory if the compiler cannot find the couple of registers they need, e.g. if you are compiling with a maximum register count (nvcc’s -maxrregcount option).
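When reduced accuracy is acceptable, the hardware intrinsics sidestep the issue entirely; a minimal sketch (the kernel name and setup are illustrative):

```cuda
__global__ void wave(float *out, const float *phase, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // __sinf maps directly to the special-function hardware:
        // no extra registers, no local memory, but lower accuracy than sinf.
        out[i] = __sinf(phase[i]);
}
```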