CUDA 5.0 suggestions:

*NVML API support for GPU reset, like nvidia-smi has, and also Windows support for reset, at least in TCC mode…
*NVIDIA UVA support for the Windows WDDM driver
*CUDA JIT like OpenCL (runtime compilation): now that the compiler is LLVM based, of course the CUDA code would have to be kept separate from host code (similar to .cl files)… With that feature we shouldn't need Visual Studio anymore for runtime compilation of CUDA kernels (currently it is only needed for preprocessing the code); see the sketch after this list
*CUDA linker support like OpenCL 1.2 (separate compilation and linking of device code)
*3D image writes in OpenCL
*Of course all the new Kepler stuff I can't know about yet (I hope we get compute and graphics whitepapers at CES),
but similar to AMD Graphics Core Next I hope we can enqueue work from the GPU, make syscalls in kernel code, and get GPU interrupt support, so the GPU can send work to the CPU without finishing the GPU kernel and without CPU polling
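
To make the runtime compilation / linker wishes concrete, here is a minimal sketch of the OpenCL 1.2 flow (clCreateProgramWithSource → clCompileProgram → clLinkProgram) that a CUDA equivalent could mirror. The file name kernels.cl and the kernel name "filter" are placeholders, and error handling is mostly omitted.

```c
/* Sketch of OpenCL-style runtime compilation and linking; a CUDA
 * equivalent could follow the same shape. "kernels.cl" and "filter"
 * are placeholder names. */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

static char *load_source(const char *path, size_t *len)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    *len = (size_t)ftell(f);
    fseek(f, 0, SEEK_SET);
    char *src = malloc(*len + 1);
    fread(src, 1, *len, f);
    src[*len] = '\0';
    fclose(f);
    return src;
}

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* Runtime compilation: the device code lives in its own file and is
     * handed to the driver as a string; no host compiler is involved. */
    size_t len;
    char *src = load_source("kernels.cl", &len);
    cl_program prog = clCreateProgramWithSource(ctx, 1,
                                                (const char **)&src, &len, &err);

    /* OpenCL 1.2 separate compile + link step (the "linker support" above). */
    err = clCompileProgram(prog, 1, &device, "-cl-fast-relaxed-math",
                           0, NULL, NULL, NULL, NULL);
    cl_program exe = clLinkProgram(ctx, 1, &device, NULL,
                                   1, &prog, NULL, NULL, &err);

    /* The kernel is now ready to set arguments on and enqueue. */
    cl_kernel kernel = clCreateKernel(exe, "filter", &err);
    printf("kernel built: %s\n", err == CL_SUCCESS ? "ok" : "failed");

    clReleaseKernel(kernel);
    clReleaseProgram(exe);
    clReleaseProgram(prog);
    clReleaseContext(ctx);
    free(src);
    return 0;
}
```

The point is that nothing in this flow ever touches the host compiler: the device source is just a string given to the driver at runtime.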

I also hope Kepler gets:
*Suspend to graphics RAM (a WDDM 1.2 feature)
*improved context switch granularity: I hope for at least per-triangle/primitive/thread-group vs. the current per-draw-call; even better would be per-pixel/vertex, and per-thread in compute…
*a HW video encoder similar to AMD VCE, exposed via the CUDA video encoder API… (so existing apps like vReveal 3.2 work without recompilation)
Not GPGPU-relevant:
*HW support for partially resident textures (aka sparse textures, virtual textures, etc.)…
*DP 1.2 XBR and HDMI at 3 GHz
*DP DMM audio
*>256 GB/s graphics RAM bandwidth

Sorry, I almost forgot: also some enhanced visual computing support (aka OpenGL/DX CUDA interop).
(Some wishes reposted from "OpenCL for realtime rendering: What's missing?" | Anteru's Blog.)
Please include these in CUDA 5.x and OpenCL (in OCL as extensions):
No support for reading the depth buffer: Binding a depth buffer with 24-bit depth is not possible at all; binding a depth buffer with 32-bit depth stored as float still requires a copy between the depth buffer and a 32-bit float texture. This is just ridiculous, as the data is already on the GPU. Use cases are plenty: every deferred shading implementation on the GPU wants access to the depth buffer to be able to compute the world-space position. Being able to use a 32-bit depth texture would resolve 50% of the problems. The ideal case would be the ability to map (in DirectX parlance) DXGI_FORMAT_D24S8 and DXGI_FORMAT_R32_TYPELESS textures, the former because it provides the best performance and the latter because it would allow sharing the depth buffer between OpenCL and pixel shaders.
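
Just to illustrate the extra copy being complained about, here is a rough sketch of what one has to do today: the depth buffer is first copied by a fullscreen pass into a GL_R32F colour texture, and only that copy can be wrapped as an OpenCL image (CUDA interop is analogous). It assumes a CL context created with GL sharing; depth_copy_tex and deferred_kernel are placeholder names.

```c
/* Today's workaround for depth access from OpenCL: wrap an R32F *copy*
 * of the depth buffer. Direct D24S8 / R32_TYPELESS mapping would make
 * the copy unnecessary. "ctx" must be a context created with CL/GL
 * sharing (CL_GL_CONTEXT_KHR etc.); names are placeholders. */
#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/gl.h>

cl_mem bind_depth_copy(cl_context ctx, cl_command_queue queue,
                       cl_kernel deferred_kernel, GLuint depth_copy_tex)
{
    cl_int err;

    /* Wrap the GL_R32F copy of the depth buffer as a read-only CL image. */
    cl_mem depth_img = clCreateFromGLTexture2D(ctx, CL_MEM_READ_ONLY,
                                               GL_TEXTURE_2D, 0,
                                               depth_copy_tex, &err);

    /* GL must be finished with the texture before CL may touch it. */
    glFinish();
    clEnqueueAcquireGLObjects(queue, 1, &depth_img, 0, NULL, NULL);

    /* Hand the image to the (hypothetical) deferred-shading kernel. */
    clSetKernelArg(deferred_kernel, 0, sizeof(cl_mem), &depth_img);
    /* ... enqueue the kernel here ... */

    clEnqueueReleaseGLObjects(queue, 1, &depth_img, 0, NULL, NULL);
    return depth_img;
}
```
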
No mip-mapped texture support: OpenCL only allows binding a single mip-map level of an image. I would definitely like to bind a full mip-map chain; for instance, implementing a fast volume raytracer is much easier if I can access a mip-mapped min/max texture for acceleration. Using global memory to emulate mip-mapped data structures results in reduced performance and super-ugly code, especially if interpolation is used. There is some hope that this will be added, as cl_image_desc already has a num_mip_levels field. An immediate use case for me is the already mentioned volume rendering, but there are also a lot of image filtering tasks where access to all mip-map levels would be very helpful, plus some other use cases as well (for instance, updating a virtual texture memory page table). Even worse, it can already be done today, with super-ugly code that binds each mip-map level to a separate image object.
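
And this is roughly what that "super-ugly" per-level workaround looks like: each mip level is wrapped as a separate image object, and the kernel has to select levels by hand. Again only a sketch, assuming a GL-sharing context and placeholder names; a 3D min/max volume would be handled analogously via clCreateFromGLTexture3D.

```c
/* Per-level workaround: wrap every mip level of a GL texture as its own
 * OpenCL image object. Proper mip-mapped image support would replace
 * this loop (and the manual level selection inside the kernel) with a
 * single binding. "ctx" must be a CL context created with GL sharing. */
#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/gl.h>

#define MAX_LEVELS 16

int wrap_mip_chain(cl_context ctx, GLuint tex, int num_levels,
                   cl_mem levels[MAX_LEVELS])
{
    cl_int err;
    for (int i = 0; i < num_levels && i < MAX_LEVELS; ++i) {
        /* One image object per mip level. */
        levels[i] = clCreateFromGLTexture2D(ctx, CL_MEM_READ_ONLY,
                                            GL_TEXTURE_2D, i, tex, &err);
        if (err != CL_SUCCESS)
            return -1;   /* caller releases the levels created so far */
    }
    return 0;
}
```
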
No multi-sampled image support: Reading an MSAA'ed image is a must-have for high-quality rendering; writing would be nice but is not that crucial. Again, support seems to be coming, as cl_image_desc also has a num_samples field. The main use case I have in mind is high-quality deferred shading, where I would definitely like to use an MSAA'ed framebuffer and depth buffer.
[Update] Why is it important to access the depth buffer directly? Because you benefit from the hardware compression during reads (reducing the required bandwidth). This is even more important for multi-sampled buffers, where the hardware compression can really do wonders. After copying to a normal texture, the compression is lost.