CUDA 4.1 suggested improvements.

Reviewing my CUDA 3.3 suggestions: some are already present in CUDA 4.0 (see below)…

These still aren’t present in 4.0, so I hope they get implemented in 4.1:
*3D surface writes
*2D surface arrays (or surfaces on 2D layered textures, in CUDA 4.0 parlance)
*Some way of sourcing grid and block sizes from GPU memory, similar to DispatchIndirect in DirectCompute
*True Visual Studio 2010 compiler support (not the current hack)
*Support for Apple Lion developer seeds
*gcc 4.5 support?

Also, OpenCL support is getting old; we need:
*Full CL 1.1 support
*The 3d_image_writes extension (AMD supports it on its GPUs, and Apple Lion supports it in CPU mode as well)
*Some extension exposing 2D image arrays, plus a related 2d_image_array_write extension
*A similar DispatchIndirect-style extension
*A pinned host memory equivalent for OpenCL
*clCreateEventFromGLsyncKHR for efficient OpenGL interop

On OpenGL:
*ARB_cl_event extension support

Improvements in CUDA 4.0:
*3D grid support (PTX files now contain correct code, but the runtime fails)
*2D texture arrays
*cuda-gdb Fermi support on Mac OS…
Additionally:
*Surfaces can be created from OpenGL/DX textures
*Copying between contexts is possible…
Add to that the new hack for using TCC mode on Windows, and NVIDIA now saying GPUDirect 2.0 is supported on the GeForce line (maybe still only in the Linux world, since unified addressing seems impossible under WDDM, although GPUDirect 2.0 can also work without unified addressing…)

Thanks

+1

I haven’t checked CUDA 4.0 yet, so I don’t know if this has been implemented, but I guess not. I would like to see these features:

4D textures
4D FFT
Easy 3D and 4D indexing

I want 5D textures, 5D FFT and easy 4D and 5D indexing

More seriously:

  • Support for 2D layered textures in CUDA 4.0 is really great… but I would love to get surface access on them!
  • A compiler option to force non-tagged functions to be host device by default, instead of host-only. Seems especially easy for inlined functions.
    -> That would allow using third-party libraries (like Boost, Loki, an external vector library…) from kernels!

Per-kernel specification of some compiler flags (like the one that disables L1 caching for global reads/writes) would be nice, similar to the __launch_bounds__ extension.

?

I am serious, I work with 4D medical image processing.

Request noted, but this isn’t really a CUDA feature request - the hardware doesn’t support 4D texturing (and I doubt it ever will).

Interestingly, SGI did have an extension for this back in the old days (mainly intended for color look up tables, apparently):

http://www.opengl.org/registry/specs/SGIS/texture4D.txt

Something which would be nice would be if the grid dimensions could be four bytes, rather than the current two. I have to slice up a 9D space sometimes, and so I end up packing that into 1D. And the length of that dimension often exceeds 65536, meaning that I have to translate the 9D index into 1D, and then split that 1D index back up into 2D. Note that I don’t actually need more blocks per grid, just more freedom in indexing them. I’m thinking of something like the thread indices themselves, where (at least pre-Fermi) you could have 512 threads in a block, and place them all along the x direction if you wanted.

OK, 4D textures are not the top priority from my side, but a 4D FFT would be nice. I’ve implemented one myself, as four 1D FFTs, but I’m sure you can do it better.

A device-side linker would also be an awesome feature for CUDA 4.x!

Lattice QCD tasks could make use of 6D textures (e.g. a 64x64x64x64x4x4 array of floats.) And I’m being (half) serious.

Is the 2^27 limit on the size of 1D textures hardware- or software-driven?

More things:
*The CUDA video encoder library could be ported to Linux, similar to how CUVID is now supported on Windows (only the C API, not DirectShow, of course).
Related to cuvenc: it seems VC-1 decoding has been disabled (in the 265.xx driver series and later) after being featured in CUDA 3.2… could NVIDIA explain why? Is it going to be re-enabled?
Lastly, it seems from the CUVID headers that MVC decoding is exposed; it would be awesome to expand the CUVID sample to show its use…
That would expose Fermi’s hardware MVC support on Linux, since it still seems unsupported in VDPAU…
Finally, one note…
Regarding AMD’s optimized OpenCL BLAS and FFT libraries: could NVIDIA provide some CUDA/OpenCL interop extension, i.e. some way to map OpenCL buffers to CUDA pointers and vice versa, so we can use CUFFT, CUBLAS and more from the OpenCL world? With that, plus a common wrapper around the NVIDIA and AMD BLAS and FFT libraries, we could achieve almost-single-source “performance portable” code using BLAS and FFT libraries…

+1

auto and lambdas from C++0x

would be so helpful. There are more great C++0x features, but these two are the greatest!!! (The third is variadic templates, with Thrust in mind.)

How about

  • The ability to map depth buffer textures.
  • The ability to sample OpenGL textures, in a black-box kind of way.
  • CSAA/MSAA surfaces would be fun too.

Cheers
.ola

What about atomics on doubles (64 bit floating point numbers)?

Guess it’s your lucky day; you don’t even have to wait for CUDA 4.1 for that one. Taken from the (4.0) programming guide, section B.11:

“Note however that any atomic operation can be implemented based on atomicCAS() (Compare And Swap). For example, atomicAdd() for double-precision floating-point numbers can be implemented as follows:”

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

Improve C++98 compilation times!
nvcc can be really, really slow when it comes to template metaprogramming
(a lot slower than gcc, for example).

Thanks

Matthieu