CUDA 4.1 suggested improvements.

Reviewing my CUDA 3.3 suggestions: some are already present in CUDA 4.0 (see below)…

These still aren’t present in 4.0, so I hope they get implemented in 4.1:
*3D surface writes
*2D surface arrays (or surfaces on 2D layered textures, in CUDA 4.0 parlance)
*Some way of sourcing grid and block sizes from GPU memory, similar to DispatchIndirect in DirectCompute
*True Visual Studio 2010 compiler support (not the current hack)
*Support for Apple Lion developer seeds
*gcc 4.5 support?

Also, OpenCL support is getting old; we need:
*Full CL 1.1 support
*The 3d_image_writes extension (AMD supports it on its GPUs, and Apple Lion supports it in CPU mode as well)
*Some extension exposing 2D image arrays, plus a related 2d_image_array_write extension
*A similar DispatchIndirect-style extension
*A pinned host memory equivalent for OpenCL
*clCreateEventFromGLsyncKHR for efficient OpenGL interop

On OpenGL:
*ARB_cl_event extension support

Improvements in CUDA 4.0:
*3D grid support (PTX files now contain correct code, but the runtime fails)
*2D texture arrays
*cuda-gdb Fermi support on Mac OS…
Additionally:
*Surfaces can be created from OpenGL/DX textures
*Copying between contexts is possible…
Add to that the new hack for using TCC mode on Windows, and NVIDIA now saying GPUDirect 2.0 is supported on the GeForce line (maybe still only in the Linux world, since unified addressing seems impossible under WDDM, although GPUDirect 2.0 can also work without unified addressing…)

Thanks

+1

I haven’t checked CUDA 4.0 yet, so I don’t know if this has been implemented, but I guess not. I would like to see these features:

4D textures
4D FFT
Easy 3D and 4D indexing

I want 5D textures, 5D FFT and easy 4D and 5D indexing

More seriously:

  • Support for 2D layered textures in CUDA 4.0 is really great… but I would love to get surface access on them!
  • A compiler option to force non-tagged functions to be host device by default, instead of host-only. Seems especially easy for inlined functions.
    -> That would allow using third-party libraries (like Boost, Loki, an external vector library…) from kernels!

Per-kernel specification of some compiler flags (like the one that disables L1 caching for global reads/writes) would be nice, similar to the __launch_bounds__ extension.

?

I am serious, I work with 4D medical image processing.

Request noted, but this isn’t really a CUDA feature request - the hardware doesn’t support 4D texturing (and I doubt it ever will).

Interestingly, SGI did have an extension for this back in the old days (mainly intended for color look up tables, apparently):

http://www.opengl.org/registry/specs/SGIS/texture4D.txt

Something which would be nice would be if the grid dimensions could be four bytes, rather than the current two. I have to slice up a 9D space sometimes, and so I end up packing that into 1D. And the length of that dimension often exceeds 65536, meaning that I have to translate the 9D index into 1D, and then split that 1D index back up into 2D. Note that I don’t actually need more blocks per grid, just more freedom in indexing them. I’m thinking of something like the thread indices themselves, where (at least pre-Fermi) you could have 512 threads in a block, and place them all along the x direction if you wanted.

OK, 4D textures are not the top priority from my side, but a 4D FFT would be nice. I’ve implemented one myself, as four 1D FFTs, but I’m sure you can do it better.

A device-side linker would also be an awesome feature for CUDA 4.x!

Lattice QCD tasks could make use of 6D textures (e.g. a 64x64x64x64x4x4 array of floats.) And I’m being (half) serious.

Is the 2^27 limit on the size of 1D textures hardware- or software-driven?

More things:
*The CUDA video encoder library could be ported to Linux, similar to how CUVID is now supported on Windows (only the C API, not DirectShow, of course).
Related to cuvenc: it seems VC-1 decoding has been disabled (in the 265.xx driver series and later) after being featured in CUDA 3.2… could NVIDIA explain why? Is it going to be re-enabled?
Lastly, it seems from the CUVID headers that MVC decoding is exposed; it would be awesome to expand the CUVID sample to show its use…
That would expose Fermi’s hardware MVC support on Linux, since it still seems unsupported in VDPAU…
Finally, one note…
Regarding AMD’s optimized OpenCL BLAS and FFT libraries: could NVIDIA provide some CUDA/OpenCL interop extension, i.e. some way to map OpenCL buffers to CUDA pointers and vice versa, so we can use CUFFT, CUBLAS and more from the OpenCL world? With that, plus a common wrapper around the NVIDIA and AMD BLAS and FFT libraries, we could achieve almost-single-source “performance portable” code using BLAS and FFT libraries…

+1

auto and lambdas from C++0x

would be so helpful. There are more great C++0x features, but these two are the greatest!!! (The third is variadic templates, with Thrust in mind.)

How about

  • The ability to map depth buffer textures.
  • The ability to sample OpenGL textures, in a black-box kind of way.
  • CSAA/MSAA surfaces would be fun too.

Cheers
.ola

What about atomics on doubles (64 bit floating point numbers)?

Guess it’s your lucky day; you don’t even have to wait for CUDA 4.1 for that one. Taken from the (4.0) programming guide, section B.11:

“Note however that any atomic operation can be implemented based on atomicCAS() (Compare And Swap). For example, atomicAdd() for double-precision floating-point numbers can be implemented as follows:”

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

Improve C++98 compilation times!
nvcc can be really, really slow when it comes to template metaprogramming
(a lot slower than gcc, for example).

Thanks

Matthieu