Wishlist: place your considered suggestions here

Yeah, Appendix A of the programming guide is extremely useful… it’s the most-used reference in the manual… yet it has no table-of-contents link in the PDF, so it’s a real waste of time to scroll to it every time.

I also agree on an HTML version; it would make searching easier too, plus instant access from any computer, plus Google-style searching…

In VS2005, right-click on a variable, “Go To Definition” :)
instead of Ctrl+Shift+F every time :)

seconded. matrix inversion would be much appreciated. inversion of large arrays of (same sized) matrices, even better.

I’d like to add an update on the expected release date for version 2 to the wish list.

I hope that in the next version we can access OpenGL texture objects directly from CUDA instead of going through a PBO, and that OpenGL interoperability is enhanced in general.

I’d like to see some documentation on cutil.

Can’t you just look in cutil.h? It’s got doxygen-style commenting in there.

I’d like to see a “secret” registry key that enables the use of CUDA on Vista for devices that don’t have the desktop extended onto them. By making the reg key undocumented and unofficial, the drivers would still pass WHQL certification because Microsoft wouldn’t stumble across this option during testing.

A similar key could be provided to disable or configure the watchdog timer, given that this is under the control of the video driver (and not controlled by Windows itself).

  1. eigen decomposition for very large (dense) matrices
  2. fastICA (independent component analysis)

(library or code samples for highly efficient, highly parallel implementations.)

“static” keyword support would make my life much easier.

Kernels do NOT support static data, and that’s an understandable limit of the hardware model. The compiler enforces this by disallowing the use of the static keyword on any variable declaration.

But this sucks, because half the use of “static” is for constant definitions, which are a compiler and coding idiom, not an execution-model feature.
So a nice, concise, type-checked constant definition like:

static const int LoopCount = 1000;
int i;
for (i = 0; i < LoopCount; ++i) { /* do something */ }

is illegal! Yes, you can rewrite it with a #define, or even use a real variable and hope the compiler optimizes it away, but this means you now need two versions of some functions. And for a C++ class, static constants are especially useful, since otherwise your only option is the #define.

So please let us define static const class and local function variables, even if just of the primitive int, float, and double types. Such a definition won’t interfere with the generated code; it will just make our source much cleaner and safer and allow us to reuse functions on both CPU and GPU.
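
To make the trade-off concrete, here is a minimal sketch of the kind of device function we end up writing today (the function name and values are made up for illustration). The #define works, but it loses the type checking and scoping that static const would give:

// Today: a macro is the usual way to get a named constant in device code.
#define LOOP_COUNT 1000

__device__ float accumulate(const float *data)
{
    float sum = 0.0f;
    for (int i = 0; i < LOOP_COUNT; ++i)   // macro: no type, no scope
        sum += data[i];
    return sum;
}

// Wished for: a typed, scoped constant that compiles unchanged on CPU and GPU.
// __device__ float accumulate(const float *data)
// {
//     static const int LoopCount = 1000;  // currently rejected by the compiler
//     ...
// }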

I’d like to see some sort of File Exchange where people can submit and rate kernels from the community. Something similar to the “Matlab Central File Exchange” would be great. I don’t think many people would hesitate to submit there, whereas they do hesitate to post in the forum. Right now there are only the few SDK samples and a couple of pieces of code published with papers, but I’d like to see a central file-exchange site.

It would be nice to have a compiler flag to disable the use of the fmad instruction.
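
Until such a flag exists, a per-operation workaround is the __fadd_rn()/__fmul_rn() intrinsics from the programming guide, which the compiler never merges into an FMAD. A minimal sketch (the function name is just for illustration):

__device__ float mad_unfused(float a, float b, float c)
{
    // __fmul_rn/__fadd_rn are individually rounded and never contracted into FMAD,
    // unlike the plain expression a*b + c, which the compiler is free to fuse
    return __fadd_rn(__fmul_rn(a, b), c);
}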

I want the shared qualifier to be applicable to pointers. In CUDA a pointer needs a piece of information about which memory space it points to. This is all inferred automatically and works well, unless you try fancy stuff with structs (or sometimes for no apparent reason) and get a message along the lines of “no idea what this pointer points to, assuming global memory.” It should be possible to tell the compiler explicitly in such situations.

I want a local keyword. This would also be useful with pointers, and even more useful with file-scope variables! Why are all file-scope variables perversely shared between threads, with no way to stop this? A per-thread qualifier would make porting code much easier, and there are some optimization uses as well.

I want a register keyword that would force a variable or array into registers. If this is not possible (e.g., the array is dynamically indexed, or there’s a maxrregcount cap), the compiler would tell you immediately.

CUDA’s memory hierarchy is its most important facet. It’s an outrage that the programmer has such uncertain control over it; this only adds to the confusion.

Also, local memory should be faster. Every access to local memory should generate a coalesced read from DDR: e.g., if you access myLocalArray[i], the fetch should effectively go to myLocalArray[i][threadIdx.x]. Testing suggests local memory is not arranged in this obvious way (it seems more like localMem[threadIdx.x].myLocalArray[i]).
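
In the meantime, a manual workaround is to keep the per-thread arrays in a global buffer you allocate yourself and interleave them by thread, so that when every thread touches element i the warp’s accesses fall on consecutive addresses and coalesce. A minimal sketch (the buffer name and sizes are made up):

// scratch holds N ints per thread, allocated with cudaMalloc as
// (total thread count) * N * sizeof(int)
__global__ void clear_scratch(int *scratch, int N)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * gridDim.x;

    // element i of "my" array lives at scratch[i * nthreads + tid], so for a
    // fixed i the half-warp touches consecutive words and coalesces
    // (assuming the total thread count is a multiple of 16)
    for (int i = 0; i < N; ++i)
        scratch[i * nthreads + tid] = 0;
}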

Yes. This would be nice.

I’d like a __vote() primitive, returning a warp-wide bitmask of thread booleans.

Compute capability 1.2 introduced the __any() and __all() vote primitives. These can be useful, allowing you to switch a warp’s behavior based on a single thread’s state.

I have a hardware wishlist request: a generalization of these primitives into a full warp-wide communication method. __vote() would evaluate a boolean (0 or nonzero) status flag per thread and return a 32-bit condition word to all threads, where each bit of the condition word is the bit provided by the corresponding thread.

This warp-wide communication method could be extremely useful especially in doing compaction, expansion, or reduction.

A prime example of how you might use it: your warp may have some threads that want to write a result to a linear list in shared (or global) memory. You want the first thread with data to write its result to memory location X, the second thread with data to write to X+1, the third to X+2, and so on. This is a standard compaction algorithm… but it’s a lot of work. The standard method is to use 32 shared memory words and run a multi-pass prefix-sum over them in 5 passes, using about 15 operations across all threads. And if you have some diverged threads, you can’t even do a normal reduction; you have to use a pessimistic algorithm that is robust to divergence. Such an algorithm also uses 32 words of shared memory but about 30 operations.

But with a vote primitive, we can do the sum reduction with no shared memory by doing a bit count!

z=__vote(I_have_data);  // boolean flags 

z=z&~(0xFFFFFFFF<<threadnum); // mask out bits from myself and higher threads

index=__popc(z); // count the number of threads below me.

Boom, we did a warp-wide compaction in three lines of code (condensable to one!) that works even with diverged (disabled) threads and NO shared memory. A current alternative is to use shared memory atomics, but that could take 32 operations, assuming atomics take one clock each, which is probably an underestimate.

This same compaction allows us to do reductions or sum-prefix operations on warps with diverged threads. We compact the warp first with the above three lines of code, then use that renumbering to address temporary shared memory indices for a standard reduction or sum-prefix. Since we “filtered out” the disabled threads, we can now use the classic algorithms after the reordering.

Quite often you’re even doing simple integer reductions of just one bit anyway… “count how many threads have some state.” That’s simple: just __popc(__vote(x)). This is useful for choosing what kind of computation to iterate a warp over… for example, if 16 or more threads have state X, you choose the computation that works on that most popular kind of thread.

If you need a sum reduction with a very limited numeric range, something like

sum = (__popc(__vote(x>>1)) << 1) + __popc(__vote(x&1));

returns a warp-wide sum reduction of per-thread values in the range 0-3.

We can also do warp-wide atomics with __vote(). The existing __ffs() function returns the index of the first set bit of a value, so we can easily pick out one unique warp-wide thread index with a simple __ffs(__vote(have_work)). You might use this to have all threads react to that single thread’s state, or just as a simple test to disable all threads except one. You could also use it in a while() condition to iterate sequentially over the threads of interest.

If other algorithms have to deal with potentially diverged threads, it may be useful to use the simple result of __vote(1) to learn which threads are diverged or not… the return value is a bitmask of the currently active threads. This is something that’s been asked for: http://forums.nvidia.com/index.php?showtop…ndpost&p=447582

A less common use might be to use a single warp to do arbitrary bit swizzles, reversing, permuting, or duplicating any bit. An example of reversing bits would be just: reverse=__vote(value & (1<<(31-threadnum))). This produces one warp-wide reversal, not a per-thread one, but it can still be useful. The alternative is some messy code: http://forums.nvidia.com/index.php?showtop…ndpost&p=408991

My main motivation for a __vote() primitive is to clean up my current code which uses a whole page of code just to dump completed results from subsets of threads to device memory. The compaction is quite ugly and unintuitive especially when you may have diverged threads. A __vote() primitive would give me a speedup, as well as enormously simpler code (one line, not 30.)

Is the effect of __vote() achievable now? Certainly. The two strategies (both using some temporary shared memory) are shared-memory atomics and a divergence-aware reduction. But the main point of __vote() would be to replace these methods with something simpler, faster, and easier.
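
Putting the pieces together, here is a minimal sketch of the compaction idiom described above, written against the proposed __vote() (which does not exist yet); the names warp_compact_write, out, value, and have_data are made up for illustration:

// Sketch only: __vote() is the proposed primitive, not an existing intrinsic.
__device__ void warp_compact_write(int *out, int value, bool have_data)
{
    unsigned int threadnum = threadIdx.x & 31;     // lane index within the warp
    unsigned int z = __vote(have_data);            // one bit per thread in the warp
    z = z & ~(0xFFFFFFFFu << threadnum);           // keep only threads below me
    int index = __popc(z);                         // how many of them have data

    if (have_data)
        out[index] = value;  // first thread with data writes out[0], the next out[1], ...
}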

I’d like to be able to have an array of texture references instead of having to declare each one statically
i.e.
texture<unsigned int, 2, cudaReadModeElementType> textures[4];
instead of
texture<unsigned int, 2, cudaReadModeElementType> texture1;
texture<unsigned int, 2, cudaReadModeElementType> texture2;
texture<unsigned int, 2, cudaReadModeElementType> texture3;
texture<unsigned int, 2, cudaReadModeElementType> texture4;

and then being able to do

tex1Dfetch(textures[i],index);
where i is a kernel parameter.

Since textures are limited to 128MB in linear memory, and cudaArrays are substantially slower, this would allow multiple 128MB input buffers to be processed by separate streams concurrently (stream i uses texture reference i as input).

Either that, or increase the linear memory allowance so that 1GB of global memory can be used more effectively.

EDIT: I realise greater use of page-locked memory may cause system degradation overall, but at least allow each application to obtain its own optimum amount (since any host with > 2GB RAM can easily feed a 128MB input size).

My current program spends 92% of total time loading input data to the kernel, 6% processing, and 2% copying results. It’s a killer!
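
For what it’s worth, the only workaround I know of today for the texture-array wish above is to declare the references statically and dispatch on an integer kernel parameter. A sketch (tex0..tex3, fetch, and which are illustrative names; each reference is bound to its own buffer with cudaBindTexture on the host):

// tex1Dfetch needs 1D texture references bound to linear memory
texture<unsigned int, 1, cudaReadModeElementType> tex0;
texture<unsigned int, 1, cudaReadModeElementType> tex1;
texture<unsigned int, 1, cudaReadModeElementType> tex2;
texture<unsigned int, 1, cudaReadModeElementType> tex3;

__device__ unsigned int fetch(int which, int index)
{
    switch (which) {                     // 'which' comes in as a kernel parameter
        case 0:  return tex1Dfetch(tex0, index);
        case 1:  return tex1Dfetch(tex1, index);
        case 2:  return tex1Dfetch(tex2, index);
        default: return tex1Dfetch(tex3, index);
    }
}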

Half-precision (16-bit float) load/store support would be of huge benefit for iterative solvers. The hardware supports it; is there any timeframe for this?
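
To sketch what I mean, assuming conversion intrinsics along the lines of __float2half_rn()/__half2float() (16-bit storage as unsigned short) get exposed, the solver’s vectors could live in memory at half precision while all arithmetic stays in 32-bit float:

// axpy with half-precision storage; arithmetic is still 32-bit float
__global__ void axpy_half(unsigned short *y, const unsigned short *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xf = __half2float(x[i]);          // 16-bit load, widened to float
        float yf = __half2float(y[i]);
        y[i] = __float2half_rn(a * xf + yf);    // computed in float, stored as 16 bits
    }
}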

Top of my wish list are memory optimization directives:

register: would force a variable into registers, even short arrays. This is my most important wish.

shared_local: would force a variable LOCAL to one thread into shared memory. This is a “nice to have” wish.

nodevice: a combination of register and shared_local where the compiler can decide. This is the least important wish.

I would guess that the most important of these directives (register) is also the easiest to implement.

Usage scenarios:

These directives should especially help scenarios where single-threaded functions are simply ported to kernel functions in order to use multiple processors, and where you don’t have multiple threads working on common large data structures. One example area is running simulations on the GPU: there you have small input data (usually simulation bounds and seeds), small output data, and each kernel thread tends to run for a long time without needing access to large data structures. Avoiding the use of local/device memory is therefore possible and could give huge performance improvements.

Basically, it can benefit any algorithm that can be executed in parallel but does not have large data I/O.

register explanation:

This one is needed because the compiler right now will put any long structure or array in local memory. So if I have a kernel that needs something like this:

int seed, a, b, c;
int n[8];

for (int i = 0; i < 8; i++) n[i] = 0; // example of accessing the array

n[seed % 8] += a;

this should easily fit in 16 registers per kernel thread, but unfortunately the compiler will put n[8] into local memory, significantly slowing down the kernel. There is no elegant workaround for this; what I use right now is:

int seed, a, b, c;
int n0=0, n1=0, n2=0, n3=0, n4=0, n5=0, n6=0, n7=0;

switch (seed % 8) {
  case 0: n0 += a; break;
  case 1: n1 += a; break;
  ...
  case 7: n7 += a; break;
}

The above is just an example of accessing short array elements, not a real kernel, and while it is very unreadable and against my normal C practices, it still performs hundreds of times faster than letting the compiler put the array in local memory.

So what I want here is:

__register__ int seed, a, b, c;
__register__ int n[8];

for (int i = 0; i < 8; i++) n[i] = 0; // example of accessing the array

n[seed % 8] += a;

and if the compiler cannot fit the variables marked register into registers, it should report an error.

shared_local explanation:

This one is a bit harder to explain and implement than the above, but has the same motivation: avoiding local/device memory. In cases where I need more “register” memory for a single kernel thread than I have registers, I would like to be able to use shared memory as thread-local memory. I can do that even now for arrays by indexing into shared memory by thread ID, but I cannot do it for a per-thread int, float, or struct, and shared memory layout is also geared toward avoiding bank conflicts between multiple threads, which is not what you want for thread-local usage.

For example, if I have the following variables local to a thread (that I would like to keep in shared memory only because register space is full):

int seed;

int n[32]; 

int a;

for (int i=0; i<32; i++) n[i]=seed+a*i; //example of usage

Right now I could work around that by indexing into a shared array, like:

extern __shared__ int shared[];

...

const int numberOfRegularSharedInts = 1024;
const int threadSharedLocalSize = 1 + 32 + 1;
int idx = threadIdx.x * threadSharedLocalSize + numberOfRegularSharedInts;

#define map(k) shared[idx + (k)]
#define seed map(0)
#define n(k)  map(1 + (k))
#define a     map(33)

for (int i = 0; i < 32; i++) n(i) = seed + a*i; // example of usage

but that has disadvantages like:

  • complicated to write

  • error prone (you need to compute the indexes carefully)

  • lower performance due to the addition and indexing: a simple a=3, instead of an assignment to a static address, takes 3 basic instructions: addition (idx + offset), indexing, and assignment

  • most importantly, it is not optimized to avoid bank conflicts

This last one is the most important reason why doing it by hand is hard: right now consecutive addresses in shared memory are in different banks, to avoid conflicts when multiple threads work on consecutive data elements. BUT in this case we want a single thread to work on consecutive addresses. So in the above example, n(0)=1 in thread #0 can clash with n(0)=3 in thread #1.

The solution would be to put consecutive integers belonging to a single thread in the same bank (the banks repeating every 16 integers, if I understood correctly).

so the above example would look something like:

extern __shared__ int shared[];

...

const int numberOfRegularSharedInts = 1024;
const int threadSharedLocalSize = 48; // needs to be divisible by 16
int idx16 = threadIdx.x % 16;
int idxM  = threadIdx.x / 16;

#define map(k) shared[numberOfRegularSharedInts + idxM*16*threadSharedLocalSize + idx16 + (k)*16]
#define seed map(0)
#define n(k)  map(1 + (k))
#define a     map(33)

for (int i = 0; i < 32; i++) n(i) = seed + a*i; // example of usage

What I’m suggesting here is much simpler:

__shared_local__ int seed;
__shared_local__ int n[32];
__shared_local__ int a;

for (int i = 0; i < 32; i++) n[i] = seed + a*i; // example of usage

Advantages of that would be:

  • the compiler would work out the index math, eliminating the chance of errors

  • the compiler can map non-array integers to fixed addresses, so “int a;” would map to “shared[1316]” at compile time, making it as fast as if a were a register variable (as opposed to 2-3 times slower when something like shared[1000+n] is used)

  • the compiler could even map structs in the above way (doing that by hand would be a nightmare)

  • the compiler can “upgrade” __shared_local__ to registers whenever enough registers are available

  • it would be laid out to avoid bank conflicts

nodevice explanation:

This one is not as essential as the other two. It is a combination of register and shared_local where the compiler can decide which one to use, as long as it is not device (local) memory.

Some guidelines for the compiler could be:

  • give preference to registers, just as it does now

  • spread usage evenly between registers and __shared_local__, increasing the possibility of more concurrent threads

For example, if my kernel function needs 768 nodevice integers, total registers per multiprocessor are 8192 int32 values, and let’s say total shared memory is 4096 int32 values per multiprocessor, then the compiler could split the variables into:

  • 512 int32 values into registers, used for the more important variables

  • 256 int32 values into __shared_local__, used for less important variables, arrays, etc.

This way one kernel thread would use 1/16th of both the register and shared pools, allowing 16 threads to run at the same time on one multiprocessor. Without such a “smart” split, if everything went to registers for example, only 10 threads would run in parallel.

This split could be done manually, but since the number of registers and the amount of shared memory per multiprocessor will change across CUDA devices, the compiler could optimize the split if it knows which type of device it is compiling for (say, a 1.1 directive). When tomorrow we switch to a 2.1 device, it could re-optimize without the programmer having to revisit every kernel function (which would be needed with manual balancing).

I ended up writing some nice macros for doing local shared memory as you say. They even worked well with structures (by using the offsetof() macro). The problems ended up being these: you’ve got a lot less shared memory than registers (4x less on the GT200), and shared memory is significantly slower than registers. Indexing calculations take up a lot of room, but even without those it’s fundamentally slower by about 50%.

But what you’re saying, if it were automated instead of introducing new keywords, would be cool. For example, any array that’s dynamically indexed simply can’t go into registers, even if it’s only 8 entries. If you’re not using the smem for anything else, the compiler might as well give you 25% (or 50%) more “registers” before spilling into local.

Basically, whenever the compiler wants to use local memory, it should see if it can steal some smem. I’m not in support of a separate smem_local or nodevice keyword.