Wishlist: place your considered suggestions here

My wish list:
* Kernel programming capability
* Reduced kernel invocation latency

Details here:
[url="http://forums.nvidia.com/index.php?showtopic=70231"]http://forums.nvidia.com/index.php?showtopic=70231[/url]

I have a situation where I need a lot of pinned memory for a special device. I can’t set it up by getting the pinned memory from the CUDA side; I have to allocate it shortly after booting to get enough of it.

What I’d like is a way to register this pinned memory with CUDA to get the transfer speed benefit.

It’s possible that this is already available in some obscure way, but I’ve not been able to find it.
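Roughly, the call I’m wishing for would look like this. The register call is the wished-for part (cudaHostRegister here is a name I’m assuming; the rest is the ordinary runtime API):

// Sketch only: register an externally allocated, page-locked buffer with CUDA so that
// cudaMemcpy can use the fast pinned path. cudaHostRegister() is the assumed/wished-for
// entry point; hostPtr and bytes come from the special device's own allocator.
#include <cuda_runtime.h>
#include <stdio.h>

int registerExternalBuffer(void *hostPtr, size_t bytes)
{
    cudaError_t err = cudaHostRegister(hostPtr, bytes, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    // ... hostPtr can now be used in cudaMemcpy() as if it came from cudaMallocHost() ...
    return 0;
}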

I’d also like some sort of pinned/device memory manager that lets me declare all the allocations I want up front, does one big pinned host allocation and one big device allocation, and then hands back the individual host/device pointers into those blocks.
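A rough sketch of the kind of interface I mean (all names invented):

// Caller declares every buffer up front; the manager then does one big cudaMallocHost()
// and one big cudaMalloc(), and hands out host/device pointer pairs into those blocks.
#include <cuda_runtime.h>

#define MAX_BUFS 64

typedef struct {
    size_t offset[MAX_BUFS];
    size_t sizes[MAX_BUFS];
    int    count;
    char  *hostBase;   // one big pinned host allocation
    char  *devBase;    // one big device allocation
    size_t total;
} MemPlan;

int planAdd(MemPlan *p, size_t bytes)               // returns a buffer id
{
    size_t aligned = (bytes + 255) & ~(size_t)255;  // keep 256-byte alignment
    p->offset[p->count] = p->total;
    p->sizes[p->count]  = bytes;
    p->total += aligned;
    return p->count++;
}

int planCommit(MemPlan *p)                          // the only two real allocations
{
    if (cudaMallocHost((void **)&p->hostBase, p->total) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&p->devBase, p->total) != cudaSuccess)      return -1;
    return 0;
}

void *hostPtr(MemPlan *p, int id) { return p->hostBase + p->offset[id]; }
void *devPtr (MemPlan *p, int id) { return p->devBase  + p->offset[id]; }

// Usage: MemPlan p = {0}; int a = planAdd(&p, bytesA); int b = planAdd(&p, bytesB);
// planCommit(&p); then copy between hostPtr(&p, a) and devPtr(&p, a), etc.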

Potentially some memory defragging tools … dunno if they’re needed.

Also, I’d like more info on exactly how and when pinned memory is (or isn’t) freed correctly, for instance when a process holding pinned memory is terminated by a signal.

I’ve said it on this forum before, but I’ll say it again here…

My wish is for routines for matrix division: the LAPACK sgetrf/sgetrs pair, where

  SGETRS solves a system of linear equations A * X = B or A' * X = B
  with a general N-by-N matrix A, using the LU factorization computed by SGETRF.

LU factorization has been mentioned here, and there are recent research papers on the subject, but nothing I can find is actually available now. It doesn’t seem that far out of reach to put something together for this; many of the BLAS routines needed are already available. I’ve looked into it, but I don’t have the time to follow through and do it.

I suspect that the additional BLAS and LAPACK routines are in the works already, however.
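Written out, the wished-for call sequence would look roughly like this (the cusolverDn* names are illustrative assumptions modelled on the cuBLAS naming style; dA is the N-by-N matrix and dB the right-hand sides, both already in device memory, column-major like LAPACK):

// Sketch of a GPU sgetrf/sgetrs sequence against an assumed dense-LAPACK interface.
#include <cuda_runtime.h>
#include <cusolverDn.h>

void solveAXeqB(float *dA, float *dB, int n, int nrhs)
{
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    int lwork = 0;
    cusolverDnSgetrf_bufferSize(handle, n, n, dA, n, &lwork);

    float *dWork;  int *dPiv, *dInfo;
    cudaMalloc((void **)&dWork, sizeof(float) * lwork);
    cudaMalloc((void **)&dPiv,  sizeof(int) * n);
    cudaMalloc((void **)&dInfo, sizeof(int));

    cusolverDnSgetrf(handle, n, n, dA, n, dWork, dPiv, dInfo);                      // LU-factorize A in place
    cusolverDnSgetrs(handle, CUBLAS_OP_N, n, nrhs, dA, n, dPiv, dB, n, dInfo);      // solve A * X = B (B overwritten with X)

    cudaFree(dWork); cudaFree(dPiv); cudaFree(dInfo);
    cusolverDnDestroy(handle);
}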

Now that 3D texture access has been provided, what I really need is direct texture sharing between OpenGL and CUDA (without a copy through a PBO) and direct texture WRITES from CUDA.
In fact I have an application that manipulates very big 3D textures (close to the size of video memory) in OpenGL, and I would like to port certain portions of the rendering code to CUDA while keeping other portions that make intensive use of early-z, rasteriser interpolation, etc. in OpenGL. The problem is that these textures are too large to keep a second copy of them on the GPU, and that copy would also take too much time.
Are these functionalities (direct texture sharing and texture writes) planned for future releases?

PS: It seems that texture writes have been planned in PTX with the .surf state space, but are not implemented yet :(
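A sketch of the kind of no-copy sharing being asked for; the cudaGraphicsGLRegisterImage-style calls below are illustrative assumptions about what such an interop API could look like, not something in the current toolkit:

// Map an existing GL_TEXTURE_3D object so CUDA can read it in place (no extra copy).
// 'glTex3D' is assumed to be a GL texture created elsewhere by the application.
#include <cuda_gl_interop.h>

cudaArray *mapGLTexture(GLuint glTex3D, cudaGraphicsResource **res)
{
    // Register the GL texture once; a surface-write flag would additionally allow
    // CUDA-side writes into the same texture.
    cudaGraphicsGLRegisterImage(res, glTex3D, GL_TEXTURE_3D,
                                cudaGraphicsRegisterFlagsNone);

    cudaArray *array = 0;
    cudaGraphicsMapResources(1, res, 0);                        // map for CUDA access
    cudaGraphicsSubResourceGetMappedArray(&array, *res, 0, 0);  // fetch the backing array
    return array;   // bind to a texture reference; unmap with cudaGraphicsUnmapResources()
}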

Software:

  • Big wish: better register usage. This is the main reason my raycaster cannot use the card’s full performance. Perhaps a compiler switch could allow shared memory to be used as additional registers; in my case I don’t use shared memory, so that would be great. A switch to set the desired maximum number of registers would also be very helpful. The generated code would have more memory accesses, of course, but if GPU occupancy can be increased it might be worth it. Another idea would be compiler defines for manually using registers, like REG0.

  • Faster malloc/free. The current one takes about 20 ms per allocation; a conventional malloc is much faster by comparison (less than 1 ms).

  • Allow texture arrays; tex2D doesn’t allow something like tex2D(tex_array[0], … ) (see the sketch after this list).

  • The nvcc compiler does not automatically recompile if the timestamps of included header files have changed (perhaps it’s just my setup).
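To illustrate the texture-array wish above: with statically bound texture references, the closest one can get today is a hand-written switch over separate references (sketch only, two textures):

texture<float, 2, cudaReadModeElementType> tex0;
texture<float, 2, cudaReadModeElementType> tex1;

__device__ float fetchFrom(int which, float x, float y)
{
    // tex2D(tex_array[which], x, y) is what one would like to write; instead:
    switch (which) {
        case 0:  return tex2D(tex0, x, y);
        default: return tex2D(tex1, x, y);
    }
}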

Hardware:

  • More registers

  • MIMD instead of SIMD (but that probably won’t happen in the near future, I suppose…).

  • 1st and 2nd level cache for global memory

regards, Sven

--maxrregcount=xx
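For example (the kernel name is made up and the numbers are illustrative only):

// Cap the compiler at 32 registers per thread for the whole file:
//
//   nvcc --maxrregcount=32 raycaster.cu
//
// Values that no longer fit are spilled to local memory, so whether the occupancy gain
// outweighs the extra memory traffic has to be measured per kernel. A per-kernel
// __launch_bounds__() qualifier is an alternative, assuming your toolkit version has it:
__global__ void __launch_bounds__(256) raycastKernel(float *out)   // hypothetical kernel
{
    // ... body unchanged; the compiler budgets registers for 256-thread blocks ...
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}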

That’s make’s job, not nvcc’s. You have to list in your makefile all the header files that your .cu file depends on.

The GTX 260/280 has twice as many registers as the G80/G92.

How about in-kernel async device<->shared memory transfers?

A thread posts a request to “blit” say 16 words at a time between device memory and shared memory. But the threads keep executing, and can query whether or not the transfer has finished yet. The goal is to reduce the impact on latency limited apps that can intelligently prefetch data they know they’ll be needing later.

It’s kind of similar to a manually-controlled L1 cache, you can request a prefetch and you also decide when to invalidate your storage or reuse it.
This is also sort of similar to some of the control you have over a Cell SPE’s caches.
(this memory control is one of the most frustrating parts of Cell programming, yet it’s also key to really efficient computation sometimes).

I tried to fake this functionality by having the threads of one warp do my computation and post “requests” to shared memory, while another warp of threads looks for those requests and then fetches and fills from device memory. A slightly dangerous ballet… you have to poll special memory tags to see whether a change has been received or completed, and you also have to AVOID __syncthreads(). Kind of like changing the spark plugs on a running engine… definitely stretching the abstraction.

My technique worked, but unfortunately it wastes too many registers. The “memory worker” warps don’t need any, but all threads in a kernel are assigned the same number anyway. It turned out to be better to just run more threads and let the scheduler hide the latency by running more warps.

Can you not already do exactly the same thing by fetching all the memory you need at the beginning of your kernel? You can load data into shared memory, start doing the calculations you need, and only put the __syncthreads() just before you start using the shared memory that was fetched. I don’t see the difference from what you propose.
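A sketch of that pattern (assuming 256-thread blocks and an input that exactly fills the grid):

__global__ void overlapExample(const float *gIn, float *gOut)
{
    __shared__ float tile[256];

    int i = blockIdx.x * 256 + threadIdx.x;

    tile[threadIdx.x] = gIn[i];          // issue the global->shared load first

    // ... independent per-thread work that does not touch 'tile' goes here ...
    float acc = (float)threadIdx.x * 0.5f;

    __syncthreads();                     // only now wait for every thread's load

    gOut[i] = acc + tile[(threadIdx.x + 1) & 255];   // consume a neighbour's value
}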

[OFFTOPIC]
To tell you the truth, I have been glancing over some docs on the Cell (the board with two Cells on it) because a co-worker is enthusiastic about it, but to me it all looks:

  • complicated (very complicated)
  • not easy to program, especially to get good performance
  • expensive compared to a graphics card (although for GPU you need a PC around it)

And the memory bandwidth is considerably lower than a GPU’s, so even once you are running at memory-bandwidth speed, you are still left with lower performance than on a GPU (where reaching memory bandwidth looks easier to me). I think there will be algorithms that run better on Cell than on a GPU, but it will take a lot more work to make it so. Then again, I have not spent nearly as much time looking at Cell as at CUDA.
[/OFFTOPIC]

You could if you knew in advance what memory you needed… but what if you’re using it like a cache? Your data is really 50 MB, so you’re dynamically swapping data in as needed, and that can’t be predicted at the start of your kernel. Constant memory helps, but it may not be big enough, it may need to access device memory that the kernel itself has already set up, and sometimes you need to go the other way and build up data in shared memory and then dump it out.

The main idea is to avoid __syncthreads(), especially when only a few threads need data. If every thread needed data, then you might as well just use the whole warp for fetching. But with caching, you often already have the data 95% of the time, and prefetch hinting can push that to 99.9%.

CUDA’s design is really elegant because with enough warps the memory latency is smoothly hidden from you. But unfortunately, register limits often prevent you from just loading up on more threads. Tradeoffs, tradeoffs, tradeoffs, always!

Yes, yes, and yes. These are all true. Though you can buy a PS3 and get quite a reasonable compute box… that bang-for-buck was incredible two years ago.

The SPEs are POWERFUL, but so finicky that you really need to plan very, very detailed strategies to use them effectively. CUDA has its own large set of details to learn, but SPE programming is twice as complex, and the dev tools are not nearly as nice.

I’ve only dealt with Cell coding once, under contract from a game company to implement some algorithms for them. Very interesting just for the change of viewpoint.

A Reference Manual for the CUTIL.H macros and functions! I don’t know whether this has been done in Beta 2 or not…

Are there any tools for enumerating CUDA devices and testing and benchmarking them?
I wish there were. :angel:
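The enumeration part, at least, is already easy with the runtime API; a minimal sketch (benchmarking would still have to be layered on top):

// Enumerate CUDA devices and print a few properties for each.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d CUDA device(s) found\n", count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute %d.%d, %d multiprocessors, %lu MB global memory\n",
               d, prop.name, prop.major, prop.minor, prop.multiProcessorCount,
               (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
    }
    return 0;
}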

There are several bit-level operations that would be useful. A circular shift is used a LOT in cryptography, so it would be an efficient primitive to have. However, it would only save about 3 operations, since you can already do it as a = (b >> x) | (b << (32 - x)), so this isn’t a crucial ability.
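For reference, the software rotate as a device function (a sketch; valid for shift amounts 1 through 31):

// Software circular shift (rotate right); shifting a 32-bit value by 32 is undefined
// in C, so x must stay in 1..31 (or be masked) for the general case.
__device__ unsigned int rotr32(unsigned int b, unsigned int x)
{
    return (b >> x) | (b << (32 - x));
}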

But another primitive is trivial in hardware yet quite involved in software: bit reversal. Bit reversals are extremely useful for hashing and pseudo-random number generation, but rarely used because they’re so expensive. Binary FFTs ALSO use bit reversals, though I don’t know how much of a bottleneck that is.

There are several software methods for reversing bits, including lookup tables, but tables don’t map well to CUDA’s memory architecture. So currently we use:

unsigned int reverseBits(unsigned int v)
{
    v = ((v >> 1) & 0x55555555) | ((v & 0x55555555) << 1);   // swap adjacent bits
    v = ((v >> 2) & 0x33333333) | ((v & 0x33333333) << 2);   // swap consecutive pairs
    v = ((v >> 4) & 0x0F0F0F0F) | ((v & 0x0F0F0F0F) << 4);   // swap nibbles
    v = ((v >> 8) & 0x00FF00FF) | ((v & 0x00FF00FF) << 8);   // swap bytes
    return (v >> 16) | (v << 16);                            // swap 16-bit halves
}

That’s 23 steps! You can imagine how nice it’d be to do this in a single one-clock opcode!
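If the hardware ever grows such an opcode, the whole routine above collapses to a single call; a sketch, assuming a __brev()-style bit-reverse intrinsic is exposed:

// Assumed intrinsic: bit-reverse a 32-bit word in one instruction.
__device__ unsigned int reverseBits32(unsigned int v)
{
    return __brev(v);
}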

OK, more brainstorming. Inspired by the very useful warp voting primitives, I wonder if the hardware could do similar warp-wide computations?

The most useful would be a warp tally accumulator: the threads that call it are counted, and each thread is returned a count of the number of LOWER-numbered threads in the warp that also called the tally.

The accumulated count makes the results immediately useful for efficient compaction of data or jobs. As an example, 32 threads do some computation and each thread comes up with a result. It may be “yes, this ray hit something”, or it may be “this pixel shows a UFO and must be processed again.” The threads with interesting values do a warp-accumulated tally, and each of those threads then writes the result it’s holding into an array at the index given by the warp-tally-accumulator call. The written array ends up as a compacted array of all the warp’s results. Useful for accumulating result values during searches. Also useful for reordering thread JOBS when further processing is needed on just some of the data: threads can read the now-compacted work list and start further evaluation, and the compaction means it’s much more likely for the top half-warp to be empty and therefore unscheduled.

The compaction could be repeated with each new round of tests… there would just be an offset added to accumulate a large list with no holes.

This could all be done now using a spare shared variable and atomic increments, though that does not preserve order and would likely take many clocks… in the worst case obviously at LEAST 32. Cheap warp-wide compaction would make it easy enough to use casually.
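A sketch of how the tally could be emulated, assuming a warp-wide __ballot() vote intrinsic (an assumption here, not something today’s parts offer; the shared-counter/atomicAdd route above is the current fallback):

// Count how many lower-numbered lanes in the warp also voted (1-D block assumed).
__device__ unsigned int warpTally(int hasResult)
{
    unsigned int lane  = threadIdx.x & 31;          // lane index within the warp
    unsigned int votes = __ballot(hasResult);       // bit i set if lane i voted
    return __popc(votes & ((1u << lane) - 1u));     // lower-numbered voters only
}

// Compaction use (out and base are whatever compacted buffer/offset you maintain):
//   if (hasResult) out[base + warpTally(1)] = myResult;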

Can you give an example of how and when to use voting? I have not grasped its usefulness yet, so I’m afraid I may be missing places where I could use it in my programs…

math functions for double precision

What gives you the idea they don’t exist? It would be pretty pointless to provide double support without math functions that work on them…

Wishlist:

LAPACK-like eigendecomposition routines.

That’s pretty much it right now.

--device-emulation releasing threads in random order from __syncthreads(), to help detect errors. Randomized block execution order might also be nice.

Two simple documentation enhancements, either of which will eliminate a huge hassle for me:

  • Add an index to the PTX ISA and Reference PDFs (or at least clickable chapter links). The Programming Guide has one, and it makes using it significantly faster; I spend maybe ten seconds finding something in the Guide, and about a minute in the other two. This made learning PTX a pain.

  • Alternatively, publish an HTML format of the guides.

Thanks for considering this.