Wishlist: place your considered suggestions here

Well, sorry to say so to an NVIDIA employee, but no, that is not true. Wumpus found out why a kernel of mine was using local memory despite a completely vanilla nvcc command line (and quite a low register count). This is what he said:

The problem seems to be in this function, provided by NVIDIA in /usr/local/cuda/include/math_functions.h:

/* reduce argument to trig function to -pi/4...+pi/4 */
__func__(float __internal_trig_reduction_kernel(float a, int *quadrant))

It uses a locally defined array of 7 integers.

I will not be able to check until tomorrow (on Vista here…) but I tend to believe him, especially since changing my sincosf to __sincosf gives me 0 lmem required.

Well, nothing prevents me from installing the Toolkit on Vista, of course :)

So here are the dirty details:

/* reduce argument to trig function to -pi/4...+pi/4 */
__func__(float __internal_trig_reduction_kernel(float a, int *quadrant))
{
  float j;
  int q;
  if (__cuda_fabsf(a) > CUDART_TRIG_PLOSS_F) {
    /* Payne-Hanek style argument reduction. */
    unsigned int ia = __float_as_int(a);
    unsigned int s = ia & 0x80000000;
    unsigned int result[7];

So using sinf, cosf and friends gives you local memory use…

This use of local memory is by design and will not change. The trig
reduction has a fast path for small arguments and a slow path for large
arguments. Local memory is used to reduce the register usage in the slow
path, which is “fatter” than the fast path.
So this is beneficial for overall performance.

I apologize for posting incorrect information; the driver guy spoke out of turn.

I will strive to do better in the future.

@nwilt: Sorry, I guess my words came across harsher/differently than they were meant. I meant that normally I prefer not to disagree with NVIDIA people (to minimize my number of ‘being wrong in public’ moments ;)). The perils of translating expressions from another language.

@mfatica: I think that if this is by design it certainly deserves a mention in the programming guide, since the local memory is always reported, even when in reality you will never enter the slow path (including a mention of where the cutoff point between the slow and fast paths lies).
Maybe you could even provide a fatsinf that does not use local memory? I, for instance, have another (very big) kernel where I do some sincosf’s at the beginning. Those might use up a few extra registers early on, but they would not raise my register count, since I need the extra registers later anyway.
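Purely to illustrate the tradeoff discussed above (a minimal sketch; the kernel and names are invented here): any call to sincosf() makes nvcc reserve the local-memory scratch for the slow path even if that path is never taken at run time, so the only way today to report 0 lmem is to use the reduced-accuracy intrinsic unconditionally.

__global__ void trig_demo(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float s, c;
    // sincosf(in[i], &s, &c);   // full accuracy, but the kernel reports lmem
    __sincosf(in[i], &s, &c);    // fast intrinsic: no lmem, lower accuracy,
                                 // acceptable only if |in[i]| stays small
    out[i] = s + c;
}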

Compiler-assisted memory coalescing.

Since memory coalescing:

  • is critical for performance,
  • takes a lot of time to do manually, and
  • is, ultimately, pretty much a deterministic process,

I would think that just about every CUDA program and every CUDA programmer would benefit from an optimization program that takes CUDA code with non-coalesced memory accesses and coalesces them for the programmer.

I think this is better done in a library implementing an abstract data type that handles coalescing transparently. I’m implementing one right now, but I can’t guarantee it will be of “release quality”, so even I would be more than happy if NVIDIA supplied one.
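As a rough idea of what such an abstract data type could look like (a minimal sketch under my own assumptions, not the library mentioned above): store an array of 3-component vectors as three separate float arrays, so that consecutive threads reading element i = tid touch consecutive addresses and the accesses coalesce, while the caller still sees one logical vector per element.

struct Vec3 { float x, y, z; };

// Structure-of-arrays wrapper: three device arrays of length n.
struct CoalescedVec3Array {
    float *x, *y, *z;

    __device__ Vec3 load(int i) const {
        Vec3 v;
        v.x = x[i];   // thread tid reads x[tid]: coalesced
        v.y = y[i];
        v.z = z[i];
        return v;
    }

    __device__ void store(int i, const Vec3 &v) {
        x[i] = v.x;   // same pattern for the stores
        y[i] = v.y;
        z[i] = v.z;
    }
};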

The VPU is dedicated hardware; if I remember correctly, in G80 it was off-chip (inside NVIO) because it couldn’t fit.

http://www.beyond3d.com/content/reviews/11/8

So not only could they expose IDCT, but also AES. Why would I want to reinvent the wheel?

I just want to prompt the NVIDIA developers to implement “zero copy” between DMA devices.

One application of the CUDA framework is processing big data streams in real time, for example image processing. Often this data is already available in DMA space from the source (e.g. a FireWire cam).

You waste a lot of CPU resources if you don’t use zero copy, and it’s easy to implement (with Linux 2.6 :-) ).

That would be a real push forward for real-time processing.

Thank you!

I’d like to have fast OpenGL interoperability with multiple GPUs.

This could be implemented, for example, by supplying a call:
cudaSetDevice(GLXContext *con)

where con is a pointer to a GLX context that could share buffers with this CUDA instance without copying data through host memory (as it currently does).

– Samuli
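For reference, the current interop mechanism goes through the cudaGL* registration/mapping calls; a rough sketch of that path (the buffer name is illustrative and error checking is omitted):

#include <GL/gl.h>
#include <cuda_gl_interop.h>

// `pbo` is an OpenGL buffer object created and filled elsewhere (assumption).
void process_gl_buffer(GLuint pbo)
{
    cudaGLRegisterBufferObject(pbo);             // make the GL buffer visible to CUDA

    float *d_ptr = 0;
    cudaGLMapBufferObject((void**)&d_ptr, pbo);  // get a device pointer to it

    // ... launch kernels on d_ptr here ...

    cudaGLUnmapBufferObject(pbo);                // give the buffer back to GL
    cudaGLUnregisterBufferObject(pbo);
}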

Hello,
when is CUDA support for Windows Server 2003 x64 planned?
wb,
Boris.

We shipped support for Win64 in CUDA 1.1.

Hardware request for statistical computing applications: a hardware random number generator with a Gaussian normal distribution.

I know this is a hardware wish and it could potentially take up a lot of circuit area, but for high-performance statistical computing there are a few things that are done often and are very compute-intensive, and I think hardwiring them could result in a large performance improvement. The most important of these would be:

  • random number generation (gaussian normal distribution (zero mean, unit variance))
  • random number generation (flat distribution, from 0 to 1)

In AI, machine learning, data mining, statistical regression, singular value decomposition, and the like, noise is often added to the inputs to help prevent over-fitting. Although this provides over-fitting resistance not offered by other methods and can therefore lead to better predictions, it is currently very computation-intensive, especially compared to other methods, because so many random numbers have to be generated - ideally as many as one for every multiplication and addition (as often as a regularization parameter or learning-rate parameter is used).

Thus, for statistical computing, a time-balanced chip (one in which the time it takes to perform an operation is inversely related to how often that operation is performed, as in entropy encoding: http://en.wikipedia.org/wiki/Entropy_encoding) would generate a random number in about the same time it takes to multiply two floats together.

One could employ a random physical process like the decay of a radioactive particle to make a true-random number generator that uses very little circuit area, although I imagine this would be much easier said than done, esp. via lithography.

I think a hardware random number generator would be the most significant contribution to statistical computing, as far as operators (instruction sets) go. It would provide a huge performance increase to common problems in statistical computing.
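For a sense of what this costs in software today, a minimal Box-Muller sketch (mine, purely illustrative): two uniform samples in (0,1] are turned into two independent standard-normal samples, at the price of a log, a square root and a sincos per pair.

// u1 must lie in (0,1] so that logf(u1) is finite; where the uniforms come
// from (some per-thread generator) is left out of this sketch.
__device__ float2 box_muller(float u1, float u2)
{
    float r = sqrtf(-2.0f * logf(u1));
    float s, c;
    sincosf(2.0f * 3.14159265f * u2, &s, &c);
    return make_float2(r * c, r * s);   // two independent N(0,1) samples
}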

I’d like to have more control over the running kernel:

  1. Some way to find out the progress of kernel run.
  2. Ability to terminate the kernel from the host side (say, run kernel, wait for 1 second, if it is still not finished - force the GPU to terminate it).

Also, it would be nice if a kernel with an infinite (or very long) loop inside wouldn’t hang WinXP.
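Until something like this exists, a common workaround is to split the work into many small launches and decide between launches whether to keep going, which gives coarse progress reporting and a crude time budget. A host-side sketch under my own assumptions (the kernel and its launch configuration are placeholders):

#include <cuda_runtime.h>
#include <ctime>

__global__ void do_chunk(int chunk) { /* one slice of the real work */ }

void run_with_budget(int num_chunks, double budget_seconds)
{
    std::clock_t start = std::clock();
    for (int chunk = 0; chunk < num_chunks; ++chunk) {
        do_chunk<<<64, 256>>>(chunk);     // one small launch per slice
        cudaThreadSynchronize();          // wait for this slice only
        double elapsed = double(std::clock() - start) / CLOCKS_PER_SEC;
        if (elapsed > budget_seconds)
            break;                        // stop between launches; the GPU is idle here
    }
}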

writable cached memory + cache flush mechanism

A type of memory that is cached and can be written to, but where a read after a write is not guaranteed to reflect the latest write until a cache flush is performed (because a cache hit returns the old value). It would also be nice if only part of the cache could be flushed.

or, in the alternative,

the ability to have a section of memory marked as constant/texture in one kernel and global in another, such that in the former kernel it’s cached and read-only and in the latter kernel it’s uncached and writable.

The memory that a texture is bound to is already writable if you use the global memory pointer, so this is already possible. It might even be possible to write to constant memory from a kernel by passing it the address from cudaGetSymbolAddress, but of that I am not sure.

<1> Let the programmer determine which SP of which multiprocessor (and which multiprocessor) is used for computing
<2> More of the C++ spec supported
<3> Avoid making features increasingly complex
<4> Embedded PTX code, e.g.:

__global__ void kernel( … )
{
    Ptx {
    }
}

<5> A GPU operating system, so we can leave Windows and Linux (or whatever else) behind :biggrin:
<6> The most important of all: the faster the better

Then how would you set this up; could someone provide a code sample of this (that they’ve tested and know works)?

my guess would be something like:

__device__ __constant__ float p1[100];
__device__ float* p2 = p1;

kernel1<<<…>>>(p1);
kernel2<<<…>>>(p2);

but does this work? And if you can do this kind of thing, why isn’t it in the manual?

Hmm, you did not check the manual maybe?
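For what it’s worth, here is a hedged sketch of the texture-bound-to-linear-memory approach mentioned a few posts up (not the constant-memory guess above, and not a tested sample from the manual). One kernel writes through the raw global pointer, another reads the same memory through the texture cache; the writes only become visible to the texture path on the next kernel launch, since the texture cache is not kept coherent with global writes within a launch.

#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> tex_data;   // bound to linear memory

__global__ void write_kernel(float *data, int n)       // writable, uncached view
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = (float)i;
}

__global__ void read_kernel(float *out, int n)         // cached, read-only view
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(tex_data, i) * 2.0f;
}

void example(int n)
{
    float *d_data, *d_out;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMalloc((void**)&d_out,  n * sizeof(float));
    cudaBindTexture(0, tex_data, d_data, n * sizeof(float));

    int threads = 256, blocks = (n + threads - 1) / threads;
    write_kernel<<<blocks, threads>>>(d_data, n);   // fill via the global pointer
    read_kernel <<<blocks, threads>>>(d_out, n);    // read the same memory via the texture

    cudaUnbindTexture(tex_data);
    cudaFree(d_data);
    cudaFree(d_out);
}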

Well, this is my wish list (from my ray-tracing perspective :play_ball: … which may or may not be good for other usage):

  1. More texture cache (from 6-8 KB to 1 MB). That would allow better traversal of a tree structure with severe divergence.

  2. Much more shared memory (from 16 KB to 256 KB). With that I could keep a 64-level-deep tree stack there instead of placing it in device memory. Alternatively, I could use a hardware function-call stack and recursive function support.

  3. Much more constant memory (1 MB).

  4. Kill the “warp” concept. Give us 512 true cores (I know, it’s hard) to perform divergent branching and operations. SIMD == pain for incoherent/divergent threads. Make completely scalar multiprocessors… and forget SIMD/warps…

  5. Implement the “register” C keyword to keep the PTX compiler from using local device memory for the things I want placed in registers. Currently the compiler tries to reduce the number of used registers by moving some data to device memory… which can be bad if my kernel is not limited by registers/threads (it can be limited by memory or cache instead)…

  6. Add a “local” keyword to store data in device local memory (I know you can force it using arrays, but a local syntax would be much better).

  7. DX10 buffer support (and let me remind you that WinXP is not sold in shops anymore… so WinXP/DX9 is dead).

  8. Add a new function (cudaGetVersion()) to tell whether CUDA 1.0, 1.1 or 2.0 is present.

  9. Provide cudart_md.lib, cudart_mt.lib, cudart_mdd.lib and cudart_mtd.lib libraries so we won’t have to fight with /NODEFAULTLIB and C CRT memory problems.

  10. Add VS2008 support… and a .NET WinForms CUDA example so you can be 100% sure the deployed libraries are 100% CLR/.NET-STL compatible (which, currently, they aren’t, due to C CRT conflicts).

So you want an instruction decoder & ALU running at full speed? But when you have more than one instruction decoder per multiprocessor, I think you cannot have __syncthreads() anymore, so no more shared memory…