Wishlist: place your considered suggestions here

I use “flattened” textures now. True arrayed 2D textures (or 3D textures) would make my code simpler, more readable, easier to debug, and easier to maintain. They would also reduce register usage: clamping to the boundaries of a flattened texture requires extra tests and extra registers, whereas with a texture array I could just let the hardware do it.
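A minimal sketch of what that clamping costs today (all names are made up): N 2D slices of size W x H flattened into one linear texture, with the clamp done by hand on every fetch; a real texture array would handle this through its address mode instead.

texture<float, 1, cudaReadModeElementType> flatTex;   // flattened slices, bound elsewhere

__device__ float fetchClamped(int slice, int x, int y, int W, int H)
{
    // manual clamp-to-edge: one min/max pair per coordinate, extra registers per fetch
    x = min(max(x, 0), W - 1);
    y = min(max(y, 0), H - 1);
    return tex1Dfetch(flatTex, (slice * H + y) * W + x);
}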

Global synchronization across blocks of threads, which would make it possible to run CUDA functions consecutively without calls from the host, especially for processing data flows on the GPU.

I think cudaThreadSynchronize() does that from the host, but why don’t we have something similar inside device functions?
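For reference, the host-side workaround today looks roughly like this (kernel names and sizes are placeholders): launches issued from the host execute in order on the device, so each stage sees the previous stage's results, but every hand-off still goes back through the host.

__global__ void stage1(float *d) { /* ... */ }
__global__ void stage2(float *d) { /* ... */ }

int main()
{
    float *d_data;
    cudaMalloc((void**)&d_data, 1024 * sizeof(float));

    stage1<<<8, 128>>>(d_data);   // consecutive launches run back to back,
    stage2<<<8, 128>>>(d_data);   // no explicit sync needed between them
    cudaThreadSynchronize();      // host blocks until the whole chain is done

    cudaFree(d_data);
    return 0;
}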

Is support for larger memory, or for virtual memory, possible?

Improve CUDA for large servo loop applications:

  • low latency (<5 µs) for host<->device data transfers
    Zero-copy transfers from/to another PCIe I/O board to/from device memory would be cool.
    Some kind of remote DMA from/to a distant node to/from device memory would be great.

  • low latency (<5 µs) for periodic launches of the same kernel
    (or some way to quickly (re)start a kernel when fresh data is available)

  • adding some I/O capabilities to CUDA devices.
    Question: is it possible to use the DVI or SLI ports for generic data I/O?

Anyway, CUDA is already great!

Stream memory transfers between system and device memory, i.e. implement a mechanism to automatically prefetch data while the GPUs are processing and write results back to main memory as they complete, asynchronously (using DMA, I imagine).

I read that the 8x series GPUs can’t execute a kernel and do system-device memory transfers at the same time. Thus, one has to copy the data from system memory to the device, run the kernel, then copy the results back. During the memcpys the GPU processors are idle (wasted), and vice versa.

Of course, doing both operations simultaneously may be complicated, because you don’t want to overwrite device memory that a processor is using. However, when both reads and writes are sequential (that is, they would be sequential if the code weren’t multi-threaded), which is very common in data-parallel processing, avoiding these situations is straightforward. In such situations, memory can be transferred sequentially, both into and out of device memory, while the GPUs are running. In other words, device memory (or part of it) would operate much like a set of buffers.

Furthermore, what happens when the device does not have enough memory to store all the data that needs to be processed? Without streaming memory operations, the data has to be operated on in pulses: transfer in, process, transfer out; transfer in, process, transfer out; and so on. With a data-streaming mechanism, the data could be operated on continuously, and the total processing time, instead of being the sum of the memory-to-device transfer time, the GPU processing time, and the device-to-memory transfer time, would be just the largest of the three.
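For concreteness, the pulsed pattern I mean looks roughly like this (kernel and buffer names are invented; error checking omitted). Each iteration serializes copy-in, compute, and copy-out, so the GPU sits idle during the copies and the bus sits idle during the kernel:

__global__ void processChunk(float *out, const float *in, int n);

void processPulsed(float *h_out, const float *h_in, int numChunks, int chunkElems,
                   float *d_in, float *d_out)
{
    size_t chunkBytes = chunkElems * sizeof(float);
    for (int c = 0; c < numChunks; ++c) {
        cudaMemcpy(d_in, h_in + c * chunkElems, chunkBytes, cudaMemcpyHostToDevice);
        processChunk<<<(chunkElems + 255) / 256, 256>>>(d_out, d_in, chunkElems);
        cudaMemcpy(h_out + c * chunkElems, d_out, chunkBytes, cudaMemcpyDeviceToHost);
    }
}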

Since system-device memory operations are a major bottleneck with this technology, implementing a prefetch (and write-back on completion) mechanism could greatly increase performance (potentially by up to 3x) while using up little die space.

This is a feature of CUDA 1.1 and CUDA devices with compute capability 1.1…

Which reminds me of my wish: A compute capability 1.1 device that has at least the same specs as the 8800 GTX, especially the 384 bit wide memory bus. :)

It would be great if a multiprocessor could additionally “host” a dedicated processor for sequential processing.

Why I say this is: very often we have boundary situations where only one thread needs to do some sequential processing. For example:
"
if (threadIdx.x == 0)
{

}

OR

if (threadIdx.x & (WSIZE-1) == 0) /* 1st element of every WARP */
{

}
"

Since a WARP is productive only when all 32 of its threads are actively processing data, a situation like the above can really put the brakes on processing speed.

In the case where only one thread in a WARP is executing, we are running at a crawl compared to the CPU.

Thus, WARPs should be able to delegate sequential code to a sequential processor and either wait until it completes or resume other work in the meantime. I would like the sequential processor, too, to run at 1.35 GHz, with a faster instruction issue rate.

This would make sure that a CUDA program running on even a single multiprocessor WILL ALWAYS be an order of magnitude faster than the CPU program.

Ah, I see now, section 5.4.2.4 of http://developer.download.nvidia.com/compu…_Guide_1.1.pdf:

Stream Management

The functions from Section D.3 are used to create and destroy streams and determine whether all operations of a stream have completed or not.

The following code sample creates two streams:

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

Each of these streams is defined by the following code sample as a sequence of one memory copy from host to device, one kernel launch, and one memory copy from device to host:

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    myKernel<<<100, 512, 0, stream[i]>>>
        (outputDevPtr + i * size, inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
cudaThreadSynchronize();

Each stream copies its portion of input array hostPtr to array inputDevPtr in device memory, processes inputDevPtr on the device by calling myKernel(), and copies the result outputDevPtr back to the same portion of hostPtr. Processing hostPtr using two streams allows for the memory copies of one stream to overlap with the kernel execution of the other stream. hostPtr must point to page-locked host memory for any overlap to occur.

Not exactly what I was envisioning. This is still pulsed operation, but with multiple (though few, in this example) asynchronous pulses. I can see how this could deliver the same performance improvement as a truly continuous stream. However, it has some limitations not present in the design I idealized:

  1. The “streams” are of fixed size, i.e. they are not truly “streams”. By “streaming” I mean some kind of queue/FIFO operation of possibly indefinite length/duration, like a print queue or an internet socket.

  2. “hostPtr must point to page-locked host memory for any overlap to occur”: one advantage of the data-streaming I’m suggesting is that host memory would NOT need to be page-locked; it could presently be swapped out to the hard disk, and the operating system would have to swap it back in. Thus, with the data streaming I’m suggesting, you would be limited neither by on-device memory size nor by system physical memory size, but only by system virtual memory size. (I’ve noticed that some wishlists included mention of “virtual memory”.) You wouldn’t even have to be limited by that: at the OS level, you could be connected to a file stream (such as a character device file) or even a network stream, or (at a lower level?) the stream could cycle through a finite dataset, applying the kernel iteratively. In any case, the device would simply notify the CPU when results are ready, and the CPU would notify the device when new data for a given kernel instance is ready (and the device would signal the CPU when it is done copying it).

  3. It doesn’t enable plumbing (i.e. arbitrary data piping). By this I mean that with the current implementation you can’t simply plug an output stream into an input stream by saying something like “connect a_out to b_in”. This would be really cool: the ability to plug an output of one kernel directly into an input of another kernel, potentially without having to use system memory or even the system bus for the transfer (just transfer from write-memory to read-memory); the ability to stream data back and forth among kernels on the same device, kernels on different GPGPU devices, and system (virtual) memory (which could then plumb to other machines). A virtual, extensible, live kernel network.

Ultimately, I would like to see “plumbing” that can operate between kernels (and reductions), GPGPU devices, and system (virtual) memory.

(I know this is largely a hardware feature request for nvidia gpgpu devices. I only put it here because I can’t find any hardware feature request thread.)

I honestly don’t think you are requesting any hardware features.

You state yourself that the hardware is already capable of what you want.

The things you suggest would be nice, I agree. But in my opinion they must be implemented in layers above what we currently have.

@ 1. The fixed size of streams can be circumvented by putting a layer on top of the current API that hides the fixed size.

@ 2. This can be hidden by additional layers. I can imagine a library that provides means of creating a stream and internally utilizes a small page-locked memory space for staging and transfer.

@ 3. Again, implementation related.

I think you can create all these things yourself.
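As a rough illustration of that layering (all names are invented; error checking omitted): a user-level "stream" can chop an arbitrarily long input into fixed-size chunks, stage each chunk through a small page-locked buffer, and rotate the chunks over two CUDA streams so the copies of one chunk overlap the kernel of the other. The source and destination arrays themselves can stay in ordinary pageable (even swapped-out) memory.

#include <cstring>

__global__ void processChunk(float *out, const float *in, int n);

void streamProcess(const float *src, float *dst, int numChunks, int chunkElems,
                   float *d_in[2], float *d_out[2])
{
    size_t bytes = chunkElems * sizeof(float);
    cudaStream_t stream[2];
    float *pinIn[2], *pinOut[2];
    int pending[2] = { -1, -1 };                  // chunk still in flight per slot

    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMallocHost((void**)&pinIn[i],  bytes);
        cudaMallocHost((void**)&pinOut[i], bytes);
    }

    for (int c = 0; c < numChunks; ++c) {
        int s = c & 1;                            // alternate the two slots
        cudaStreamSynchronize(stream[s]);         // wait for this slot's previous chunk
        if (pending[s] >= 0)                      // drain its result to pageable memory
            memcpy(dst + (size_t)pending[s] * chunkElems, pinOut[s], bytes);

        memcpy(pinIn[s], src + (size_t)c * chunkElems, bytes);   // stage the next chunk
        cudaMemcpyAsync(d_in[s], pinIn[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        processChunk<<<(chunkElems + 255) / 256, 256, 0, stream[s]>>>(d_out[s], d_in[s], chunkElems);
        cudaMemcpyAsync(pinOut[s], d_out[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
        pending[s] = c;
    }

    for (int i = 0; i < 2; ++i) {                 // flush whatever is still in flight
        cudaStreamSynchronize(stream[i]);
        if (pending[i] >= 0)
            memcpy(dst + (size_t)pending[i] * chunkElems, pinOut[i], bytes);
        cudaFreeHost(pinIn[i]);
        cudaFreeHost(pinOut[i]);
        cudaStreamDestroy(stream[i]);
    }
}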

I don’t think there are truly continuous streams in computers. Even a print queue or a socket accumulates a job/packet before sending it out. As was mentioned above, you could write that kind of abstraction yourself.

As far as transfers from non-locked CPU memory go, DMA needs a physical address; otherwise the CPU gets involved. The CPU is the factor limiting non-locked memory transfers.

Paulius

Certain GPUs can transfer data between host and device memory and do kernel computation at the same time. Streams can take advantage of such GPUs by overlapping the two. I am sure streams can save a lot of time.

Even in the case of job queuing on printers, it makes sense because the host machine (the one issuing the print request) need NOT wait for a prior request to complete. Thus there is an overlap of job queuing and job processing on the printer side, resulting in superior performance.

It is the same with “command queuing” in the SCSI world. In SCSI, command queuing lets the disk process commands intelligently by ordering the jobs in hand to get the best performance out of them.

I meant to say “continuous” streams. My post above is corrected.

Paulius

I did some stuff using async streaming with CUDA: while the CPU is uploading wavelet subbands to page-locked memory, the GPU is copying them to their proper place in the image. After each subband, cudaMemcpyAsync is called. After all the subbands, the full wavelet transform is done on the GPU, while the CPU is decoding other stuff again. This kind of overlapping works very well.
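The pattern looks roughly like this (names here are hypothetical, not my actual code): the CPU decodes each subband into page-locked memory and immediately queues an asynchronous copy to the subband's offset in the device-side image; once all subbands are queued, the transform kernel is launched, and the CPU goes on decoding while the copies and the kernel drain in order on the GPU.

__global__ void inverseWaveletTransform(float *image, int width, int height);

int  subbandElems(int s);                         // hypothetical helpers
int  subbandOffset(int s);
void decodeSubbandOnCPU(float *dst, int s);

void decodeFrame(float *d_image, float *pinned /* page-locked staging */,
                 int numSubbands, int width, int height)
{
    for (int s = 0; s < numSubbands; ++s) {
        int off = subbandOffset(s);
        decodeSubbandOnCPU(pinned + off, s);      // CPU work into pinned memory
        cudaMemcpyAsync(d_image + off, pinned + off,
                        subbandElems(s) * sizeof(float),
                        cudaMemcpyHostToDevice, 0);   // returns immediately
    }
    inverseWaveletTransform<<<dim3(width / 16, height / 16), dim3(16, 16)>>>(d_image, width, height);
    // no cudaThreadSynchronize() here: the CPU continues decoding other data
    // while the queued copies and the kernel finish on the GPU.
}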

The G84+ can also do host->GPU transfers and GPU computation at the same time; I have not tried that yet, as my code needs to be compatible with the whole 8xxx series.

Ah, this is the wishlist topic, so what about direct mapping of the framebuffer to CUDA memory? I know almost for sure it’s possible hardware-wise; it would be little more than some remapping magic in the CUDA library.

On that note, I took the first step in that direction: I realized that in order to really do it right, one needs a language in which one can describe how streams of data are to flow, so I wrote a specification for a data-flow definition language. I put it online here: http://www.wikinfo.org/index.php/DFDL for anyone interested. It’s an extension of C++ with some similarities to BrookGPU, but in my opinion it’s more powerful, flexible, and concise. None of it is implemented; it’s just a spec/proposal. For brain-food, or for the ambitious.

I think what I really want to see (and this IS hardware) is for the GPU to generate an interrupt after it has written so many bytes to GPU memory, or read so many bytes from GPU memory, so that the host could then asynchronously commit the writes to main memory to free up memory on the GPU, or send the GPU more data to read (overwriting data that has already been read). Now, a process may need to read a byte a few times before it is done with it, so this would need to be taken into account.

That is, I think an asynchronous signaling mechanism via interrupts would be nice. If this is already done, then nevermind.
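As far as I know nothing like that is exposed; the closest thing in the current API is polling a stream from the host, something like this sketch (names are placeholders, and it is polling rather than the interrupt I am asking for):

__global__ void produceResults(float *out);

void launchAndPoll(float *h_result, float *d_result, size_t bytes, cudaStream_t stream)
{
    produceResults<<<64, 256, 0, stream>>>(d_result);
    cudaMemcpyAsync(h_result, d_result, bytes, cudaMemcpyDeviceToHost, stream);
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        // do other host work here; the host is never blocked
    }
    // all queued work on the stream has finished: h_result is valid and
    // d_result can safely be overwritten with the next batch
}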

Another hardware item.

Often the memory access pattern is known a priori. It would be nice if the programmer could tell this to the cache controller, so that the cache controller could asynchronously prefetch the upcoming memory accesses, reducing cache misses, in many cases, down to zero.

I’m thinking that one could associate a few numbers with each kernel (or each streaming memory access of each kernel) that represent the memory access pattern, and these numbers could be sent to the cache controller when the kernel is invoked. Then, as the kernel requests data from the cache controller, the cache controller would use these numbers to figure out what memory addresses to prefetch, and do so.

Some pseudo-code for a possible memory access pattern, defined by the numbers (i0,i1,s1,i2,s2,i3):

data_type *p;
for (int j = 0; j < s2; j++)
    for (int i = 0; i < s1; i++)
        read( p[(i*i1 + j*i2)*i0] );
p += i3*i0;

Two related papers I found that might provide some background:
http://www.cs.utah.edu/classes/cs7940-010-rajeev/spr05/papers/lee03.pdf
http://portal.acm.org/citation.cfm?id=605462&dl=GUIDE&dl=ACM

Hello, and thanks for CUDA.

My wish is support for Quadro FX 3500.

It is very disappointing that this €1,100 card is not supported.

Is it possible that it will be supported some day?

Thanks.

No, the Quadro FX 3500 has a G7x GPU.

CUDA needs G8x or G92 GPUs.

Ok, thanks for the info mfatica.

bye bye CUDA … hello RapidMind (www.rapidmind.net)

After buying my 8800 GTS and waiting for 11 months, it is safe to say my biggest wish is that Vista would be supported. I am very close to buying another hard drive and installing Linux on it just so I can develop my CUDA application. Very inconvenient.