CUDA 2.2 beta features

Two questions about the 2.2 beta:

  • Is it possible to use the current implementation of cuFFT with memory allocated as zero-copy memory? I assume you just pass it memory allocated as pinned host memory rather than device memory, right?
  • Have kernel launch times been reduced under Vista at all? I seem to remember them being quite a bit higher than under XP; since the async copy problem has been solved, maybe this has been too…

I’m assuming a GTS250 can do zero-copy, being part of the 200 series; that’s making an upgrade seem much more tempting now :)

Thanks
Matt

Actually, it sounds like a compute capability 1.1 device:

http://forums.nvidia.com/index.php?showtopic=91511

Model number selection at NVIDIA is frustratingly capricious and confusing at times…

The GTS250 may be part of the 200 series, but it’s still based on the older G92 chip, so it’s really not quite the same as the rest of the hardware in the series. I would imagine that any of the GTX boards will do though (and GTX260-216 cards are available for under $200 now).

GTS 250 indeed does not support zero-copy because it’s G92-based. More confusingly, MCP79 is Compute 1.1 but supports zero-copy and copy elimination, while the earlier MCP chip (the one in 780a) doesn’t. There are two new device properties (not sure if they’re in the beta, they’re definitely in final) to determine whether a device is capable of zero-copy and copy elimination (the latter is basically a test for MCP79 and implies the former).
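A quick sketch of the check, assuming the field names from the final release (canMapHostMemory and integrated; I’m not certain the beta headers match):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        /* canMapHostMemory: device can map page-locked host memory (zero-copy) */
        /* integrated: device shares physical memory with the host (MCP79-style),
           which is what makes copy elimination possible */
        printf("zero-copy capable:       %d\n", prop.canMapHostMemory);
        printf("integrated (copy elim.): %d\n", prop.integrated);
        return 0;
    }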

Pretty sure CUFFT can do FFTs on zero-copy memory (this is part of the reason why I suggested somebody do some audio stuff with zero-copy on MCP79).
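Something along these lines should work; this is an untested sketch, with an arbitrary transform size and an in-place C2C transform just for illustration:

    #include <cuda_runtime.h>
    #include <cufft.h>

    #define N 4096

    int main(void)
    {
        cufftComplex *h_buf, *d_buf;
        cufftHandle plan;

        /* must be called before the CUDA context is created */
        cudaSetDeviceFlags(cudaDeviceMapHost);

        /* pinned + mapped host allocation, plus its device-side alias */
        cudaHostAlloc((void**)&h_buf, N * sizeof(cufftComplex), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);

        /* ... fill h_buf with data on the CPU ... */

        cufftPlan1d(&plan, N, CUFFT_C2C, 1);
        cufftExecC2C(plan, d_buf, d_buf, CUFFT_FORWARD); /* reads/writes host memory over the bus */

        cudaThreadSynchronize(); /* wait before touching h_buf on the CPU again */

        cufftDestroy(plan);
        cudaFreeHost(h_buf);
        return 0;
    }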

What about the faster FFT implementation by vvolkov that was talked about so much in http://forums.nvidia.com/index.php?showtopic=69801&hl=speedy+fft ? Is it going to be part of 2.2?

I don’t think this made it into 2.2, but I’m not 100% sure.

Hi, I’m a senior computer science researcher at the Italian CNR (the National Research Council, more or less…)

Do you think I could gain access to the 2.2 beta somehow? I had a look at the form you pointed to, but several entries of the form make little sense if one is an academic and doesn’t have the development of a specific application in mind…

Alternatively: is there a tentative release date for 2.2 for us common mortals?

Thanks,

giovanni

g.resta@iit.cnr.it

Apply via the form, fill in whatever information applies, and just say that you were pointed here by this forum thread. I’ll try to make sure it gets approved ASAP.

What’s the minimum memory read chunk size for zero-copy memory?

For current CUDA device memory access, we know we read by half-warp chunks, so at the minimum we read 16*4=64 bytes at a time even if we use only one word.

But for zero-copy these rules may be totally different… and any difference could have an impact on algorithms. On a modern CPU, a cache line is also 64 bytes, coincidentally the same as the CUDA device chunk size, and I would guess that this is the size of a PC’s memory system transaction as well.

I ask because there are applications with very scattered small reads or writes that may touch only single-word values in memory, where you end up wasting 15/16 of the device bandwidth. Is this true for zero-copy memory as well?

Many typical applications are affected by this minimum access chunk size: querying a 100 MB hash table, reading a vertex position (3 words) from an indirect vertex ID list, loading particle positions from a list of each particle’s neighbors, etc.
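To make the pattern concrete, here’s a toy pair of kernels (idx is a hypothetical gather table, and the names are made up):

    __global__ void coalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];      /* a half-warp touches one contiguous 64-byte segment */
    }

    __global__ void scattered(const float *in, float *out, const int *idx)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[idx[i]]; /* each thread may hit a different 64-byte segment */
    }

On compute 1.x device memory, the scattered version can burn a full chunk per useful word; the question is whether the same accounting applies to mapped host memory.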

Wouldn’t that be the same as for other PCIe transfers? So around 10K?

I think it will be 1 byte.

What if only one thread out of all of them reads a single character from memory? They have to accommodate that requirement.

I am more interested in seeing whether the latency of accessing memory across PCIe can be hidden with lots of threads… Sometimes, even to hide global memory accesses, we need 256 active threads…

Now, with the total number of active warps per SM increased to 32 in recent devices, I could ideally have 1024 threads active… With PCIe bandwidth probably running at around 1/10th the speed of the device’s internal memory bus, it would be interesting to see how latency hiding works out. (Is that 1/10th figure right?…)

Ah, so actually there are TWO questions: what’s the chunk size for G200-style PCIe zero-copy, and what is it for MCP79? They’re completely different questions, since MCP79 doesn’t even involve PCIe.

I think I suddenly see the appeal of laptops like the newer MacBooks, which have both an embedded and a PCIe-linked GPU. There’s a ton of fun stuff the embedded GPU can do with its newly easy memory access, even if it’s not too powerful!

I’m an audio guy, working on a ‘low latency’ convolution VST plugin that tries to put an absolute minimum load on the CPU, so I’ll go into the issues I’m hoping this 2.2 version will help with. I need to do block processing of around 512 samples or higher to get the plugin working OK (i.e. having time to do all the transfers and processing in the time available; at 44.1 kHz, a 512-sample block leaves only about 512/44100 ≈ 11.6 ms to do everything), and more like 4096 samples to make it work well with multiple instances. However, latency that high is not really acceptable for many uses in the audio world, so I’m trying to get it down.

The basic problem I have is how the effective memory bandwidth drops so much when transferring a given quantity of data in smaller and smaller blocks, as verified with shmoo tests set to smaller sizes like this. The time available for processing (dictated by the block size and sample rate) quickly gets eaten up by the time taken to transfer the data as you shrink the block sizes down. The actual data processing is pretty lightweight (two FFTs per block and a bunch of multiply-add operations done with a reduction) and will probably get faster and faster as cards improve, but the memory bandwidth for small transfers doesn’t seem to improve that much as generations of cards progress (I’ve seen a few shmoo tests around the forums; they beat my creaking 8600 GT, but not by orders of magnitude the way increasing transfer sizes does).

So, will the two new memory processing options presented in 2.2 (zero-copy / copy elimination) help speed up transfers of small blocks? I had assumed the system bus was the bottleneck here; is that right? Are small transfers across the bus going to be done that much faster if the data comes from main memory, avoids sitting in device memory, and goes straight into the SM for processing and then back out again to main memory?
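For concreteness, here is roughly the per-block flow I’m hoping zero-copy enables; this is only a sketch, and the scale kernel is a stand-in for my real FFT / multiply-add pipeline:

    #include <cuda_runtime.h>

    #define BLOCK 512

    __global__ void scale(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = 0.5f * in[i];  /* placeholder processing */
    }

    int main(void)
    {
        float *h_in, *h_out, *d_in, *d_out;

        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaHostAlloc((void**)&h_in,  BLOCK * sizeof(float), cudaHostAllocMapped);
        cudaHostAlloc((void**)&h_out, BLOCK * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&d_in,  h_in,  0);
        cudaHostGetDevicePointer((void**)&d_out, h_out, 0);

        for (int block = 0; block < 100; ++block) { /* stand-in for the audio callback loop */
            /* ... host writes BLOCK samples into h_in ... */
            scale<<<BLOCK / 64, 64>>>(d_in, d_out); /* kernel pulls/pushes over the bus */
            cudaThreadSynchronize();                /* results must land before the host reads h_out */
            /* ... host reads BLOCK samples from h_out ... */
        }

        cudaFreeHost(h_in);
        cudaFreeHost(h_out);
        return 0;
    }

No explicit cudaMemcpy per block; whether that actually beats small pinned memcpys is exactly what I want to find out.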

The biggest question is the “latency gap” created by these PCIe transfers. If you have enough threads to hide those latencies, there is no cooler thing than this feature. If not, you need to think twice before using it.

Note that higher compute capability cards (I can’t remember from which compute capability onwards) support 32 active warps per SM (as opposed to 24). So if you have enough threads on the SM, you can actually benefit.

Does anybody know where to find the 2.2 profiler?

It is now part of the toolkit.

Please make the installers for Windows 32-bit and Windows 64-bit independent so that they can coexist on the same computer.
I know you can do a lot of twists and hacks to get the libraries to compile, but that does not seem to be the right approach for a professional SDK, especially when the SDK is still evolving and making big updates.
Big developer tools like Visual Studio do not require previous installs to be removed for new releases: you can have VS2002, VS2003, all the way up to VS2010 on the same system.

I do not really understand why NVIDIA staff do not even address the issue, when it appears to be a real nuisance for developers on both 32- and 64-bit systems who want to incorporate CUDA support into their projects, even if we are insignificant compared to the larger crowds.

We’re working on it.

cudaprof is missing from the FC10 toolkit, but I just saw that it is in the RH5.3 one and in the FC9 one, so I will install the FC9 cudaprof when I’m back at work on Monday.

Very good decision to ship it with the toolkit.

The 2.2 beta programming guide is now in the first post.