CUDA Toolkit and SDK 2.3 betas available to registered developers

That’s understandable. Just to be clear, it will only suppress warnings within files in that directory, not globally. You wouldn’t be able to use libc without that option.

My program that uses FFT has become slow.

Windows Vista and XP, 8800 GTS, 512 MB, compiled for SM 1.0.

Great news! Could you provide us with some code samples?

Is it possible to allocate and memcpy a half_float* array?

CUDA doesn’t have a built-in half-float type, so basically you allocate the data using the unsigned short type (which is also 16 bits wide), and then convert it to float on reading using the new intrinsics.
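For example (an untested sketch; I’m assuming the conversion intrinsics are named __half2float() and __float2half_rn(), with unsigned short as the storage type):

__global__ void scale_halfs(unsigned short *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = __half2float(data[i]);   // fp16 -> fp32 on read
        data[i] = __float2half_rn(x * s);  // fp32 -> fp16 (round-to-nearest) on write
    }
}

// Host side: nothing special, it's just a 16-bit buffer.
unsigned short *d_data;
cudaMalloc((void **)&d_data, n * sizeof(unsigned short));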

If you want to use half-precision on the host, I would recommend using the half class included in the OpenEXR distribution (which is compatible with GPU halfs):
http://www.openexr.com/

I’m trying to run the profiler from the 2.3 beta (Fedora 10, x86_64). Whenever I try running one of my own programs, the profiler runs it successfully, but then reports “Error reading profiler output.” I can see a bunch of .csv files being produced, though. And if I try running one of the SDK examples, it is profiled successfully. Has anyone else seen a similar problem?

I’ve tried adding cudaThreadExit() to the end of my program, and I’ve also got libstdc++.so.6 in /usr/lib/. Also, $LD_LIBRARY_PATH points to the cudaprof bin/ directory.

Are you using the separately packaged profiler?

From the download site:

I would like to use fp32 on the host, copy the data into an fp16 array on the device, perform some computation on it, and then download the results back into fp32 on the host. How should I do that? (The idea behind this is to fit a table twice as big on the device.)

Another point: are the specs of fp16 available? Is its precision better within [-1, 1]?

I am now (2.3.09), but that hasn’t fixed the problem.

To be clear, current GPUs don’t natively perform any computation on fp16 (half) values; they can only convert quickly from fp16 to fp32, perform the computation in fp32, and then convert back to fp16. fp16 is really only useful as a storage format.
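To make the earlier upload/download question concrete, here is an untested sketch (the chunked staging buffer is just one way to do it; all names here are my own):

__global__ void float_to_half(const float *src, unsigned short *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = __float2half_rn(src[i]);
}

__global__ void half_to_float(const unsigned short *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = __half2float(src[i]);
}

// Upload: cudaMemcpy each fp32 chunk from the host into a small fp32 staging
// buffer on the device, then run float_to_half to pack it into the big fp16
// table. Kernels that use the table read unsigned short and call __half2float.
// Download is the reverse: half_to_float into the staging buffer, then
// cudaMemcpy back to the host. The table costs only 2 bytes per element.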

There are details on half precision here:

http://en.wikipedia.org/wiki/Half_precision

http://www.nvidia.com/dev_content/nvopengl…float_pixel.txt

Got it! Thanks

Is there a text version of cudaprof, or is there a way to use it in text mode? (There was a text version some time in the past, I think)

THX

-JL

Yes, there has always been a way to profile from the command line. Just read the manual: /cuda_install_location/doc/CUDA_Profiler_2.2.txt
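If I remember right, it’s driven by environment variables (the text file above has the full list), along the lines of:

CUDA_PROFILE=1 CUDA_PROFILE_LOG=profile.log ./my_program
cat profile.log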

Any news on the Ocelot release?

Ben

Does CUDA 2.3 support texturing from fp16 arrays?
And linear memory?

Did anyone try this?
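For reference, here is roughly what I was planning to try (untested; I’m guessing that a 16-bit cudaChannelFormatKindFloat channel is how you declare a half texture, with the hardware promoting to fp32 on fetch):

texture<float, 1, cudaReadModeElementType> texHalf;

__global__ void fetch(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1D(texHalf, i + 0.5f);  // arrives as fp32
}

// Host side:
cudaChannelFormatDesc desc =
    cudaCreateChannelDesc(16, 0, 0, 0, cudaChannelFormatKindFloat);
cudaArray *arr;
cudaMallocArray(&arr, &desc, n, 1);
cudaMemcpyToArray(arr, 0, 0, h_halfs, n * sizeof(unsigned short),
                  cudaMemcpyHostToDevice);
cudaBindTextureToArray(texHalf, arr, desc);

I have no idea whether the same channel descriptor works with cudaBindTexture() on linear memory, hence the question.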

Hi Simon,

Could you please elaborate on where I can find some documentation on using the new intrinsics to convert from half (stored as unsigned short) to float? Is this part of CUDA 2.2?

Thanks,

Rohit

The 2.3 programming guide’s section B.9 (Time Function) is unchanged since the 2.0 docs.

This paragraph is confusing, mostly because it’s wrong: the result of clock() is not per-thread at all. A simpler paragraph would work better, something like:

There are two other unanswered questions about clock(). Do other blocks on the same SM affect clock(), or is it really per-block? (I think, but am not positive, that it’s per-block, which means the first sentence of the paragraph above should say “per block” and not “per multiprocessor”.)

Do memory latency stall waits (when all threads are paused) get counted in clock()? (I think, but am not positive, that no, there’s no increment during those waits.)
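For reference, the measurement pattern I’m testing with (adapted from the SDK’s clock sample; each block records its own start and stop tick):

__global__ void timed_kernel(clock_t *timer /*, ... */)
{
    if (threadIdx.x == 0)
        timer[blockIdx.x] = clock();               // start tick for this block
    __syncthreads();

    // ... the code being timed ...

    __syncthreads();
    if (threadIdx.x == 0)
        timer[blockIdx.x + gridDim.x] = clock();   // stop tick for this block
}

// On the host, elapsed ticks for block b = timer[b + numBlocks] - timer[b].
// Whether those ticks advance during memory stalls is exactly what I'm unsure about.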

Will CUFFT ever support asynchronous execution and streams? It would be really useful for improving throughput :)

Also, built-in support for FFTs (1D at least) of unlimited size (i.e., limited only by memory) would be sweet. As was mentioned on these forums before, you can do big FFTs using smaller ones and a matrix transpose, so could this be built into CUFFT?
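For anyone searching later, this is the decomposition I mean, as an untested sketch: for N = n1*n2, view the input as an n2 x n1 row-major matrix, then transpose, batch-FFT, twiddle, transpose, batch-FFT, transpose (the naive transpose and the single-precision twiddle angles are for brevity only):

#include <cufft.h>
#include <cuComplex.h>

__global__ void transpose(const cuComplex *in, cuComplex *out, int rows, int cols)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols)
        out[c * rows + r] = in[r * cols + c];  // tile via shared memory in practice
}

__global__ void twiddle(cuComplex *a, int n1, int n2)
{
    // a is n1 rows x n2 cols; multiply element (r, c) by exp(-2*pi*i*r*c/(n1*n2))
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < n1 && c < n2) {
        // float angles lose accuracy for large r*c; use a table at big N
        float ang = -2.0f * 3.14159265f * r * c / (float)(n1 * n2);
        a[r * n2 + c] = cuCmulf(a[r * n2 + c], make_cuComplex(cosf(ang), sinf(ang)));
    }
}

// Full transform:
// 1. transpose to n1 x n2
// 2. n1 batched length-n2 FFTs: cufftPlan1d(&plan, n2, CUFFT_C2C, n1)
// 3. twiddle multiply
// 4. transpose to n2 x n1
// 5. n2 batched length-n1 FFTs
// 6. transpose, giving X in natural order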

Finally, is there an expected date for the 2.3 release?

Can Linux kernel 2.6.30 be supported, please? Some of the header files have changed, so the driver won’t compile.

thanks much,
nicholas

I’m not sure if this has already been updated in 2.3, but something I just noticed in 2.2 is that the syntax-highlighting file (usertype.dat) doesn’t highlight __threadfence() and __threadfence_block(). Kind of small, but there you go…

Hi Simon,

Could you please explain this in more detail? Also, is this supported by the runtime API?

So I have an array of 1000 halfs. I declare an unsigned short pointer and cudaMalloc 2000 bytes. Then, when I want to use this array in a kernel, what do I use for it in the function signature? unsigned short*? If so, how do I let CUDA know that I want it to treat this array as floats?

I am sorry to be boorish, but this question has a rather urgent business need behind it, and we will soon be a big customer. Thanks.