CUDA 2.0 beta impressions

What I’ve noticed so far with the CUDA 2.0 beta

The good:

  1. Compatibility with 1.1: HOOMD, a high performance GPU accelerated molecular dynamics simulation program, compiled and ran flawlessly on CUDA 2.0 with no changes.

  2. No performance delta: At least with the benchmarks in HOOMD, I see no appreciable performance delta from CUDA 1.1. Ok, so it isn’t faster, but at least it’s not slower :) Register usage might be +/-1 for some kernels, I didn’t check exhaustively.

  3. Bug fixes in place, such as the cudaMemcpyToArray performance, and probably many more I haven’t noticed

  4. Additional features: There is the mentioned 3D texture support. I also noticed that events are improved somewhat and that more device properties are available from the device query (like the number of multiprocessors). There is also a new built-in warp size variable that could be useful in some circumstances. There might be some other new features I’m missing here.

  5. Documentation improvements: There are man pages on Linux now: very handy for a quick lookup of a function. “man cudaMemcpyToArray” is much faster than “open pdf; wait; wait; wait; ctrl-F ‘cudaMemcpyToArray’; wait; wait; next; next; next; finally at the reference”. And the separation of programming/reference guides is good in theory, but… (see below)
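For item 4, here is a minimal sketch of reading the multiprocessor count from the device query and the warp size from inside a kernel. It assumes device 0 and uses a hypothetical kernel name (report_warp_size); it is only an illustration, not code from HOOMD.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper kernel: copies the built-in warpSize variable
// to global memory so the host can print it.
__global__ void report_warp_size(int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *out = warpSize;
}

int main()
{
    const int device = 0;  // assumption: the first device is the one of interest

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess)
    {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Host-side view of the device properties
    printf("%s (compute %d.%d): %d multiprocessors, warp size %d\n",
           prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, prop.warpSize);

    // Device-side view of the warp size
    cudaSetDevice(device);
    int *d_out = NULL;
    int h_out = 0;
    cudaMalloc((void **)&d_out, sizeof(int));
    report_warp_size<<<1, 32>>>(d_out);
    err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess)
    {
        printf("kernel or copy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("warpSize seen by the kernel: %d\n", h_out);
    return 0;
}
```

Sizing a grid from multiProcessorCount instead of a hard-coded number is one place where the extra device property could come in handy.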

The bad:

  1. There still doesn’t seem to be any way to “cast” a pointer to be a global pointer and get rid of the “Advisory: can’t tell what pointer is pointing to” warnings. Did I miss it, or did this not make it into 2.0 beta?

… haven’t found anything else bad yet.

The ugly:

  1. While separating the programming and reference guides is good, in theory, the implementation leaves something to be desired. The old appendix in the 1.1 guide nicely listed all functions by category. The new reference guide just gives a straight list, alphabetized I think.

The new reference is handy, but what I am missing is some kind of index that shows which page each function is explained on.

Thanks for the comments. Keep them coming - suggestions for improvement are most welcome.


I was interested to see that in the SDK there are now references to sm 1.3 (grep for SM13 in the projects directory) which, based on context, will be the hardware version with double precision support.

So I would imagine the double precision GPUs are very close now. :)

Hey, good catch :) I never even downloaded the SDK.

Was this really “fixed”? My performance tests still show only ~12GB/s transfer speed for DeviceToDevice transfers, while bandwidthTest shows ~45GB/s for 2x 1024*768 floats.

While this is an improvement over 1.1, it is still nowhere near the 45GB/s. This is a major bottleneck in my application.

The bug I was referring to was specifically related to using cudaMemcpyToArray: it only gets ~4 GiB/s in CUDA 1.1 (on an 8800 GTX). In the CUDA 2.0 beta, I get close to 60 GiB/s. Is your issue with cudaMemcpyToArray? Can you post a short sample code that demonstrates the problem?
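For reference, here is a minimal sketch of the kind of short test that would help pin this down: it times repeated device-to-device cudaMemcpyToArray calls with events. The 1024x768 float size is taken from the post above; the repetition count and the CHECK macro are my own assumptions, and newer toolkits flag cudaMemcpyToArray as deprecated.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            printf("%s failed: %s\n", #call, cudaGetErrorString(e));  \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int main()
{
    const int width = 1024, height = 768;  // size taken from the post above
    const size_t bytes = (size_t)width * height * sizeof(float);
    const int reps = 100;                  // assumption: enough repeats to average out noise

    // Linear source buffer in global memory
    float *d_src = NULL;
    CHECK(cudaMalloc((void **)&d_src, bytes));
    CHECK(cudaMemset(d_src, 0, bytes));

    // Destination 2D array
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *d_array = NULL;
    CHECK(cudaMallocArray(&d_array, &desc, width, height));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // One untimed warm-up copy
    CHECK(cudaMemcpyToArray(d_array, 0, 0, d_src, bytes, cudaMemcpyDeviceToDevice));

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpyToArray(d_array, 0, 0, d_src, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    // Bytes copied per second. Some tools count both the read and the
    // write for device-to-device copies, which doubles this figure.
    double gib_per_s = (double)bytes * reps / (ms / 1000.0) / (1024.0 * 1024.0 * 1024.0);
    printf("cudaMemcpyToArray D2D: %.2f GiB/s copied\n", gib_per_s);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFreeArray(d_array));
    CHECK(cudaFree(d_src));
    return 0;
}
```

Swapping the timed call for a plain cudaMemcpy between two linear buffers would show whether the slowdown is specific to array copies.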

There are also 64-bit atomic operations supported on the sm_12 arch, which I didn’t notice before. (Maybe they were already there… >.< )

__inline__ __device__ unsigned long long int atomicAdd(unsigned long long int *address, unsigned long long int val);

__inline__ __device__ unsigned long long int atomicExch(unsigned long long int *address, unsigned long long int val);

__inline__ __device__ unsigned long long int atomicCAS(unsigned long long int *address, unsigned long long int compare, unsigned long long int val);
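A minimal sketch of how the 64-bit atomicAdd declared above might be used, assuming a device and -arch setting that support it (sm_12 or higher, going by this thread). The kernel name sum_64 and the test sizes are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread adds one 32-bit input value
// into a single 64-bit running total.
__global__ void sum_64(const unsigned int *in, int n, unsigned long long int *total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(total, (unsigned long long int)in[i]);
}

int main()
{
    const int n = 1 << 16;       // assumption: small test size
    const int block_size = 256;

    unsigned int *d_in = NULL;
    unsigned long long int *d_total = NULL;
    cudaMalloc((void **)&d_in, n * sizeof(unsigned int));
    cudaMalloc((void **)&d_total, sizeof(unsigned long long int));

    // Setting every byte to 0xFF makes each input 0xFFFFFFFF,
    // so the sum cannot fit in 32 bits.
    cudaMemset(d_in, 0xFF, n * sizeof(unsigned int));
    cudaMemset(d_total, 0, sizeof(unsigned long long int));

    sum_64<<<(n + block_size - 1) / block_size, block_size>>>(d_in, n, d_total);

    unsigned long long int h_total = 0;
    cudaError_t err = cudaMemcpy(&h_total, d_total, sizeof(h_total),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    if (err != cudaSuccess)
    {
        printf("kernel or copy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Reference value computed on the host
    const unsigned long long int expected = (unsigned long long int)n * 0xFFFFFFFFull;
    printf("device sum = %llu, expected = %llu (%s)\n",
           h_total, expected, h_total == expected ? "match" : "MISMATCH");
    return 0;
}
```

Having every thread hit one address is the worst case for atomics; it is only meant to show the call, not a fast reduction.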

Ah, I wondered what sm_12 added, since last I recall, existing hardware only goes up to sm_11.