No performance delta: At least with the benchmarks in HOOMD, I see no appreciable performance delta compared to CUDA 1.1. OK, so it isn't faster, but at least it's not slower :) Register usage might be +/-1 for some kernels; I didn't check exhaustively.
Bug fixes are in place, such as the cudaMemcpyToArray performance issue, and probably many more I haven't noticed.
Additional features: There is the mentioned 3D texture support. I also noticed that events are somewhat improved and that more device properties are available from the query (like the number of multiprocessors). There is also a new warpSize built-in variable that could be useful in some circumstances. There might be other new features I'm missing here.
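For reference, a quick sketch of how the new properties can be queried. This assumes device 0 and the field names from the 2.0 cudaDeviceProp struct; adjust to taste:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Device-side: warpSize is now a built-in variable in device code.
__global__ void get_warp_size(int *out)
{
    if (threadIdx.x == 0)
        *out = warpSize;
}

int main(void)
{
    // Host-side: the new fields show up in cudaDeviceProp.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assuming device 0
    printf("Name:            %s\n", prop.name);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Warp size:       %d\n", prop.warpSize);
    return 0;
}
```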
Documentation improvements: There are man pages on Linux now: very handy for a quick lookup of a function. "man cudaMemcpyToArray" is much faster than "open PDF; wait; wait; wait; Ctrl-F 'cudaMemcpyToArray'; wait; wait; next; next; next; finally at the reference". And the separation of the programming and reference guides is good in theory, but… (see below)
The bad:
There still doesn't seem to be any way to "cast" a pointer to the global memory space and get rid of the "Advisory: Cannot tell what pointer points to" warnings. Did I miss it, or did this not make it into the 2.0 beta?
… haven't found anything else bad yet.
The ugly:
While separating the programming and reference guides is good in theory, the implementation leaves something to be desired. The old appendix in the 1.1 guide nicely listed all functions by category; the new reference guide just gives a straight list, alphabetized I think.
I was interested to see that in the SDK there are now references to sm 1.3 (grep for SM13 in the projects directory) which, based on context, will be the hardware version with double precision support.
So I would imagine the double precision GPUs are very close now. :)
The bug I was referring to was specifically related to using cudaMemcpyToArray: it only gets ~4 GiB/s in CUDA 1.1 (on an 8800 GTX). In the CUDA 2.0 beta, I get close to 60 GiB/s. Is your issue with cudaMemcpyToArray? Can you post a short sample code that demonstrates the problem?
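For reference, here is roughly the kind of copy I'm timing. This is only a sketch (the sizes and the device-to-device direction are arbitrary choices of mine, not from the original benchmark):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Rough bandwidth check for cudaMemcpyToArray.
int main(void)
{
    const int w = 2048, h = 2048;
    const size_t bytes = (size_t)w * h * sizeof(float);

    // A 2D float array as the copy destination.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *arr;
    cudaMallocArray(&arr, &desc, w, h);

    float *src;
    cudaMalloc((void **)&src, bytes);  // device source buffer

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpyToArray(arr, 0, 0, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.2f GiB/s\n",
           bytes / (ms * 1e-3) / (1024.0 * 1024.0 * 1024.0));

    cudaFreeArray(arr);
    cudaFree(src);
    return 0;
}
```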
There are also 64-bit atomic operations supported on the sm_12 architecture, which I didn't notice before. (Maybe they were already there… >.< )
__inline__ __device__ unsigned long long int atomicAdd(unsigned long long int *address, unsigned long long int val);
__inline__ __device__ unsigned long long int atomicExch(unsigned long long int *address, unsigned long long int val);
__inline__ __device__ unsigned long long int atomicCAS(unsigned long long int *address, unsigned long long int compare, unsigned long long int val);
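A minimal sketch of using the 64-bit atomicAdd, assuming you compile for sm_12 (nvcc -arch=sm_12); the grid/block sizes here are just illustrative:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Every thread bumps a single 64-bit counter atomically.
__global__ void count(unsigned long long int *counter)
{
    atomicAdd(counter, 1ULL);
}

int main(void)
{
    unsigned long long int *d_counter, h_counter = 0;
    cudaMalloc((void **)&d_counter, sizeof(*d_counter));
    cudaMemcpy(d_counter, &h_counter, sizeof(h_counter),
               cudaMemcpyHostToDevice);

    count<<<64, 256>>>(d_counter);  // 64 * 256 = 16384 increments

    cudaMemcpy(&h_counter, d_counter, sizeof(h_counter),
               cudaMemcpyDeviceToHost);
    printf("%llu\n", h_counter);
    cudaFree(d_counter);
    return 0;
}
```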