I think this is too little memory for serious computations on CUDA-capable devices.
And the PCI Express throughput (CPU memory → device memory) is too low.
What do you think about it?
What I mean is that my CUDA implementations were fast, but it is too easy to implement the same thing on the CPU, and the CPU version computes faster once memory usage gets serious.
What do you call serious? I do some serious (as in it runs a long time, and customers pay good money for results) computations on 3D grids that seldom exceed 100 in any dimension. So that’s about 8 Mbytes, and most CUDA-capable cards seem to come with half a Gbyte or more these days.
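As a rough check of the 8 Mbyte figure, here is a small Python sketch. It assumes one double-precision (8-byte) value per grid point, which the post does not state, and the function name is made up for illustration.

```python
def grid_bytes(nx, ny, nz, bytes_per_value=8, values_per_point=1):
    """Memory footprint of a dense 3D grid, in bytes.

    bytes_per_value=8 assumes double precision (an assumption, not
    something stated in the thread).
    """
    return nx * ny * nz * bytes_per_value * values_per_point

# A 100 x 100 x 100 grid with one double per point.
footprint = grid_bytes(100, 100, 100)
print(footprint / 1e6, "MB")        # 8.0 MB

# "Half a Gbyte" of device memory, taken here as 512 MiB.
card_bytes = 512 * 1024**2
print(card_bytes // footprint)      # 67 such grids would fit
```

Even with several fields per grid point, a problem of this size leaves most of the card's memory unused.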
The space of high performance / scientific computing problems is vast, and different regions have different needs. CUDA doesn’t accelerate your web browser or Microsoft Excel, but CUDA works for many other situations. To declare the currently working applications “not serious” is silly, though.
With 1 to 3 GB available on reasonably priced consumer devices (just got a 3GB GTX 580 in the lab yesterday!), I think CUDA has plenty of memory for many problems. If your working data set exceeds 6 GB (the most memory on any CUDA device currently available), then you will need to figure out how to partition your data. That is annoying, but people have had to do that since the dawn of computing. Your average cluster compute node probably “only” has 12 to 36 GB of system RAM, which is only a few times more than a CUDA device has, and compute nodes had far less than that 5 years ago. Researchers figured out how to work around these limitations with clever algorithms, just like they do now when processing multi-terabyte/petabyte datasets. The same techniques apply to multi-device and multi-node CUDA.
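To make the partitioning idea concrete, here is a minimal Python sketch of the simplest case: a flat array of independent items that is split into chunks small enough to fit on the device. The function name, the 80% usable-memory fraction, and the 10 GB example are all assumptions for illustration. Real problems with dependencies between chunks (halo regions on grids, for example) need more care than this.

```python
def partition(n_items, bytes_per_item, device_bytes, usable_fraction=0.8):
    """Yield (start, stop) index ranges whose data fits in device memory.

    usable_fraction is a hypothetical safety margin that leaves room
    for outputs and scratch buffers on the device.
    """
    budget = int(device_bytes * usable_fraction)
    per_chunk = budget // bytes_per_item
    if per_chunk < 1:
        raise ValueError("a single item does not fit in the memory budget")
    for start in range(0, n_items, per_chunk):
        yield start, min(start + per_chunk, n_items)

GB = 1024**3
n = 10 * GB // 4                      # 10 GB of 4-byte floats
chunks = list(partition(n, 4, 3 * GB))  # processed on a 3 GB card
print(len(chunks))                    # 5
```

Each (start, stop) range would then be copied to the device, processed, and copied back in turn, or handed to a different device in a multi-GPU setup.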
The bandwidth and latency of PCI Express also present interesting limitations, but none that prevent many kinds of “serious” work. I do hope that we will see a few generic CPU-like cores integrated onto NVIDIA GPUs, if only to explore what’s possible when you have strong coupling. (AMD’s Fusion line is already shipping slower processors that could demonstrate this, but I don’t know if the overhead of OpenCL needs to be reduced to fully exploit it.) But right now, PCI Express is fast enough to make CUDA useful for many situations.
That is hardly too little memory. Even on consumer devices, you get up to 1.5 GB. This is enough to do floating point mathematics on 400 million pixels, at one 4-byte single-precision value per pixel. The Host->Device bandwidth issue is unfortunate right now, but PCI-Express 3.0 is around the corner and promises to double the bandwidth. Anyway, there are tons of classes of algorithms that will easily bottleneck on the actual computing of the GPU instead of the I/O interface.
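A back-of-the-envelope Python sketch of both points follows. The bandwidth and throughput numbers are round assumptions (about 8 GB/s as the nominal PCI-Express 2.0 x16 figure, double that for 3.0, and a made-up sustained 500 GFLOP/s on the GPU), not measurements, and the function name is hypothetical.

```python
GIB = 1024**3

# How many 4-byte floats fit in 1.5 GB (taken as 1.5 GiB)?
floats_in_memory = int(1.5 * GIB) // 4
print(floats_in_memory)   # 402653184, i.e. roughly 400 million

def breakeven_flops_per_value(bytes_per_value, bandwidth, flops_per_s):
    """Arithmetic per transferred value above which compute time
    exceeds host-to-device transfer time (no overlap assumed)."""
    return bytes_per_value * flops_per_s / bandwidth

# Assumed round numbers, for illustration only.
PCIE2_X16 = 8e9      # bytes/s, nominal
PCIE3_X16 = 16e9     # bytes/s, roughly double
GPU_FLOPS = 500e9    # sustained FLOP/s, hypothetical

print(breakeven_flops_per_value(4, PCIE2_X16, GPU_FLOPS))  # 250.0
print(breakeven_flops_per_value(4, PCIE3_X16, GPU_FLOPS))  # 125.0
```

Under these assumptions, an algorithm that does more than a few hundred operations per value it transfers is limited by the GPU's compute rate rather than by the bus, and data that stays resident on the device across many kernel launches pays the transfer cost only once.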
And if you’ve got the cash, you can quadruple that capacity and get other nice bonuses (like ECC) by going to the professional cards.