Too little memory for serious computations on CUDA-supported devices

I think this is too little memory for serious computations on CUDA-supported devices.

And the PCI Express throughput (CPU memory → device memory) is too low.

What do you think about it?

What I mean is that my CUDA implementations were fast, but the same things are just as easy to implement on the CPU, and the CPU version is faster (in actual compute time) once serious amounts of memory are involved.

What do you call serious? I do some serious (as in it runs a long time, and customers pay good money for results) computations on 3D grids that seldom exceed 100 in any dimension. So that’s about 8 Mbytes, and most CUDA-capable cards seem to come with half a Gbyte or more these days.
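
For concreteness, the arithmetic behind that figure (assuming one double-precision value per grid point, which is what the 8 Mbyte number implies):

100 × 100 × 100 points × 8 bytes/point = 8 × 10^6 bytes ≈ 8 MB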

The space of high performance / scientific computing problems is vast, and different regions have different needs. CUDA doesn’t accelerate your web browser or Microsoft Excel, but CUDA works for many other situations. To declare the currently working applications “not serious” is silly, though.

With 1 to 3 GB available on reasonably priced consumer devices (just got a 3GB GTX 580 in the lab yesterday!), I think CUDA has plenty of memory for many problems. If your working data set exceeds 6 GB (the largest CUDA device available), then you will need to figure out how to partition your data. That is annoying, but people have had to do that since the dawn of computing. Your average cluster compute node probably “only” has 12 - 36 GB of system RAM, which is just a factor of a few different than a CUDA device, and compute nodes had far less than that 5 years ago. Researchers figured out how to work around these limitations with clever algorithms, just like they do now when processing multi-terabyte/petabyte datasets. The same techniques apply to multi-device and multi-node CUDA.
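
As an illustration of that partitioning, here is a minimal sketch of the standard approach: stream a host array that exceeds device memory through the GPU one device-sized chunk at a time. The process_chunk kernel and the doubling it performs are made-up placeholders, not anything from this thread.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real work on one chunk.
__global__ void process_chunk(float *data, size_t n) {
    // Grid-stride loop so a fixed launch size covers any chunk length.
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        data[i] *= 2.0f;
}

// Push a host array far larger than device memory through the GPU,
// one chunk at a time.
void process_large_array(float *host, size_t total, size_t chunk) {
    float *dev;
    cudaMalloc((void**)&dev, chunk * sizeof(float));
    for (size_t off = 0; off < total; off += chunk) {
        size_t n = (total - off < chunk) ? total - off : chunk;
        cudaMemcpy(dev, host + off, n * sizeof(float), cudaMemcpyHostToDevice);
        process_chunk<<<256, 256>>>(dev, n);
        cudaMemcpy(host + off, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
}
```

The same loop structure extends to multiple devices or nodes by assigning each one its own range of offsets.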

The bandwidth and latency of PCI Express also present interesting limitations, but none that prevent many kinds of “serious” work. I do hope that we will see a few generic CPU-like cores integrated onto NVIDIA GPUs, if only to explore what’s possible when you have strong coupling. (AMD’s Fusion line is already shipping slower processors that could demonstrate this, but I don’t know if the overhead of OpenCL needs to be reduced to fully exploit it.) But right now, PCI Express is fast enough to make CUDA useful for many situations.
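
To make “fast enough” concrete: the usual trick is to hide part of the transfer cost behind computation, using pinned host memory and streams. A minimal sketch, with a made-up kernel and arbitrary sizes; two tiles are enough to show the overlap:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 22, half = n / 2;         // two tiles of the input
    float *h, *d;
    cudaMallocHost((void**)&h, n * sizeof(float));  // pinned: required for truly async copies
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // While tile 0 is being computed, tile 1's upload can proceed over PCI Express.
    for (int t = 0; t < 2; ++t) {
        float *hp = h + t * half, *dp = d + t * half;
        cudaMemcpyAsync(dp, hp, half * sizeof(float), cudaMemcpyHostToDevice, s[t]);
        scale<<<(unsigned)((half + 255) / 256), 256, 0, s[t]>>>(dp, half);
        cudaMemcpyAsync(hp, dp, half * sizeof(float), cudaMemcpyDeviceToHost, s[t]);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```

With only two tiles the saving is modest; real pipelines use more tiles so that uploads, kernels, and downloads from different tiles overlap continuously.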

And besides, no one really needs more than 640K on a personal computer :-)

Hardly “not enough memory”. Even on consumer devices you get up to 1.5 GB, which is enough to do floating-point mathematics on 400 million pixels. The host-to-device bandwidth issue is unfortunate right now, but PCI Express 3.0 is around the corner and promises to double the bandwidth. Anyway, there are tons of classes of algorithms that will easily bottleneck on the GPU’s actual computation rather than on the I/O interface.
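
(For the record, the arithmetic behind that pixel count, assuming one 32-bit float per pixel: 1.5 GB ÷ 4 bytes/float ≈ 3.75 × 10^8 values, i.e. on the order of 400 million.)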

And you can quadruple that capacity, and get other nice bonuses (like ECC), if you’ve got the cash for the professional cards.

When a vector A is added to a vector B and the result is a vector C, the GPU is much faster than the CPU. However (1), the time spent transferring the vectors A and B to the GPU and copying the vector C back into CPU memory probably exceeds the time the CPU would need for the whole operation. The GPU’s potential rises with the complexity of the operation, so sooner or later the transfer effort is paid back by the GPU. But not every kind of operation can be phrased for a GPU in such a way that it runs faster than on a CPU.
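
A minimal sketch of exactly that tradeoff, timing the copies separately from the kernel with CUDA events (the array size is arbitrary):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const size_t n = 1 << 22;                     // 4M elements, 16 MB per vector
    const size_t bytes = n * sizeof(float);
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (size_t i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dc, bytes);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);            // A to device
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);            // B to device
    cudaEventRecord(t1);
    vec_add<<<(unsigned)((n + 255) / 256), 256>>>(da, db, dc, n); // the actual work
    cudaEventRecord(t2);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);            // C back to host
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float up, kern, down;
    cudaEventElapsedTime(&up, t0, t1);
    cudaEventElapsedTime(&kern, t1, t2);
    cudaEventElapsedTime(&down, t2, t3);
    printf("copy in: %.2f ms  kernel: %.2f ms  copy out: %.2f ms\n", up, kern, down);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

On a typical PCI Express setup the two uploads plus the download take far longer than the addition itself, which is the imbalance described above.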

However (2), the CPU memory is not manufactured and shipped with data. It has to be filled by I/O operations, which are very slow in contrast to the GPU and CPU. In theory, I/O can be done by GPUs directly; the Tesla cards come with InfiniBand drivers for direct I/O.