I recently started porting some of my codes to CUDA. The port was successful and the codes now run nicely on Tesla cards; I have tested them on a Tesla card with 3 GB of RAM. The maximum problem size is quite large, so CUDA is helpful: for the largest size I could fit, the Tesla card finished the job in the same time as 96 cores (12 cores/node) using MPI. However, I also need larger sizes, so I am considering a configuration with 4 Tesla cards in SLI mode.
Most of the time in my programs is spent on Fourier transforms of 2D or 3D matrices. Can the cuFFT library take advantage of a multi-GPU setup, or do I have to use OpenMP and do batched transforms along each direction, transposing the matrix between them? Is the memory shared when the cards are in SLI mode? Would 4 GPUs in SLI be seen as a single GPU with 4 times the memory and 4 times the CUDA cores?
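To make the fallback idea concrete, here is a rough, untested sketch of what I have in mind if cuFFT cannot do this directly: one OpenMP thread per GPU, each card owning a slab of xy-planes and doing batched 2D transforms on its slab with `cufftPlanMany`, followed by a host-side transpose and batched 1D transforms along z (the function name, sizes, and the naive staging through host memory are just placeholders for illustration):

```cuda
// Sketch (untested): 3D FFT split across nGPU cards, one OpenMP thread per GPU.
// Each GPU holds nz/nGPU planes; batched 2D FFTs run locally, then the data
// would be transposed (here naively via the host) so the remaining 1D
// transforms along z become contiguous.
#include <cufft.h>
#include <cuda_runtime.h>
#include <omp.h>

void fft3d_multi_gpu(cufftComplex *h_data, int nx, int ny, int nz, int nGPU)
{
    #pragma omp parallel num_threads(nGPU)
    {
        int g = omp_get_thread_num();
        cudaSetDevice(g);

        int slab = nz / nGPU;  // planes owned by this GPU (assume it divides)
        size_t bytes = (size_t)nx * ny * slab * sizeof(cufftComplex);

        cufftComplex *d;
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h_data + (size_t)g * nx * ny * slab, bytes,
                   cudaMemcpyHostToDevice);

        // Batched 2D transforms over this GPU's slab of xy-planes;
        // n[0] is the slower (y) dimension, n[1] the faster (x) one.
        cufftHandle plan2d;
        int n2[2] = { ny, nx };
        cufftPlanMany(&plan2d, 2, n2, NULL, 1, nx * ny,
                      NULL, 1, nx * ny, CUFFT_C2C, slab);
        cufftExecC2C(plan2d, d, d, CUFFT_FORWARD);

        cudaMemcpy(h_data + (size_t)g * nx * ny * slab, d, bytes,
                   cudaMemcpyDeviceToHost);
        cufftDestroy(plan2d);
        cudaFree(d);
    }
    // ... then transpose h_data on the host so z is contiguous and repeat
    // the same pattern with batched 1D z-transforms.
}
```

The host-side transpose and the PCIe round trips would obviously dominate for large matrices, which is why I am hoping cuFFT (or the SLI configuration) can handle the multi-GPU distribution itself.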