Is CUDA right for my application?

I have two large one-dimensional arrays of 32-bit integers, 10.4M elements each, taking up ~40 MB of RAM apiece. A few simple arithmetic operations are performed on them. The operations are the same for every element, and the result is stored at the same index (e.g. C[i] = A[i] % B[i]). With so few simple operations per element, is CUDA going to speed things up significantly? I will be able to copy both arrays into GPU memory once, let the threads complete, and copy the result back to system memory for further manipulation.

Cheers and Happy New Year!
Kale

If you have to copy new array contents for each kernel call, then I suspect not. The bottleneck in that case will be the PCI-Express bus, which can move data at about 6 GB/sec under ideal conditions (Core i7, pinned memory). I would expect that for simple arithmetic operations the CPU could easily keep up with 6 GB/sec, but you should check.

If an array can be loaded once and operated on many times, CUDA can help you because the on-card memory bandwidth of a high-end GPU is > 100 GB/sec.
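To put rough numbers on that, here is a back-of-envelope sketch using only the figures quoted in this thread. It assumes three arrays cross the bus per pass (A and B in, C out) and that a GB is 10^9 bytes; real throughput will differ, so treat it as an estimate, not a measurement.

```python
# Rough transfer-time estimate from the figures quoted in this thread.
N_ELEMENTS = 10_400_000    # elements per array (from the question)
BYTES_PER_ELEMENT = 4      # 32-bit integers
PCIE_BYTES_PER_S = 6e9     # best-case PCI-Express rate quoted above
GPU_BYTES_PER_S = 100e9    # lower bound on on-card bandwidth quoted above


def transfer_ms(n_arrays, bytes_per_s):
    """Milliseconds needed to move n_arrays full arrays at the given rate."""
    total_bytes = n_arrays * N_ELEMENTS * BYTES_PER_ELEMENT
    return 1e3 * total_bytes / bytes_per_s


# Assumption: A and B go to the card and C comes back, so 3 arrays per pass.
pcie_ms = transfer_ms(3, PCIE_BYTES_PER_S)
# The same 3 arrays touched once in on-card memory (read A, read B, write C).
gpu_ms = transfer_ms(3, GPU_BYTES_PER_S)

print(f"PCI-Express round trip: {pcie_ms:.1f} ms")
print(f"One on-card pass:       {gpu_ms:.2f} ms")
```

Under these assumptions the bus traffic costs roughly 21 ms per pass, while one pass over the same data in on-card memory takes a little over 1 ms. That is why the number of passes you run per upload matters so much.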

Thanks for the info, mate! I suspected as much, but wasn’t sure (this is what happens when we non-CS people program). I do have to perform a Number Theoretic Transform before all of the simple math, which is very similar to an FFT but uses modular integer math rather than floating point, and I’ve been a bit wary of trying to write something that complicated in CUDA. I’m not the most adept programmer. It sounds like CUDA could speed everything up if I did the forward NTT, then the arithmetic, then the inverse NTT, all on the card, before copying back to system memory.
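For reference, the pipeline described above (forward NTT, per-element arithmetic, inverse NTT) can be sketched on the CPU first, which gives something to check a later GPU port against. This is a minimal pure-Python sketch, not CUDA code. The modulus 998244353 with primitive root 3 is an assumed, commonly used NTT-friendly prime, not necessarily the one in use here. The pointwise multiply stands in for the unspecified arithmetic, and the input length is assumed to be a power of two no larger than 2^23.

```python
MOD = 998244353   # assumed prime: 119 * 2**23 + 1
ROOT = 3          # a primitive root modulo MOD


def ntt(values, invert=False):
    """Iterative radix-2 NTT modulo MOD; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"

    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    # Butterfly stages over blocks of size 2, 4, ..., n.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)   # root of unity of this order
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)          # modular inverse via Fermat
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1

    if invert:
        n_inv = pow(n, MOD - 2, MOD)                  # scale by 1/n
        a = [x * n_inv % MOD for x in a]
    return a


def cyclic_convolve(x, y):
    """Forward NTT both inputs, multiply pointwise, then inverse NTT."""
    fx, fy = ntt(x), ntt(y)
    return ntt([p * q % MOD for p, q in zip(fx, fy)], invert=True)


# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print(cyclic_convolve([1, 2, 3, 0], [4, 5, 0, 0]))
```

In a GPU version, each butterfly stage and the pointwise step would be natural candidates for kernels, with the data staying in on-card memory between them.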

Thanks again for the info, mate. And happy new year to all!