I have an application that uses a large array/table (100+ MB) to store constant values. Accesses from different threads may span multiple 128-byte segments and can't be coalesced. Will CUDA speed up this application? In other words, will uncoalesced global memory access on the GPU still be faster than the same accesses on the CPU?
It may be faster on a GPU. It will depend on the exact access pattern(s) and the raw memory throughput of the GPU.
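The quickest way to find out for a particular access pattern is probably to measure it. Here is a minimal sketch of such a test, not a definitive benchmark. It assumes a 128 MiB table of `float` and a made-up access pattern (a multiplicative hash of the thread index) that makes the loads of each warp land in different 128-byte segments. The kernel name `scattered_gather`, the sizes, and the hash are all placeholders; substitute the index computation your application really uses.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Each thread reads one table entry. The index is a hash of the thread id
// (an assumed stand-in for the real access pattern), so neighbouring threads
// touch addresses megabytes apart and the loads cannot be coalesced.
__global__ void scattered_gather(const float* __restrict__ table, uint32_t mask,
                                 float* __restrict__ out, uint32_t n)
{
    uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint32_t idx = (i * 2654435761u) & mask;  // mask = table size - 1 (power of two)
        out[i] = table[idx];
    }
}

int main()
{
    const uint32_t table_elems = 1u << 25;  // 32 Mi floats = 128 MiB
    const uint32_t n = 1u << 24;            // number of lookups per launch

    float *table = nullptr, *out = nullptr;
    if (cudaMalloc(&table, table_elems * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(table, 0, table_elems * sizeof(float));

    const uint32_t threads = 256;
    const uint32_t blocks = (n + threads - 1) / threads;

    // Warm-up launch so the timed run does not include one-time startup costs.
    scattered_gather<<<blocks, threads>>>(table, table_elems - 1, out, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    scattered_gather<<<blocks, threads>>>(table, table_elems - 1, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%u lookups in %.3f ms (%.2f G lookups/s)\n", n, ms, n / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    cudaFree(table);
    return 0;
}
```

Comparing the lookup rate printed here with the rate of an equivalent loop on the CPU gives a rough answer for your hardware. Note that the time to copy the table to the GPU is not included, which only matters if the table has to be uploaded more than once.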
In general, large tables are almost never a good idea with modern CPUs and GPUs (for reasons of both performance and power consumption) and I would recommend replacing tables with computation where feasible. Especially on GPUs, it is mostly true that “FLOPS are too cheap to meter”.
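To illustrate what replacing a table with computation looks like, here is a sketch under a loudly stated assumption: that the table merely caches a function that can be recomputed, here the hypothetical `table[i] == sinf(i * step)`. Your table may hold something quite different, and if its contents cannot be derived from the index this does not apply. The kernel names and `step` are invented for the example.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Table version: one scattered global-memory load per result.
__global__ void lookup_version(const float* __restrict__ table,
                               const uint32_t* __restrict__ idx,
                               float* __restrict__ out, uint32_t n)
{
    uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = table[idx[i]];
}

// Computed version: ASSUMES table[k] == sinf(k * step), so the value is
// recomputed in registers and the scattered load disappears entirely.
__global__ void compute_version(const uint32_t* __restrict__ idx, float step,
                                float* __restrict__ out, uint32_t n)
{
    uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sinf(idx[i] * step);
}

int main()
{
    const uint32_t n = 1u << 20;  // small sizes, just enough to launch both kernels
    float *table = nullptr, *out = nullptr;
    uint32_t *idx = nullptr;
    cudaMalloc(&table, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMalloc(&idx, n * sizeof(uint32_t));
    cudaMemset(table, 0, n * sizeof(float));
    cudaMemset(idx, 0, n * sizeof(uint32_t));  // all indices 0: valid but not realistic

    const uint32_t threads = 256;
    const uint32_t blocks = (n + threads - 1) / threads;
    lookup_version<<<blocks, threads>>>(table, idx, out, n);
    compute_version<<<blocks, threads>>>(idx, 1.0e-3f, out, n);

    cudaError_t err = cudaDeviceSynchronize();
    cudaFree(idx);
    cudaFree(out);
    cudaFree(table);
    return err == cudaSuccess ? 0 : 1;
}
```

Both kernels still read `idx` and write `out` with coalesced accesses; the only difference is whether the value comes from a scattered load or from arithmetic. Timing the two with real indices would show whether the computation is cheaper than the memory traffic in your case.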