If we want to do a basic matrix-matrix multiplication using CUDA (e.g. C = A*B), we first need to copy the data stored in A and B from host to device memory, which incurs transfer latency. Furthermore, we would need to use tiling techniques via shared memory to limit the number of global memory accesses. But given this unavoidable host-to-device transfer cost, is using CUDA really better than, say, using OpenMP? If so, how can we know a priori that CUDA will beat OpenMP with N cores, without actually writing the CUDA code? At what matrix dimension (for a given number of OpenMP cores) does it become worth using CUDA over OpenMP?
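To make the question concrete, here is the kind of back-of-envelope estimate I have in mind: compare transfer time plus GPU compute time against CPU compute time using peak-rate assumptions. All hardware numbers below (PCIe bandwidth, GPU and per-core CPU peak FLOP rates, 8 cores) are illustrative assumptions, not measurements, and real codes reach only a fraction of peak:

```python
# Illustrative a-priori cost model for C = A*B in double precision.
# Hardware numbers are assumed placeholders, not measured values.
PCIE_BW   = 12e9   # host<->device bandwidth, bytes/s (assumed PCIe 3.0 x16)
GPU_FLOPS = 5e12   # assumed GPU peak double-precision throughput, FLOP/s
CPU_FLOPS = 50e9   # assumed per-core CPU peak, FLOP/s

def times(n, cores=8):
    """Return (gpu_seconds_incl_transfer, cpu_seconds) for an n x n matmul."""
    flops = 2.0 * n**3               # one multiply + one add per output term
    xfer  = 3 * n * n * 8 / PCIE_BW  # copy A and B to device, C back (8-byte doubles)
    t_gpu = flops / GPU_FLOPS + xfer
    t_cpu = flops / (CPU_FLOPS * cores)
    return t_gpu, t_cpu

for n in (256, 512, 1024, 2048, 4096):
    g, c = times(n)
    print(f"n={n:5d}  GPU(+copy)={g:.5f}s  CPU(8 cores)={c:.5f}s  "
          f"{'GPU wins' if g < c else 'CPU wins'}")
```

With these particular numbers the crossover lands at a few hundred rows: the O(n^2) transfer cost dominates small problems, while the O(n^3) compute cost dominates large ones. My question is whether such a model is a sound way to decide in advance, and what realistic efficiency factors one should plug in.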