CPU faster than CUDA

Hello all.

I’m starting to look into parallel processing with CUDA. I’ve been doing most of my programming in 3D modelling software up to this point, but I really want access to the CUDA library to speed things up. In any case, I’ve installed it, and I’m getting the unexpected result that the example code runs faster without the parallelization. For example, the attached code returns:
VectorAdd took 0.7805259227752686 seconds (CPU)
VectorAdd took 1.8527252674102783 seconds (GPU)

I know this is probably a fairly dumb question, but any help would be appreciated. The GPU is a Quadro P4000.

import numpy as np
import time

from numba import vectorize, cuda

# Plain NumPy addition on the CPU
def VectorAddCPU(a, b):
    return a + b

# Numba ufunc compiled for the CUDA target; each call copies the inputs
# to the GPU and the result back to the host
@vectorize(['float32(float32, float32)'], target='cuda')
def VectorAddGPU(a, b):
    return a + b

def main():
    N = 320000000

    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)

    start = time.time()
    C = VectorAddCPU(A, B)
    vector_add_time = time.time() - start
    print("VectorAdd took for % seconds" % vector_add_time)

    start = time.time()
    C = VectorAddGPU(A, B)
    vector_add_time = time.time() - start
    print("VectorAdd took for % seconds" % vector_add_time)

if __name__=='__main__':
    main()

Short answer:

GPUs are fast at computation, but a vector addition consists mainly of reading and writing data, with a cheap addition in between. Instead of the trivial a + b, try something like cos(a)*sin(a)*cos(b)*sin(b) (the expression itself is meaningless; it just creates an “artificial” compute workload, as in the sketch below). The GPU will most likely be faster then.
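As a rough sketch of what such an artificial workload could look like (this assumes the same Numba @vectorize setup as in the question; the names heavy_cpu and heavy_gpu are just placeholders):

import math
from numba import vectorize

# Same signature as the question's example, but with a compute-heavy body
# so the arithmetic outweighs the data movement.
@vectorize(['float32(float32, float32)'], target='cuda')
def heavy_gpu(a, b):
    return math.cos(a) * math.sin(a) * math.cos(b) * math.sin(b)

# CPU reference built the same way, to keep the comparison apples-to-apples.
@vectorize(['float32(float32, float32)'], target='cpu')
def heavy_cpu(a, b):
    return math.cos(a) * math.sin(a) * math.cos(b) * math.sin(b)

Timing these two the same way as in the question should most likely show the GPU pulling ahead once the per-element work is no longer trivial.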

Long answer: The introductory part of what I wrote here.

GPUs are not only fast for computation. They are also, in many cases, fast for data movement.

Issues of “CPU faster than GPU” in my experience boil down to these major categories:

(1) Comparing a high(er)-end CPU to a low(er)-end GPU. The performance spectrum spans roughly a factor of ten on both sides of the computing universe. The average speedup on the GPU across many use cases, for equally well-optimized code on high-end equipment, is around 5x, with a typical range of 2x to 10x.

(2) Too much data movement between CPU and GPU (host and device). The CPU might have a memory bandwidth of 100 GB/sec and the GPU one of 500 GB/sec, but they are currently most often connected by a PCIe gen3 link that delivers at most about 12 GB/sec per direction (see the timing sketch after this list).

(3) Memory-bandwidth-limited computation on small datasets. If the use case makes good use of the CPU caches, with their extremely high bandwidth, the memory bandwidth advantage of the GPU is nullified.

(4) Use cases that are a poor fit for present GPU architectures: code with lots of data-dependent branches or “random” memory access patterns, and latency-sensitive computations. As GPU architectures have become more flexible with every new generation, the universe of “unsuitable” use cases continues to shrink, so when in doubt, try it out.
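To see how much of the measured GPU time in the posted example is transfer rather than computation, here is a rough timing sketch (assuming the same VectorAddGPU ufunc; cuda.to_device, copy_to_host and cuda.synchronize come from numba.cuda, and the wall-clock numbers are only illustrative):

import numpy as np
import time
from numba import vectorize, cuda

@vectorize(['float32(float32, float32)'], target='cuda')
def VectorAddGPU(a, b):
    return a + b

N = 320000000
A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)

VectorAddGPU(A[:16], B[:16])      # warm-up call so JIT compilation is not timed

start = time.time()
d_A = cuda.to_device(A)           # host -> device copy over PCIe
d_B = cuda.to_device(B)
h2d_time = time.time() - start

start = time.time()
d_C = VectorAddGPU(d_A, d_B)      # kernel only; the result stays on the device
cuda.synchronize()                # wait for the GPU before stopping the clock
kernel_time = time.time() - start

start = time.time()
C = d_C.copy_to_host()            # device -> host copy over PCIe
d2h_time = time.time() - start

print("H2D %.3f s, kernel %.3f s, D2H %.3f s" % (h2d_time, kernel_time, d2h_time))

With three float32 arrays of 320 million elements each, roughly 2.5 GB crosses PCIe on the way in and about 1.3 GB on the way out, so for a pure a + b the transfers can easily dominate the kernel time.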

In this case I would guess we are looking at an instance of either (2) or (3).