PyCUDA runs super slow on Jetson Xavier NX compared to running on CPU

I am trying to use CUDA to accelerate my program on a Jetson Xavier NX. I installed PyCUDA using ‘pip install pycuda==2019.1.2’.

When I run the following program on Jetson:

from pycuda import compiler, gpuarray, autoinit, driver
import numpy
import time
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply")

a = numpy.random.randn(1000).astype(numpy.float32)
b = numpy.random.randn(1000).astype(numpy.float32)

a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)

dest = numpy.zeros_like(a)
dest_gpu = gpuarray.to_gpu(dest)

for i in range(10000):
    st = time.time()
    dest = a*b
    print('endtime = ', time.time() - st)


for i in range(10000):
    st = time.time()
    multiply_them(dest_gpu, a_gpu, b_gpu,
                  block=(1000, 1, 1), grid=(1, 1))
    print('endtime = ', time.time() - st)


It shows that the CUDA version of the multiply is 5 times slower than NumPy on the CPU.

At first I thought it was an overhead problem, but I checked the time after 800-900 iterations, and the CUDA version is still much slower than NumPy on the CPU.

Is this a normal behaviour?

FYI: I don't think this is a general CUDA issue. The Xavier I'm using runs DeepStream and jetson-inference perfectly, at fast speed.

Not sure about your case, but be aware that CUDA may take some time to load and set up, so it is better to time the first iterations separately and measure the real throughput once things are running.
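
A minimal sketch of that kind of measurement, assuming a hypothetical helper named time_calls (it is not part of PyCUDA or of the code in this thread). It runs some untimed warm-up calls first, then reports the mean time per call. The optional sync argument is there because CUDA kernel launches return before the kernel has finished, so the clock should only be stopped after waiting for the GPU.

```python
import time


def time_calls(fn, n_iter=1000, n_warmup=100, sync=None):
    """Return the mean wall-clock seconds per call of fn(), after warm-up.

    sync is an optional callable run before each clock reading; for
    asynchronous work such as CUDA kernel launches it should block until
    the queued work has finished.
    """
    for _ in range(n_warmup):   # untimed: absorbs one-time setup cost
        fn()
    if sync is not None:
        sync()                  # make sure the warm-up work is done
    start = time.perf_counter()
    for _ in range(n_iter):
        fn()
    if sync is not None:
        sync()                  # wait for queued work before stopping the clock
    return (time.perf_counter() - start) / n_iter
```

With the names from the code above, the GPU case could be timed as time_calls(lambda: multiply_them(dest_gpu, a_gpu, b_gpu, block=(1000, 1, 1), grid=(1, 1)), sync=driver.Context.synchronize), and the CPU case as time_calls(lambda: a * b).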

Yes, I actually print the time on every iteration inside the loop. It still shows the GPU version as 46-100 times slower than the CPU.

Could you tell us which CPU you’re comparing against?
I’m not a PyCUDA user, so someone else may be able to advise better.

In the above example code, the line running on the CPU is “dest = a*b”, and the line running on the GPU is ‘multiply_them(…’


Here is the underlying mechanism for each iteration in your implementation:

  • Copy a from CPU buffer to GPU buffer
  • Copy b from CPU buffer to GPU buffer
  • Compute a*b with GPU
  • Copy dest from GPU buffer back to CPU buffer

In general, the memory is copied just once, outside the loop, rather than on every iteration.
You can modify the implementation to see if it helps.


Hi AastaLLL,

Thanks for the reply. I have updated the code above so that it preloads the arrays onto the GPU (gpuarray.to_gpu) and then only runs the calculation in the loop.

The speed is slightly better, but it still seems to be 3-4 times slower than the CPU runtime on the Jetson.
Could anything else be wrong here?


Based on the implementation, the GPU kernel is triggered for only one small calculation.
Since there is no substantial parallel job, the execution time is roughly that of the sequential job (as on the CPU) plus some kernel launch overhead.

In general, you are expected to perform a parallel task to benchmark the GPU.
For example, a matrix add in which each thread computes one output index would be a good test.
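
A rough sketch of such a benchmark, under some assumptions: the matrix is treated as a flat float32 array, and the names blocks_needed, KERNEL_SRC and run_gpu_benchmark, the array size and the block size of 256 are illustrative choices, not from this thread. Each thread computes one output element from a global index that spans many blocks. The inputs are copied to the GPU once, and the clock is stopped only after synchronizing.

```python
import time


def blocks_needed(n, threads_per_block):
    """Number of blocks so that blocks * threads_per_block >= n."""
    return (n + threads_per_block - 1) // threads_per_block


KERNEL_SRC = """
__global__ void add(float *dest, const float *a, const float *b, int n)
{
    // One thread per output element; the global index spans all blocks.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)   // guard the threads past the end of the array
        dest[i] = a[i] + b[i];
}
"""


def run_gpu_benchmark(n=1 << 22, threads_per_block=256, n_iter=100):
    """Time a + b with NumPy and with PyCUDA; needs a CUDA device."""
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    import pycuda.driver as drv
    from pycuda import gpuarray
    from pycuda.compiler import SourceModule

    add = SourceModule(KERNEL_SRC).get_function("add")
    a = np.random.randn(n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)
    a_gpu, b_gpu = gpuarray.to_gpu(a), gpuarray.to_gpu(b)  # copied once
    dest_gpu = gpuarray.empty_like(a_gpu)
    block = (threads_per_block, 1, 1)
    grid = (blocks_needed(n, threads_per_block), 1)

    # Untimed warm-up launch.
    add(dest_gpu, a_gpu, b_gpu, np.int32(n), block=block, grid=grid)
    drv.Context.synchronize()

    start = time.perf_counter()
    for _ in range(n_iter):
        add(dest_gpu, a_gpu, b_gpu, np.int32(n), block=block, grid=grid)
    drv.Context.synchronize()  # launches are asynchronous
    gpu_s = (time.perf_counter() - start) / n_iter

    start = time.perf_counter()
    for _ in range(n_iter):
        dest = a + b
    cpu_s = (time.perf_counter() - start) / n_iter

    assert np.allclose(dest_gpu.get(), dest)  # both paths must agree
    return cpu_s, gpu_s
```

Calling cpu_s, gpu_s = run_gpu_benchmark() on the Jetson returns the mean seconds per iteration for each path. Whether the GPU comes out ahead, and by how much, depends on the array size and the device, so it is worth trying several values of n.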