PyCUDA runs super slow on Jetson Xavier NX compared to running on CPU

Hi,
I am trying to use CUDA to accelerate my program on a Jetson Xavier NX. I installed PyCUDA with ‘pip install pycuda==2019.1.2’.

When I run the following program on Jetson:

from pycuda import compiler, gpuarray, autoinit, driver
import numpy
import time
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply")

a = numpy.random.randn(1000).astype(numpy.float32)
b = numpy.random.randn(1000).astype(numpy.float32)

a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)

dest = numpy.zeros_like(a)
dest_gpu = gpuarray.to_gpu(dest)

# CPU timing: NumPy elementwise multiply
for i in range(10000):
    st = time.time()
    dest = a*b
    print('endtime = ', time.time() - st)

print('------------')

# GPU timing: launch the CUDA kernel (one block of 1000 threads)
for i in range(10000):
    st = time.time()
    multiply_them(dest_gpu, a_gpu, b_gpu,
                  block=(1000, 1, 1), grid=(1, 1))
    print('endtime = ', time.time() - st)

print(dest-a*b)
print(dest_gpu-a_gpu*b_gpu)

It shows that the CUDA version of the multiply is 5 times slower than NumPy on the CPU.

At first I thought it was an overhead problem, but I checked the time after 800-900 iterations, and the CUDA version is still much slower than the NumPy CPU version.

Is this normal behaviour?

FYI: I don't think this is a general CUDA issue; the Xavier I'm using runs DeepStream and jetson-inference perfectly well at high speed.

Not sure about your case, but be aware that CUDA work may take some time to upload and set up, so it's better to keep separate timings for the first iterations and measure the real throughput once things are up and running.
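For instance, something along these lines separates warm-up from the steady-state measurement (a rough sketch that reuses the kernel, GPU arrays, and the driver import from the post above; the warm-up count of 100 is an arbitrary choice):

# Warm-up: the first launches include module load and other one-time setup.
for i in range(100):
    multiply_them(dest_gpu, a_gpu, b_gpu, block=(1000, 1, 1), grid=(1, 1))
driver.Context.synchronize()

# Steady state: time many launches together and synchronize once at the end,
# so the number reflects real throughput rather than just launch overhead.
n_iter = 10000
st = time.time()
for i in range(n_iter):
    multiply_them(dest_gpu, a_gpu, b_gpu, block=(1000, 1, 1), grid=(1, 1))
driver.Context.synchronize()
print('avg time per launch = ', (time.time() - st) / n_iter)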

Yes, I actually tried printing the time for every iteration inside the loop. It still shows 46-100 times slower than the CPU.

You may want to tell us what CPU you're comparing against.
I'm not a PyCUDA user, so someone else may be able to advise better.

In the example code above, the line running on the CPU is “dest = a*b”, and the line running on the GPU is ‘multiply_them(…’.

Hi,

Here is the underlying mechanism for each iteration in your implementation:

  • Copy a from CPU buffer to GPU buffer
  • Copy b from CPU buffer to GPU buffer
  • Compute a*b with GPU
  • Copy dest from GPU buffer back to CPU buffer

In general, we don’t copy the memory in each loop iteration but only once.
You can modify the implementation to see if it helps.
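A rough sketch of that pattern, reusing the multiply kernel and the a/b arrays from the first post (the array size and iteration count are just placeholders), would be:

# Copy the inputs to the GPU once, outside the loop.
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
dest_gpu = gpuarray.empty_like(a_gpu)

# Only the kernel launch happens per iteration; no host<->device copies.
for i in range(10000):
    multiply_them(dest_gpu, a_gpu, b_gpu, block=(1000, 1, 1), grid=(1, 1))

# Copy the result back to the CPU once at the end.
dest = dest_gpu.get()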

Thanks.

Hi AastaLLL,

Thanks for the reply. I have updated the code above so that it preloads the arrays to the GPU (gpuarray.to_gpu) and then runs the calculation in the loop.

The speed is slightly better, but it still seems to be 3-4 times slower than the CPU runtime on the Jetson.
Is there anything else that might be wrong here?

Hi,

Based on the implementation, each GPU kernel launch only performs a single small calculation.
Since there is essentially no parallel workload, the execution time will be the sum of a sequential job (as on the CPU) plus some kernel-launch overhead.

In general, it’s expected that you run a sufficiently parallel task to benchmark the GPU.
For example, a matrix add with one thread per output index would be a good test.
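A minimal sketch of that idea, keeping the elementwise multiply from the first post but giving the GPU a much larger array and one thread per element (the 1<<20 size, the block size of 256, and the kernel name multiply_big are arbitrary choices here), could look like:

import numpy
from pycuda import gpuarray, autoinit, driver
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_big(float *dest, float *a, float *b, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    dest[i] = a[i] * b[i];
}
""")
multiply_big = mod.get_function("multiply_big")

n = 1 << 20  # 1M elements gives the GPU enough parallel work
a_gpu = gpuarray.to_gpu(numpy.random.randn(n).astype(numpy.float32))
b_gpu = gpuarray.to_gpu(numpy.random.randn(n).astype(numpy.float32))
dest_gpu = gpuarray.empty_like(a_gpu)

block = (256, 1, 1)
grid = ((n + block[0] - 1) // block[0], 1)

# Time with CUDA events so the measurement includes kernel completion.
start, end = driver.Event(), driver.Event()
start.record()
multiply_big(dest_gpu, a_gpu, b_gpu, numpy.int32(n), block=block, grid=grid)
end.record()
end.synchronize()
print('kernel time (ms):', start.time_till(end))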

Thanks.