Hi,
I am trying to use CUDA coding to accelerate my program on Jetson Xavier NX, I installed the pycuda using ‘pip install pycuda==2019.1.2’.
When I run the following program on Jetson:
from pycuda import compiler, gpuarray, autoinit, driver
import numpy
import time
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply")
a = numpy.random.randn(1000).astype(numpy.float32)
b = numpy.random.randn(1000).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
dest = numpy.zeros_like(a)
dest_gpu = gpuarray.to_gpu(dest)
for i in range(10000):
st = time.time()
dest = a*b
print('endtime = ', time.time() - st)
print('------------')
for i in range(10000):
st = time.time()
multiply_them(dest_gpu, a_gpu, b_gpu,
block=(1000, 1, 1), grid=(1, 1))
print('endtime = ', time.time() - st)
print(dest-a*b)
print(dest_gpu-a_gpu*b_gpu)
It shows me that the CUDA version of multiply is 5 times slower than using numpy cpu.
First I thought it was a overhead problem, but i checked the time after 800-900 iteration, the CUDA version is still much slower compared to the CPU numpy.
Is this a normal behaviour?
FYI: I dont think is a general CUDA issue, the xavier im using can run deepstream and jetson-inference perfectly with fast spped
Not sure for your case, but be aware that CUDA stuff may take some time to upload and setup, so better have separate timings for first iterations and get real throughput when things are running later.
Based on the implementation, the GPU kernel is triggered for only one calculation.
Since no parallel job is used, the execution time will be the sum of a sequential job (as CPU) plus some kernel launching overhead.
In general, it’s expected to performance some parallel task to benchmark GPU.
For example, a matrix add with each thread for an output index will be good.