Hi,

I am trying to use CUDA to accelerate my program on a Jetson Xavier NX. I installed PyCUDA with `pip install pycuda==2019.1.2`.

When I run the following program on the Jetson:

```
from pycuda import compiler, gpuarray, autoinit, driver
import numpy
import time
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply")
a = numpy.random.randn(1000).astype(numpy.float32)
b = numpy.random.randn(1000).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
dest = numpy.zeros_like(a)
dest_gpu = gpuarray.to_gpu(dest)
# CPU baseline: elementwise multiply with numpy.
for i in range(10000):
    st = time.time()
    dest = a * b
    print('endtime = ', time.time() - st)
print('------------')
# GPU version: launch the kernel once per iteration.
for i in range(10000):
    st = time.time()
    multiply_them(dest_gpu, a_gpu, b_gpu,
                  block=(1000, 1, 1), grid=(1, 1))
    print('endtime = ', time.time() - st)
# Sanity checks: both differences should be (near) zero.
print(dest - a * b)
print(dest_gpu - a_gpu * b_gpu)
```

The CUDA version of multiply comes out about 5 times slower than the numpy CPU version.

At first I thought it was warm-up overhead, but even after 800-900 iterations the CUDA version is still much slower than the CPU numpy version.
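To make sure the CPU-side numbers themselves are not an artifact of `time.time()` resolution, the numpy timing can be reduced to this self-contained sketch (a minimal sketch on my side, using `time.perf_counter` and taking the minimum over many runs to filter out noise; the vector length of 1000 matches the code above):

```python
import numpy
import time

# Same data shape as in the test above: two random float32 vectors of length 1000.
a = numpy.random.randn(1000).astype(numpy.float32)
b = numpy.random.randn(1000).astype(numpy.float32)

# Repeat the multiply many times and keep the minimum timing;
# individual readings are noisy for an operation this short.
times = []
for _ in range(1000):
    st = time.perf_counter()
    dest = a * b
    times.append(time.perf_counter() - st)

print('best CPU time:', min(times))
```

Even with this, the pattern is the same: the CPU multiply is a very small amount of work per call.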

Is this normal behaviour?

FYI: I don't think this is a general CUDA issue; the Xavier I'm using runs DeepStream and jetson-inference perfectly, at full speed.