Hi,

I am running the following code on a Jetson Xavier NX with the following specs:

- JetPack 5.0.2 - b231
- Ubuntu 20.04
- CUDA 11.4.19
- DeepStream 6.1
- GStreamer 1.16.3

Code:

```
import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
import time

mod = SourceModule("""
__global__ void add_them(int *dest, int *a, int *b, int *c, int *d, int *e)
{
    const int row = blockIdx.y*blockDim.y + threadIdx.y;
    const int col = blockIdx.x*blockDim.x + threadIdx.x;
    int op_val;
    if(row<1280 && col<1280)
    {
        op_val = a[row*1280+col] + b[row*1280+col] + c[row*1280+col] + d[row*1280+col] + e[row*1280+col];
    }
    dest[row*1280+col] = op_val;
}
""")

add = mod.get_function("add_them")

a = numpy.random.randint(1,10,size=(1280,720)).astype(numpy.int8)
b = numpy.random.randint(1,10,size=(1280,720)).astype(numpy.int8)
c = numpy.random.randint(1,10,size=(1280,720)).astype(numpy.int8)
d = numpy.random.randint(1,10,size=(1280,720)).astype(numpy.int8)
e = numpy.random.randint(1,10,size=(1280,720)).astype(numpy.int8)
dest = numpy.zeros_like(a)

st = time.time()
add(drv.Out(dest), drv.In(a), drv.In(b), drv.In(c), drv.In(d), drv.In(e),
    block=(32,20,1), grid=(40,36))
et = time.time()

print(et-st)
print(numpy.matrix(dest-(a+b+c+d+e)).sum())
```

The error I get is:

Traceback (most recent call last):

File "cuda-test.py", line 33, in <module>

multiply(drv.Out(dest), drv.In(a), drv.In(b), drv.In(c), drv.In(d), drv.In(e),

File "/usr/local/lib/python3.8/dist-packages/pycuda-2022.2.2-py3.8-linux-aarch64.egg/pycuda/driver.py", line 505, in function_call

Context.synchronize()

pycuda._driver.LogicError: cuCtxSynchronize failed: an illegal memory access was encountered

I have no issues when operating on matrices of size 40x20 with block dim = (8,5,1) and grid dim = (5,4), but when I use bigger matrices, I run into this error. Could you kindly help? Thanks.

NOTE: The asterisk sign is not getting displayed where required in the code.
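
For reference, here is a host-only sketch (no GPU needed) of the sizes involved in the 1280x720 launch described above. It compares the bytes allocated for one `numpy.int8` array with the bytes the kernel would address through an `int *` pointer. It assumes a C `int` is 4 bytes, which `ctypes.c_int` mirrors on common platforms. The variable names are only for illustration, and this is a sanity check rather than a confirmed diagnosis.

```python
import ctypes

# Launch configuration and row stride copied from the post.
block = (32, 20, 1)
grid = (40, 36)
stride = 1280  # hard-coded in the kernel's index expression

# Host arrays in the post have shape (1280, 720) and dtype int8 (1 byte each).
host_elements = 1280 * 720
host_itemsize = ctypes.sizeof(ctypes.c_int8)
bytes_allocated = host_elements * host_itemsize

# Total threads launched along each axis.
threads_x = block[0] * grid[0]  # range of col
threads_y = block[1] * grid[1]  # range of row

# Largest element index any thread computes: row * stride + col.
max_index = (threads_y - 1) * stride + (threads_x - 1)

# Assumption: the kernel's `int` has the same width as the host C `int`.
kernel_itemsize = ctypes.sizeof(ctypes.c_int)
bytes_addressed = (max_index + 1) * kernel_itemsize

print("threads (x, y):", (threads_x, threads_y))
print("max element index:", max_index)
print("bytes allocated per array:", bytes_allocated)
print("bytes addressed through int*:", bytes_addressed)
print("fits in allocation:", bytes_addressed <= bytes_allocated)
```

If the last line prints `False`, the element width is the first thing I would look at: either creating the host arrays as `numpy.int32` or declaring the kernel pointers as `signed char *` should make the two sides agree. The write to `dest` also sits outside the bounds check in the kernel, which may matter for launches that overshoot the array.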