Unable to perform matrix addition in PyCUDA - Jetson Xavier NX/CUDA 11.4.19

Hi,

I am running the following code on a Jetson Xavier NX with following specs:

  1. JetPack 5.0.2 - b231
  2. Ubuntu 20.04
  3. CUDA 11.4.19
  4. DeepStream 6.1
  5. GStreamer 1.16.3

Code:

import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
import time

mod = SourceModule("""
__global__ void add_them(int *dest, int *a, int *b, int *c, int *d, int *e)
{
    const int row = blockIdx.y*blockDim.y + threadIdx.y;
    const int col = blockIdx.x*blockDim.x + threadIdx.x;
    int op_val;
    if(row<1280 && col<1280)
    {
        op_val = a[row*1280+col] + b[row*1280+col] + c[row*1280+col] + d[row*1280+col] + e[row*1280+col];
    }
    dest[row*1280+col] = op_val;

}
""")

add = mod.get_function("add_them")

a = numpy.random.randint(1,10,size=(1280,720)).astype(numpy.int8)
b = numpy.random.randint(1,10,size=(1280,720)).astype(numpy.int8)
c = numpy.random.randint(1,10,size=(1280,720)).astype(numpy.int8)
d = numpy.random.randint(1,10,size=(1280,720)).astype(numpy.int8)
e = numpy.random.randint(1,10,size=(1280,720)).astype(numpy.int8)

dest = numpy.zeros_like(a)
st = time.time()

add(drv.Out(dest), drv.In(a), drv.In(b), drv.In(c), drv.In(d), drv.In(e),
         block=(32,20,1), grid=(40,36))

et = time.time()
print(et-st)
print(numpy.matrix(dest-(a+b+c+d+e)).sum())

The error I get is:

Traceback (most recent call last):
  File "cuda-test.py", line 33, in <module>
    multiply(drv.Out(dest), drv.In(a), drv.In(b), drv.In(c), drv.In(d), drv.In(e),
  File "/usr/local/lib/python3.8/dist-packages/pycuda-2022.2.2-py3.8-linux-aarch64.egg/pycuda/driver.py", line 505, in function_call
    Context.synchronize()
pycuda._driver.LogicError: cuCtxSynchronize failed: an illegal memory access was encountered

I have no issue when operating on matrices of size 40x20 with Block dim = (8,5,1) and Grid dim = (5,4). But when I use bigger matrices, I run into this issue. Can you kindly help? Thanks

NOTE: The asterisk sign is not getting displayed where required in the code

To properly format your code for this site, you should:

  1. click the edit (pencil) icon below your post
  2. select the code you have posted in the window
  3. press the </> button at the top of the edit pane
  4. save your changes

Please do that now.

My guess is the problem has to do with using int variables in the kernel prototype, but numpy.int8 variables in your data setup. These types don’t match.
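For example, one way to make the types consistent (a minimal sketch, assuming you keep the int kernel parameters) is to generate the host arrays as numpy.int32, so each element is 4 bytes wide and matches what the kernel reads and writes:

# assumes the kernel signature stays "int *dest, int *a, ..."
# numpy.int32 elements are 4 bytes, matching the kernel's int loads/stores
a = numpy.random.randint(1, 10, size=(1280, 720)).astype(numpy.int32)
b = numpy.random.randint(1, 10, size=(1280, 720)).astype(numpy.int32)
c = numpy.random.randint(1, 10, size=(1280, 720)).astype(numpy.int32)
d = numpy.random.randint(1, 10, size=(1280, 720)).astype(numpy.int32)
e = numpy.random.randint(1, 10, size=(1280, 720)).astype(numpy.int32)
dest = numpy.zeros_like(a)

# alternatively, keep numpy.int8 on the host and change the kernel
# parameters to "signed char *" so the element sizes match that way instead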


I was able to resolve the issue when I changed the type of numpy variables to float32 and kernel prototype variables to float. Thanks

I am new to CUDA programming and I am working on adding 3-d arrays using PyCUDA. I am having trouble with assigning the right thread indices in the kernel. Here is the code -


import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
import time
import vpi

mod = SourceModule("""
__global__ void add_them(float *dest, float *a, float *b)
{
    int idx = blockIdx.y*blockDim.y + threadIdx.y;
    int idy = blockIdx.x*blockDim.x + threadIdx.x;
    int idz = blockIdx.z*blockDim.z + threadIdx.z;

    float op_val;
    if (idx < 1280 && idy < 1280 && idz < 1280)
    {
        op_val = a[idx*1280+idy+idz*921600] + b[idx*1280+idy+idz*921600]; 
    }
    dest[idx*1280+idy+idz*921600] = op_val;
}
""")

a = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
b = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
dest = numpy.zeros_like(a)

addition = mod.get_function("add_them")

addition(drv.Out(dest), drv.In(a), drv.In(b),
         block=(20,17,3), grid=(64,43))

print(((dest)-(a+b)).sum())

And the error message I get is

  File "/usr/local/lib/python3.8/dist-packages/pycuda-2022.2.2-py3.8-linux-aarch64.egg/pycuda/driver.py", line 505, in function_call
    Context.synchronize()
pycuda._driver.LogicError: cuCtxSynchronize failed: an illegal memory access was encountered

Can you kindly direct me to the right indexing technique in the kernel for 3-d arrays? Thanks

You generally want this statement:

    if (idx < 1280 && idy < 1280 && idz < 1280)

to accurately reflect your problem dimensions. Your problem dimension appears to be 1280x720x3. So you should be testing against each of those limits for the appropriate indices (x against 1280, y against 720, z against 3), and the usual way to do this is not to hard-code them but to pass them as kernel parameters.
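A minimal sketch of that idea (not your exact kernel, and reusing the a, b and dest float32 arrays from the code above) might look like this, with the limits passed in as numpy.int32 scalars rather than hard-coded:

mod = SourceModule("""
__global__ void add_them(float *dest, float *a, float *b, int width, int height, int depth)
{
    const int col = blockIdx.x*blockDim.x + threadIdx.x;   // 0 .. width-1
    const int row = blockIdx.y*blockDim.y + threadIdx.y;   // 0 .. height-1
    const int dep = blockIdx.z*blockDim.z + threadIdx.z;   // 0 .. depth-1

    if (col < width && row < height && dep < depth)
    {
        const int i = dep*width*height + row*width + col;   // flat element index
        dest[i] = a[i] + b[i];
    }
}
""")

add = mod.get_function("add_them")
# scalar kernel arguments are passed as numpy scalars, not wrapped in drv.In()
add(drv.Out(dest), drv.In(a), drv.In(b),
    numpy.int32(1280), numpy.int32(720), numpy.int32(3),
    block=(16, 16, 3), grid=(80, 45, 1))

Here 16x80 covers the 1280 limit, 16x45 covers 720, and 3x1 covers the depth of 3; the guard then discards nothing, but it would also protect you if the launch over-covered the problem.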

Thanks for the reply. I have tried using the appropriate indices you mentioned -

kernel_finalImageRender = SourceModule("""
__global__ void finalImageRender(float *finalImg, float *front, float *left, float *bottom, float *right)
{
    const int row = blockIdx.y*blockDim.y + threadIdx.y;
    const int col = blockIdx.x*blockDim.x + threadIdx.x;
    const int dep = blockIdx.z*blockDim.z + threadIdx.z;
    float op;

    if(row<1280 && col<720 && dep<3)
    {
        op = front[row*1280+col+dep*1280*720] + left[row*1280+col+dep*1280*720] + bottom[row*1280+col+dep*1280*720] + right[row*1280+col+dep*1280*720];
    }
    finalImg[row*1280+col+dep*1280*720] = op;

}
""")

kernelObj_finalImageRender = kernel_finalImageRender.get_function("finalImageRender")

left = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
front = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
bottom = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
right = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
finalImg = numpy.zeros_like(left)


kernelObj_finalImageRender(drv.Out(finalImg), 
                           drv.In(left), drv.In(front), drv.In(bottom), drv.In(right),
                           block=(20,17,3), grid=(113,24))
print(((finalImg)-(left+front+bottom+right)).sum())

But I get 8834973000.0 as the actual result whereas my expected result is 0.0

I also tried passing the dimensions as kernel parameters instead of hard-coding them.

import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
import time

kernel_finalImageRender = SourceModule("""
__global__ void finalImageRender(float *finalImg, float *front, float *left, float *bottom, float *right, int width, int height, int dim)
{
    const int row = blockIdx.y*blockDim.y + threadIdx.y;
    const int col = blockIdx.x*blockDim.x + threadIdx.x;
    const int dep = blockIdx.z*blockDim.z + threadIdx.z;
    float op;

    if(row<width && col<height && dep<dim)
    {
        op = front[row*width+col+dep*width*height] + left[row*width+col+dep*width*height] + bottom[row*width+col+dep*width*height] + right[row*width+col+dep*width*height];
    }
    finalImg[row*width+col+dep*width*height] = op;

}
""")

kernelObj_finalImageRender = kernel_finalImageRender.get_function("finalImageRender")

left = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
front = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
bottom = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
right = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
finalImg = numpy.zeros_like(left)

randArrWidth = numpy.int32(left.shape[0])
randArrHeight = numpy.int32(left.shape[1])
randArrDim = numpy.int32(left.shape[2])

kernelObj_finalImageRender(drv.Out(finalImg), 
                           drv.In(left), drv.In(front), drv.In(bottom), drv.In(right),
                           drv.In(randArrWidth), drv.In(randArrHeight), drv.In(randArrDim),
                           block=(20,17,3), grid=(113,24))
print(((finalImg)-(left+front+bottom+right)).sum())

and the error I get is

Context.synchronize()
pycuda._driver.LogicError: cuCtxSynchronize failed: an illegal memory access was encountered

Could you kindly advise? Thanks

You're mixing up the row and column dimensions and limits. The first dimension of your problem is 1280. That corresponds to the width, or col, and it should correspond to your kernel .x variables, and you should be testing it against width.

Your next problem dimension is 720. That corresponds to the height, or row, and it should correspond to your kernel .y variables, and you should be testing it against height.

Finally, your 3rd problem dimension is the depth, corresponding to 3, and the kernel .z variables. You should be testing that index (dep) against the depth limit.

Furthermore, drv.In() is what you use to pass a pointer parameter (with implicit copying of the pointed-to data). You do not use it to pass non-pointer parameters. Please study any available PyCUDA sample code, such as vector addition.
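As a sketch of the calling convention (using your variable names, and leaving your block/grid numbers untouched), the scalar arguments would be passed as plain numpy.int32 values while the arrays keep drv.In()/drv.Out():

# pointer parameters: wrapped in drv.In()/drv.Out() so PyCUDA copies the data
# scalar parameters: passed directly as numpy.int32 values, no wrapper
kernelObj_finalImageRender(drv.Out(finalImg),
                           drv.In(left), drv.In(front), drv.In(bottom), drv.In(right),
                           numpy.int32(left.shape[0]),   # width  = 1280
                           numpy.int32(left.shape[1]),   # height = 720
                           numpy.int32(left.shape[2]),   # dim    = 3
                           block=(20,17,3), grid=(113,24))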

Thanks for the clarification. I made the changes you recommended but am still not getting the expected result.

import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
import time

kernel_finalImageRender = SourceModule("""
__global__ void finalImageRender(float *finalImg, float *front, float *left, float *bottom, float *right, int width, int height, int dim)
{
    const int row = blockIdx.y*blockDim.y + threadIdx.y;
    const int col = blockIdx.x*blockDim.x + threadIdx.x;
    const int dep = blockIdx.z*blockDim.z + threadIdx.z;
    float op;

    if(row < height && col < width && dep < dim)
    {
        op = front[row*width+col+dep*width*height] + left[row*width+col+dep*width*height] + bottom[row*width+col+dep*width*height] + right[row*width+col+dep*width*height];
    }
    finalImg[row*width+col+dep*width*height] = op;

}
""")

kernelObj_finalImageRender = kernel_finalImageRender.get_function("finalImageRender")

left = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
front = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
bottom = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
right = numpy.random.randint(1,10,size=(1280,720,3)).astype(numpy.float32)
finalImg = numpy.zeros_like(left)

randArrWidth = numpy.int32(left.shape[0])
randArrHeight = numpy.int32(left.shape[1])
randArrDim = numpy.int32(left.shape[2])

kernelObj_finalImageRender(drv.Out(finalImg), 
                           drv.In(left), drv.In(front), drv.In(bottom), drv.In(right),
                           randArrWidth, randArrHeight, randArrDim,
                           block=(20,17,3), grid=(113,24))
print(((finalImg)-(left+front+bottom+right)).sum())

The expected output is 0 but I get a non-zero number.

Thanks for sharing the link. I did go through the samples but didn't find any 3-d matrix scenarios. I did try the ElementwiseKernel technique but it's adding some latency to the execution.

Kindly advise. Thanks

You have a problem here:

    block=(20,17,3), grid=(113,24))

You need to spin up enough threads in the grid to cover the entire image space. That means that the product of the x dimensions (block.x times grid.x) must be equal to or greater than the width of your image, the product of the y dimensions must be equal to or greater than the height of your image, and the product of the z dimensions must be equal to or greater than your depth. I'm not really sure why you have chosen the numbers you have, but if we run through that arithmetic:

20x113=2260 which is greater than 1280
17x24=408 which is less than 720
3x1=3 which is equal to your depth

So you have a problem with the grid y dimension.

You could fix it as follows:

                       block=(20,17,3), grid=(113,43))

(and the number 113 doesn't need to be that large; 64 blocks of 20 threads already cover 1280, but it's not causing a results problem)
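A common way to size the grid (a sketch, assuming the 1280x720x3 shape and variable names from your code) is ceiling division of each problem dimension by the corresponding block dimension, rather than picking the numbers by hand:

block = (20, 17, 3)
width, height, depth = 1280, 720, 3

# ceiling division: just enough blocks to cover each dimension
grid = ((width  + block[0] - 1) // block[0],   # 64 blocks of 20 threads covers 1280
        (height + block[1] - 1) // block[1],   # 43 blocks of 17 threads covers 720
        (depth  + block[2] - 1) // block[2])   # 1 block of 3 threads covers 3

kernelObj_finalImageRender(drv.Out(finalImg),
                           drv.In(left), drv.In(front), drv.In(bottom), drv.In(right),
                           randArrWidth, randArrHeight, randArrDim,
                           block=block, grid=grid)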

You have another problem in the kernel itself.

As written, that kernel is protected from reading out of bounds by the if-statement, but not from writing out of bounds. The final assignment statement:

  finalImg[row*width+col+dep*width*height] = op;

should be in the body of the if-statement.

With those two changes, your code prints out a value of 0.0 for me.
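For reference, a sketch of the corrected kernel (same signature as yours) with the read and the write both inside the guard:

__global__ void finalImageRender(float *finalImg, float *front, float *left, float *bottom, float *right, int width, int height, int dim)
{
    const int col = blockIdx.x*blockDim.x + threadIdx.x;
    const int row = blockIdx.y*blockDim.y + threadIdx.y;
    const int dep = blockIdx.z*blockDim.z + threadIdx.z;

    if(col < width && row < height && dep < dim)
    {
        const int i = row*width + col + dep*width*height;   // same flat index for read and write
        finalImg[i] = front[i] + left[i] + bottom[i] + right[i];
    }
    // no write outside the guard, so out-of-range threads do nothing
    // instead of storing an uninitialized value out of bounds
}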


Thank you for the clarification. I was able to resolve it on my end as well.
