Need help in implementing Matrix multiplication using Shared Memory in Numba

I am getting error as “Your code did not produce the correct output”. Can anyone please help me in resolving this. I have pasted the code below:

import numpy as np
from numba import cuda, types
@cuda.jit
def mm_shared(a, b, c):
column, row = cuda.grid(2)
sum = 0

# `a_cache` and `b_cache` are already correctly defined
a_cache = cuda.shared.array(block_size, types.int32)
b_cache = cuda.shared.array(block_size, types.int32)

# TODO: use each thread to populate one element each a_cache and b_cache
if column >= c.shape[0] and row >= c.shape[1]:
    return

for i in range(cuda.gridDim.x):
    # Preload data into shared memory
    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[column, cuda.threadIdx.y + i * block_size[0]]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[cuda.threadIdx.x + i * block_size[0], row]

    # Wait until all threads finish preloading
    cuda.syncthreads()
    for j in range(block_size[0]):
        # TODO: calculate the `sum` value correctly using values from the cache 
        sum += a_cache[cuda.threadIdx.x][j] * b_cache[j][cuda.threadIdx.y]
    cuda.syncthreads()
    
c[column][row] = sum

Driver Code to test above Function

import numpy as np
from numba import cuda, types

Leave the values in this cell alone

M = 128
N = 32

Input vectors of MxN and NxM dimensions

a = np.arange(MN).reshape(M,N).astype(np.int32)
b = np.arange(M
N).reshape(N,M).astype(np.int32)
c = np.zeros((M, M)).astype(np.int32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device©

NxN threads per block, in 2 dimensions

block_size = (N,N)

MxM/NxN blocks per grid, in 2 dimensions

grid_size = (int(M/N),int(M/N))
print(type(block_size))
print()

mm_shared[grid_size, block_size](d_a, d_b, d_c)

This may be of interest:

https://numba.pydata.org/numba-doc/dev/cuda/examples.html

No, it is also not working. The code in the above link not even passing the first phase of testing.

1 Like

The examples Robert directed you to work as expected. I’ve provided an example, using the link above, solving a 32x32. I’ll let you expand it to your problem size. The only thing missing from the link is kernel instantiation.

I suggest you re-read Section 3.2 before continuing https://numba.pydata.org/numba-doc/dev/cuda/kernels.html.

from numba import cuda, float32
import numpy as np

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp
    
@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B
    """
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp


# Main
np.random.seed(1234)
M = 32
N = 32

A = np.random.randint(5, size=(M, N))
B = np.random.randint(5, size=(M, N))
C1 = np.zeros_like(A)
C2 = np.zeros_like(A)
C3 = np.zeros_like(A)

print(A)
print(B)

# Using NumPy
C1 = A.dot(B)
print(C1)

# Using naive Numba
matmul[(2,2),(16,16)](A,B,C2)
print(C2)

# Using Shared Memory Numba
fast_matmul[(2,2),(16,16)](A,B,C3)
print(C3)

print(np.array_equal(C1,C2))
print(np.array_equal(C1,C3))

I did add the code for kernel invocation and tested. I copied the code from the link provided by the Robert and modified it according the exercise given in the jupyter notebook.

I have pasted the entire code below. Please look into it.

import numpy as np
from numba import cuda, types
@cuda.jit
def mm_shared(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    TPB = N
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=types.float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=types.float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp


# Driver Code for kernel launching

M = 128
N = 32

# Input vectors of MxN and NxM dimensions
a = np.arange(M*N).reshape(M,N).astype(np.int32)
b = np.arange(M*N).reshape(N,M).astype(np.int32)
c = np.zeros((M, M)).astype(np.int32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

# NxN threads per block, in 2 dimensions
block_size = (N,N)
# MxM/NxN blocks per grid, in 2 dimensions
grid_size = (int(M/N),int(M/N))


# Launching the Kernel
mm_shared[grid_size, block_size](d_a, d_b, d_c)

The problem is this code is passing the below test which given in the jupyter notebook but not the final testing.

# Do not modify the contents in this cell
from numpy import testing
solution = a@b
output = d_c.copy_to_host()
# This assertion will fail until you correctly update the kernel above.
testing.assert_array_equal(output, solution)

The final testing is done when I click the “Assess Task” button in the course homepage. I am stuck at this point, there is no proper error message is being displayed there. It only tells that “Your Code did not produce the correct output”. I am pasting the complete error that was displayed below.

Grade Feedback
Your code did not produce the correct output. +0 pts
Your code is fast enough. +50 pts
You did not pass, please try again.
Score: 50/100

1 Like

Your code is broken. I’m not going to debug your code for you. I’ll give you some examples of how it is broken.

This is wrong:

if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary

That and in the first line should be an or.

This is wrong:

grid_size = (int(M/N),int(M/N))

Note that the purpose of the exercise here (DLI intro to numba CUDA python) is not for others to debug the code for you. It is for you to get the code working correctly to pass the test.

1 Like

I am encountering the same issue in “Multidimensional Grids and Shared Memory for CUDA Python with Numba”
Assertion did not fail, but grade feedback shows “Your code did not produce the correct output. +0 pts”
Code below FYR. As the instruction stated, only the “assessment part” is required for modification.

import numpy as np
from numba import cuda
# Leave the values in this cell alone
M = 128
N = 32

# Input vectors of MxN and NxM dimensions
a = np.arange(M*N).reshape(M,N).astype(np.int32)
b = np.arange(M*N).reshape(N,M).astype(np.int32)
c = np.zeros((M, M)).astype(np.int32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

# NxN threads per block, in 2 dimensions
block_size = (N,N)
# MxM/NxN blocks per grid, in 2 dimensions
grid_size = (int(M/N),int(M/N))

assessment part

import numpy as np
from numba import cuda, types
@cuda.jit
def mm_shared(a, b, c):
    column, row = cuda.grid(2)
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)

    # TODO: use each thread to populate one element each a_cache and b_cache
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x
    
    for i in range(bpg):
        # Preload data into shared memory
        a_cache[tx, ty] = a[column, ty + i * N]
        b_cache[tx, ty] = b[tx + i * N, row]
 
        # Wait until all threads finish preloading
        cuda.syncthreads()
 
        # Computes partial product on the shared memory
        for j in range(N):
            sum += a_cache[tx, j] * b_cache[j, ty]
 
        # Wait until all threads finish computing
        cuda.syncthreads()
 
    c[column, row] = sum
# There's no need to update this kernel launch
mm_shared[grid_size, block_size](d_a, d_b, d_c)
# Do not modify the contents in this cell
from numpy import testing
solution = a@b
output = d_c.copy_to_host()
# This assertion will fail until you correctly update the kernel above.
print(testing.assert_array_equal(output, solution))
1 Like

Its frustrated to debug the code while the assertion do not fail at all
https://imgur.com/BcIerSJ

1 Like

I think there must be bug in the exercise given by the Nvidia. If you look into Robert comment above, he have mentioned that code

grid_size = (int(M/N),int(M/N))

is wrong. But this code is provided by Nvidia and we were not allowed to modify it as a part of the assignment. So, there is something wrong from their side.

I understand your frustration even I went through same before coming to the forum. It dint help here.

The main problem as I said above is that, no proper error message is displayed when we click the Assess button in the course page. Therefore, it is extremely difficult to debug the code or I do not find any way to debug the code after clicking assess button.

2 Likes

Let me clarify a few things.

  1. I teach that course. I have taught it many times. The final test works correctly if done properly. Yes, it can be difficult.

  2. When I said the above code was “wrong” I was referring to a general implementation. The specific issue is that that formulation only works if M is whole-number divisible by N (you should be able to figure that out just by studying that line of code.) So, for a general implementation of matrix-matrix multiply, you would not want to write it that way. However, for a test question, if the instructor knows during the assessment that the problem will only involve testing against cases where M is whole-number divisible by N, then there is nothing wrong with that formulation for testing purposes. It doesn’t mean that the test question itself is wrong.

  3. Developing a fully worked answer to a test question that NVIDIA regularly uses and posting it publicly, would be kind of silly. So I’m not going debug your test answer fully for you here, and post it. And if I observe what appears to be a fully debugged proper answer, I’m going to delete it, or perhaps delete this whole thread. This isn’t the place to do that. I don’t recommend doing that on any public forum.

1 Like

Eventually figured out the issue. Its related to the relationship between “size of shared memory” and those (M,N) or (N,M).

2 Likes

Can you please elaborate what do mean by relationship between size of shared memory and matrices? Is different loop required to load data into shared memory?

Can you please at least tell where assessment fails (any code above) because it is definitely not and/or problem nor int(M/N) part?
Like training mentioned it is frustrating to debug when you are not getting any error messages and have no idea what is wrong.

3 Likes

This exercise is rather frustrating and perhaps illustrates that my understanding of the fundamentals is not correct. However I too have also looked at the code provided from https://numba.pydata.org/numba-doc/dev/cuda/examples.html

But I think I do not grasp what is going on very well despite reading examples, and examples on the internet, there appears rather a leap of what is happening and going on with this particular example and not being able to debug easily makes this a very frustrating experience.

Are there any resources that people can share that have illuminated their understanding?

1 Like

block_size: (32, 32)
grid_size: (4, 4)

@Robert_Crovella I agree with bal80n; I’ve also implemented the solution correctly but can not get the Assessment to work, despite having read the lesson numerous times and passing the test in the Jupyter Notebook. There is no clear message indicating the source of the error. Any help on this is appreciated.

Has there been an update?

Have you got any clues yet?

Can you please explain further

There is no assertion error for me in the Jupyter notebook but when I submit the assignment it tells me that the output is wrong.
Can anyone please help here?

Thanks