Help with NVIDIA DLI Class "Fundamentals of Accelerated Computing with CUDA Python"

I recently paid for and started going through the class “Fundamentals of Accelerated Computing with CUDA Python”.

  1. In the very last section, "Multidimensional Grids and Shared Memory for CUDA Python with Numba", the lesson asks us to do an assignment as follows:

In this exercise you will complete a matrix multiply kernel that will use shared memory to cache values from the input matrices so that they only need be accessed from global memory once, after which calculations for a thread’s output element can utilize the cached values. The purpose of this assessment is to test your ability to reason about a 2D parallel problem and utilize shared memory. This particular problem doesn’t have a ton of arithmetic intensity, and we are not going to use a huge dataset, so we will likely not see big speedups vs. the very simple CPU version. However, the techniques asked of you here will serve you in a wide range of situations where you genuinely wish to accelerate a program involving a 2D dataset.

To keep the focus on shared memory, this problem assumes input vectors of MxN and NxM dimensions with NxN threads per block and M/N blocks per grid. This means that shared memory caches with elements equal to the number of threads per block will be sufficient to provide all elements from the input matrices necessary for the calculations, and that no grid striding will be required.

The following image shows the input matrices, the output matrix, a region of the output matrix that a block will calculate values for, the regions in the input matrices that this block will cache, and also the output element and input elements for a single thread in that block:

The shared memory caches have already been allocated in the kernel; your task is twofold:

  1. Use each thread in the block to populate one element in each of the caches.
  2. Use the shared memory caches in calculating each thread’s sum value.

Be sure to do any thread synchronization that might be required, so that no thread reads cached values that other threads have not yet finished writing.

I have written code that solves this; I have tested it on:

  1. A super computer at the University of Florida
  2. My own personal rig
  3. The High Altitude Observatory's cluster.

Can someone out here tell me what exactly this asks of me? I have looked at this thread to no avail.

Can someone please help me?

Here is the code that I have tested and that works:

import numpy as np
from numba import cuda, types
from numpy import testing
# Leave the values in this cell alone
M = 128 
N = 32 
# Input matrices of MxN and NxM dimensions
a = np.arange(M*N).reshape(M,N).astype(np.int32)
b = np.arange(M*N).reshape(N,M).astype(np.int32)
c = np.zeros((M, M)).astype(np.int32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)
# NxN threads per block, in 2 dimensions
block_size = (N, N)
# (M/N) x (M/N) blocks per grid, in 2 dimensions
grid_size = (M // N, M // N)

@cuda.jit
def mm_shared(a, b, c):
    column, row = cuda.grid(2)
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    # TODO: use each thread to populate one element in each of a_cache and b_cache
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x
    if column >= c.shape[0] or row >= c.shape[1]:
        # Quit if (column, row) is outside of valid C boundary
        return
    for i in range(bpg):  # bpg == gridDim.x == M // N == 4 here
        # Preload data into shared memory
        a_cache[tx, ty] = a[column, ty + i * N]
        b_cache[tx, ty] = b[tx + i * N, row]
        # Wait until all threads finish preloading
        cuda.syncthreads()
        # Compute partial product on the shared memory
        for j in range(N):
            sum += a_cache[tx, j] * b_cache[j, ty]
        # Wait until all threads finish computing
        cuda.syncthreads()
    c[column, row] = sum
# There's no need to update this kernel launch
mm_shared[grid_size, block_size](d_a, d_b, d_c)
# Do not modify the contents in this cell
solution = a@b
output = d_c.copy_to_host()
# This assertion will fail until you correctly update the kernel above.
print(testing.assert_array_equal(output, solution))

But it does not pass when I submit the assignment.