Hello,

I recently paid for and started going through the class “Fundamentals of Accelerated Computing with CUDA Python”.

In the very last section, "Multidimensional Grids and Shared Memory for CUDA Python with Numba", the lesson asks us to do an assignment, as follows:

In this exercise you will complete a matrix multiply kernel that will use shared memory to cache values from the input matrices so that they only need be accessed from global memory once, after which calculations for a thread’s output element can utilize the cached values. The purpose of this assessment is to test your ability to reason about a 2D parallel problem and utilize shared memory. This particular problem doesn’t have a ton of arithmetic intensity, and we are not going to use a huge dataset, so we will likely not see big speedups vs. the very simple CPU version. However, the ability to use the techniques asked of you will serve you in a wide number of situations where you will genuinely wish to accelerate some program involving a 2D dataset.

To keep the focus on shared memory, this problem assumes input vectors of MxN and NxM dimensions with NxN threads per block and M/N blocks per grid. This means that shared memory caches with elements equal to the number of threads per block will be sufficient to provide all elements from the input matrices necessary for the calculations, and that no grid striding will be required.

The following image shows the input matrices, the output matrix, a region of the output matrix that a block will calculate values for, the regions in the input matrices that this block will cache, and also the output element and input elements for a single thread in that block:

The shared memory caches have already been allocated in the kernel; your task is twofold:

- Use each thread in the block to populate one element in each of the caches.
- Use the shared memory caches in calculating each thread’s `sum` value.

Be sure to do any thread synchronizing that might be required to avoid cached values written by other threads not yet being available.
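For reference, here is how I read the two steps above, as a minimal sketch in plain Python rather than the course's actual kernel. It emulates the blocks and threads with ordinary loops, so it runs without a GPU. The function name `matmul_tiled_emulation` and all variable names are my own, and I am assuming that "M/N blocks per grid" means M/N blocks along each grid dimension. In a real Numba kernel the indices would come from `cuda.grid(2)` and `cuda.threadIdx`, the caches from `cuda.shared.array`, and the boundary between the two phases would be a `cuda.syncthreads()` call.

```python
def matmul_tiled_emulation(a, b, n):
    """Emulate the block/thread scheme on the CPU.

    a is an M x n matrix, b is an n x M matrix (lists of lists),
    and the result is M x M. Each emulated block has n x n threads.
    """
    m = len(a)
    assert m % n == 0, "M must be a multiple of N (no grid striding)"
    c = [[0] * m for _ in range(m)]
    blocks_per_dim = m // n  # assumption: M/N blocks along each grid axis

    for by in range(blocks_per_dim):
        for bx in range(blocks_per_dim):
            # Stand-ins for the per-block shared memory caches.
            a_cache = [[0] * n for _ in range(n)]
            b_cache = [[0] * n for _ in range(n)]

            # Step 1: every thread (ty, tx) writes exactly one element
            # into each cache.
            for ty in range(n):
                for tx in range(n):
                    row = by * n + ty  # global output row of this thread
                    col = bx * n + tx  # global output column of this thread
                    a_cache[ty][tx] = a[row][tx]
                    b_cache[ty][tx] = b[ty][col]

            # On a GPU a barrier (cuda.syncthreads()) is needed here, so
            # that no thread reads a cache entry before it is written.

            # Step 2: every thread computes its sum from the caches only.
            for ty in range(n):
                for tx in range(n):
                    total = 0
                    for k in range(n):
                        total += a_cache[ty][k] * b_cache[k][tx]
                    c[by * n + ty][bx * n + tx] = total
    return c
```

The emulation finishes all cache writes for a block before any reads begin, which is exactly the ordering the barrier has to guarantee on real hardware.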

I have written code that solves this, and I have tested it on:

- A supercomputer at the University of Florida
- My own personal rig
- The High Altitude Observatories cluster.

Can someone here tell me explicitly what this assignment asks of me? I have looked at this thread to no avail.

Can someone please help me?