Problem with the final assessment of “Getting Started with Accelerated Computing in Modern CUDA C++”

Hi, sorry to bother you. I have an issue with the last exercise of the final assessment of the course “Getting Started with Accelerated Computing in Modern CUDA C++”.

The last exercise requires using cub::BlockReduce to compute average values for a coarse grid. Here is the code:

%%writefile Sources/coarse.cu
#include "dli.h"

__global__ void kernel(dli::temperature_grid_f fine,
                       dli::temperature_grid_f coarse) {
  //int coarse_row = blockIdx.x / coarse.extent(1);
  //int coarse_col = blockIdx.x % coarse.extent(1);
  //int row = threadIdx.x / dli::tile_size;
  //int col = threadIdx.x % dli::tile_size;
  //int fine_row = coarse_row * dli::tile_size + row;
  //int fine_col = coarse_col * dli::tile_size + col;

  //float thread_value = fine(fine_row, fine_col);

  // FIXME(Step 3):
  // Compute the sum of `thread_value` across threads of a thread block
  // using `cub::BlockReduce`
  //using BlockReduce = cub::BlockReduce<float, dli::block_threads>;

  // Allocate shared memory for BlockReduce
  //__shared__  BlockReduce::TempStorage temp_storage;
  //float block_sum = BlockReduce{temp_storage}.Sum(thread_value);

  //__syncthreads();

  // FIXME(Step 3):
  // `cub::BlockReduce` returns block sum in thread 0, make sure to write
  // result only from the first thread of the block
  //if (threadIdx.x == 0)
  //    coarse(coarse_row, coarse_col) = block_sum;
}

// Don't change the signature of this function
void coarse(dli::temperature_grid_f fine, dli::temperature_grid_f coarse) {
  kernel<<<coarse.size(), dli::block_threads>>>(fine, coarse);
}
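
For readers who want to try the pattern outside the course environment, here is a minimal standalone sketch of the block-wide tile sum that the commented-out lines describe. It is not the course's reference solution. It assumes raw pointers instead of `dli::temperature_grid_f`, and the names `tile_sum_kernel` and `coarse_sums`, as well as the values chosen for `tile_size` and `block_threads`, are hypothetical stand-ins, since the contents of `dli.h` are not shown in the post.

```cuda
#include <cub/block/block_reduce.cuh>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical stand-ins for dli::tile_size and dli::block_threads.
constexpr int tile_size = 16;
constexpr int block_threads = tile_size * tile_size;

// One thread block per coarse cell; each thread loads one fine cell of the tile.
__global__ void tile_sum_kernel(const float* fine, float* coarse,
                                int fine_width, int coarse_width) {
  int coarse_row = blockIdx.x / coarse_width;
  int coarse_col = blockIdx.x % coarse_width;
  int row = threadIdx.x / tile_size;
  int col = threadIdx.x % tile_size;
  int fine_row = coarse_row * tile_size + row;
  int fine_col = coarse_col * tile_size + col;

  float thread_value = fine[fine_row * fine_width + fine_col];

  // Block-wide sum; the temporary storage lives in shared memory.
  using BlockReduce = cub::BlockReduce<float, block_threads>;
  __shared__ BlockReduce::TempStorage temp_storage;
  float block_sum = BlockReduce(temp_storage).Sum(thread_value);

  // The result is only valid in thread 0, so only thread 0 writes it.
  // If an average rather than a sum is wanted, divide by block_threads here.
  if (threadIdx.x == 0) {
    coarse[blockIdx.x] = block_sum;
  }
}

// Host helper (hypothetical): copies the fine grid to the device, launches one
// block per coarse cell, and returns the per-tile sums.
std::vector<float> coarse_sums(const std::vector<float>& fine,
                               int coarse_height, int coarse_width) {
  int fine_width = coarse_width * tile_size;
  int coarse_cells = coarse_height * coarse_width;

  float* d_fine = nullptr;
  float* d_coarse = nullptr;
  cudaMalloc(&d_fine, fine.size() * sizeof(float));
  cudaMalloc(&d_coarse, coarse_cells * sizeof(float));
  cudaMemcpy(d_fine, fine.data(), fine.size() * sizeof(float),
             cudaMemcpyHostToDevice);

  tile_sum_kernel<<<coarse_cells, block_threads>>>(d_fine, d_coarse,
                                                   fine_width, coarse_width);

  std::vector<float> result(coarse_cells);
  cudaMemcpy(result.data(), d_coarse, coarse_cells * sizeof(float),
             cudaMemcpyDeviceToHost);
  cudaFree(d_fine);
  cudaFree(d_coarse);
  return result;
}

int main() {
  const int coarse_height = 2;
  const int coarse_width = 3;
  // Fine grid filled with ones, so every 16 x 16 tile sums to 256.
  std::vector<float> fine(coarse_height * tile_size * coarse_width * tile_size,
                          1.0f);
  for (float s : coarse_sums(fine, coarse_height, coarse_width)) {
    std::printf("%.1f\n", s);
  }
  return 0;
}
```

Note that no `__syncthreads()` is needed after the reduction in this sketch, because `temp_storage` is not reused afterwards.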

It works correctly, but it is not fast enough to pass the assessment. As you can see, I commented out all the code inside the kernel to check whether I reach the throughput required by the exercise when the kernel does nothing, and I do not! (image below)

The required throughput is more than 3.5 billion cells per second. What am I doing wrong? Do I need to modify something else? It is not clear from the instructions.
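
One way to sanity-check a throughput figure like this outside the grader is to time an empty kernel with CUDA events and convert the elapsed time into cells per second. The sketch below is only an illustration under stated assumptions: how the course harness measures throughput is not shown in the post, so the choice to count fine-grid cells, the 4096 x 4096 problem size, and the helper names `cells_per_second` and `average_launch_ms` are all hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Empty kernel, used only to measure the cost of launching the grid.
__global__ void empty_kernel() {}

// Converts a cell count and an elapsed time in milliseconds to cells per second.
double cells_per_second(long long cells, float elapsed_ms) {
  return static_cast<double>(cells) / (elapsed_ms * 1e-3);
}

// Times `launches` back-to-back launches of the empty kernel and returns the
// average time per launch in milliseconds.
float average_launch_ms(int blocks, int threads, int launches) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  empty_kernel<<<blocks, threads>>>();  // warm-up launch, not timed
  cudaDeviceSynchronize();

  cudaEventRecord(start);
  for (int i = 0; i < launches; ++i) {
    empty_kernel<<<blocks, threads>>>();
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float total_ms = 0.0f;
  cudaEventElapsedTime(&total_ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return total_ms / launches;
}

int main() {
  // Hypothetical problem size: 4096 x 4096 fine cells, one block per 16 x 16 tile.
  const long long fine_cells = 4096LL * 4096LL;
  const int threads = 256;
  const int blocks = static_cast<int>(fine_cells / threads);

  float ms = average_launch_ms(blocks, threads, 100);
  std::printf("avg %.4f ms per launch, %.3e cells/s\n", ms,
              cells_per_second(fine_cells, ms));
  return 0;
}
```

If the empty kernel already falls short of the target in a measurement like this, the limit is more likely in the launch configuration or in how the measurement is taken than in the reduction itself.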

For DLI course assistance, please visit the DLI Support Page to open a support case: NVIDIA Deep Learning Institute (DLI) Support

This ensures your request is directed to the appropriate team for timely and accurate support.
When submitting your case, please make sure to select the product type “NVIDIA Training and Certification”.