Hi, sorry to bother you. I have an issue with the last exercise of the final assessment of the course “Getting Started with Accelerated Computing in Modern CUDA C++”.
The last exercise requires using cub::BlockReduce to compute average values for a coarse grid. Here is the code:
%%writefile Sources/coarse.cu
#include "dli.h"

__global__ void kernel(dli::temperature_grid_f fine,
                       dli::temperature_grid_f coarse) {
  //int coarse_row = blockIdx.x / coarse.extent(1);
  //int coarse_col = blockIdx.x % coarse.extent(1);
  //int row = threadIdx.x / dli::tile_size;
  //int col = threadIdx.x % dli::tile_size;
  //int fine_row = coarse_row * dli::tile_size + row;
  //int fine_col = coarse_col * dli::tile_size + col;
  //float thread_value = fine(fine_row, fine_col);

  // FIXME(Step 3):
  // Compute the sum of `thread_value` across threads of a thread block
  // using `cub::BlockReduce`
  //using BlockReduce = cub::BlockReduce<float, dli::block_threads>;
  // Allocate shared memory for BlockReduce
  //__shared__ BlockReduce::TempStorage temp_storage;
  //float block_sum = BlockReduce{temp_storage}.Sum(thread_value);
  //__syncthreads();

  // FIXME(Step 3):
  // `cub::BlockReduce` returns block sum in thread 0, make sure to write
  // result only from the first thread of the block
  //if (threadIdx.x == 0)
  //  coarse(coarse_row, coarse_col) = block_sum;
}

// Don't change the signature of this function
void coarse(dli::temperature_grid_f fine, dli::temperature_grid_f coarse) {
  kernel<<<coarse.size(), dli::block_threads>>>(fine, coarse);
}
It works correctly, but it is not fast enough to pass the assessment. As you can see, I commented out all the code inside the kernel to check whether an empty kernel reaches the throughput required by the exercise, and it does not! (image below)
The required throughput is more than 3.5 billion cells per second. What am I doing wrong? Do I need to modify something else? It is not clear from the instructions.
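For reference, the index mapping used in the kernel can be checked against a plain host-side loop. This is only a minimal sketch: it assumes row-major storage, uses a hypothetical `tile_size` of 2 in place of `dli::tile_size`, and uses `std::vector<float>` as a stand-in for `dli::temperature_grid_f`. The name `coarse_reference` is made up for illustration and is not part of `dli.h`. It divides each tile sum by the number of cells in the tile, following the description of the exercise as computing averages.

```cpp
#include <vector>

// Hypothetical stand-in for dli::tile_size; the real value lives in dli.h.
constexpr int tile_size = 2;

// Host-side reference: each coarse cell is the mean of one
// tile_size x tile_size tile of the fine grid.
// Assumption: both grids are stored row-major in flat vectors.
std::vector<float> coarse_reference(const std::vector<float>& fine,
                                    int coarse_rows, int coarse_cols) {
  const int fine_cols = coarse_cols * tile_size;
  std::vector<float> coarse(coarse_rows * coarse_cols);

  // One iteration per coarse cell, mirroring one thread block per cell.
  for (int block = 0; block < coarse_rows * coarse_cols; ++block) {
    // Same mapping the kernel applies to blockIdx.x
    const int coarse_row = block / coarse_cols;
    const int coarse_col = block % coarse_cols;

    float sum = 0.0f;
    // One iteration per fine cell in the tile, mirroring one thread per cell.
    for (int thread = 0; thread < tile_size * tile_size; ++thread) {
      // Same mapping the kernel applies to threadIdx.x
      const int row = thread / tile_size;
      const int col = thread % tile_size;
      const int fine_row = coarse_row * tile_size + row;
      const int fine_col = coarse_col * tile_size + col;
      sum += fine[fine_row * fine_cols + fine_col];
    }
    coarse[block] = sum / (tile_size * tile_size);
  }
  return coarse;
}
```

With a 4x4 fine grid holding the values 1 to 16 in row-major order, calling `coarse_reference(fine, 2, 2)` gives 3.5, 5.5, 11.5 and 13.5, which is what I would expect the kernel to produce for the same input if it stores averages.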
