Hi all,

I need help understanding the results of a benchmark; they are not what we expected based on our naive understanding of the hardware.

The problem is quite simple: it’s a matrix (living in device memory) whose elements need to be calculated by a kernel. The matrix is hollow (A(i,i) = 0) and symmetric (A(i,j) = A(j,i)), so in principle it is sufficient to launch N(N-1)/2 threads to account for every element. What we tested is how much performance we can gain by taking advantage of this symmetry. Instead of having every thread map trivially to an element of the matrix (i.e. thread i maps to index i), we need a more complicated map to calculate indices in the lower (or upper) triangle. We wanted to know whether this extra overhead (involving a square-root computation) is acceptable.

We created two versions of the same kernel and ran a benchmark on the full NxN matrix (N = 418), and found the following:

In the above image, 1/runtime (so higher is better) is plotted against #threads/#elements (so at the right boundary, the number of threads is equal to the number of elements).

The symmetric version outperforms the naive version over the entire range of thread counts. This must mean that the resources of our setup (a Tesla M2075) are such that these numbers of threads occupy the hardware completely, and that it therefore makes sense to launch fewer threads… right?

So how about smaller matrices? Like, ridiculously small matrices. We projected that for a matrix of, say, N = 10 (10^2 = 100 elements), every thread should be able to run truly in parallel in hardware. Every element can then be calculated at the same time, and there is no merit in going through the trouble of the symmetric version (with its more complicated mapping). However, this is not what we find:

Even for such small matrices (this is N = 10, but the same holds even for N = 3), the symmetric version is still faster. How is this possible?

Below are what I think are the relevant pieces of code. I have tried to keep it as short as possible.

Naive kernel:

```
__global__ void constructAdjacency(int *matrix, Line const *tracks, size_t n, double thresh)
{
    size_t nThreads = blockDim.x * gridDim.x;
    size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
    size_t const N = n * n;

    size_t idx = tid;
    while (idx < N)
    {
        size_t row = idx / n;
        size_t col = idx % n;
        matrix[idx] = (row != col)
            ? (distance(tracks[row], tracks[col]) < thresh)
            : 0;
        idx += nThreads;
    }
}
```

Symmetric version (spot the differences):

```
__global__ void Symm::constructAdjacency(int *matrix, Line const *tracks, size_t n, double thresh)
{
    size_t nThreads = blockDim.x * gridDim.x;
    size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
    size_t N = n * (n - 1) / 2;

    size_t idx = tid;
    while (idx < N)
    {
        // invert idx = row * (row - 1) / 2 + col, with col < row
        size_t row = (sqrt(1.0 + 8 * idx) + 1) / 2;
        size_t col = idx - row * (row - 1) / 2;
        matrix[row * n + col] = matrix[col * n + row] =
            (row != col)
                ? (distance(tracks[row], tracks[col]) < thresh)
                : 0;
        idx += nThreads;
    }
}
```

Benchmarking code (after allocation, etc.):

```
size_t const N = n * n;
vector<tuple<int, double, double>> timings;
int const maxThreads = atoi(argv[4]);   // number of threads per block

for (int totalThreads = 1; totalThreads <= N; ++totalThreads)
{
    cout << totalThreads << " / " << N << '\n';

    // calculate the number of blocks needed to house all threads
    int threads = std::min(totalThreads, maxThreads);
    int blocks = (totalThreads + threads - 1) / threads;

    // time and repetitions
    double t1 = 0;
    double t2 = 0;
    size_t const rep = 1000;
    for (size_t count = 0; count != rep; ++count)
    {
        // Naive calculation
        cudaMemset(dev_matrix, 0, n * n * sizeof(int));
        t1 += measure("Construct Matrix on the device", [&]() {
            constructAdjacency<<<blocks, threads>>>(dev_matrix, dev_tracks, n, thresh);
        }, true);
        if (count == 0)
            cudaMemcpy(&host_matrix1[0], dev_matrix, n * n * sizeof(int), cudaMemcpyDeviceToHost);

        // Symmetric calculation
        cudaMemset(dev_matrix, 0, n * n * sizeof(int));
        t2 += measure("Construct Symmetrical Matrix on the device", [&]() {
            Symm::constructAdjacency<<<blocks, threads>>>(dev_matrix, dev_tracks, n, thresh);
        }, true);
        if (count == 0)
            cudaMemcpy(&host_matrix2[0], dev_matrix, n * n * sizeof(int), cudaMemcpyDeviceToHost);

        // Compare results (once)
        if (count == 0)
        {
            auto cmp = std::mismatch(host_matrix1.begin(), host_matrix1.end(), host_matrix2.begin());
            if (cmp.first != host_matrix1.end())    // mismatch found; dereferencing end() would be UB
            {
                std::cerr << "Result mismatch...\n";
                return 1;
            }
        }
    }
    timings.push_back(std::make_tuple(totalThreads, t1 / rep, t2 / rep));
}
```

The measure-function used above:

```
template <typename Function>
double measure(std::string const &msg, Function &&fun, bool silent = false)
{
    float elapsed = 0;
    if (!silent)
        cout << msg << " ... " << flush;    // inform what's being done

    cudaEventRecord(start, 0);
    fun();
    cudaEventRecord(done, 0);
    cudaEventSynchronize(done);
    cudaEventElapsedTime(&elapsed, start, done);

    if (!silent)
        std::cout << "done (" << elapsed << " ms)\n"
                  << "Error message: " << cudaGetErrorString(cudaGetLastError()) << "\n\n"
                  << std::flush;
    return elapsed;
}
```