Understanding benchmark results

Hi all,

I need help understanding the results of a benchmark, which do not match what we expected based on our naive understanding of the hardware.

The problem is quite simple: it’s a matrix (living in device memory) whose elements need to be calculated by a kernel. The matrix is hollow (A(i,i) = 0) and symmetric (A(i,j) = A(j,i)), so in principle it is sufficient to launch N(N-1)/2 threads to account for every element. What we tested is how much performance we can gain by taking advantage of the symmetric nature of this matrix. Instead of having every thread map trivially to an element of the matrix (i.e. thread i maps to index i), we need a more complicated map to calculate indices in the lower (or upper) triangle. We wanted to know whether this extra overhead (involving a square-root computation) is acceptable.
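
For concreteness, the map (used in the symmetric kernel below) inverts k = row(row-1)/2 + col, giving row = floor((1 + sqrt(1 + 8k)) / 2) and col = k - row(row-1)/2. A minimal host-side sanity check of just the mapping (standalone and illustrative only, not part of the benchmark code):

#include <cmath>
#include <cstdio>

int main()
{
    size_t n = 5;
    for (size_t k = 0; k != n * (n - 1) / 2; ++k)
    {
        // truncation toward zero acts as floor() here, since the value is positive
        size_t row = (std::sqrt(1.0 + 8 * k) + 1) / 2;
        size_t col = k - row * (row - 1) / 2;
        std::printf("k = %zu -> (row %zu, col %zu)\n", k, row, col);
    }
}

For n = 5 this prints all 10 (row, col) pairs with row > col exactly once.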

We created two versions of the same kernel and ran a benchmark on the full NxN matrix (N = 418), and found the following:
[Plot: benchmark results for N = 418]
In the above image, 1/runtime (so higher is better) is plotted against #threads/#elements (so at the right boundary, the number of threads equals the number of elements).

The symmetric version outperforms the naive version over the entire range of thread counts. This must mean that the resources of our setup (Tesla M2075) are such that these numbers of threads already occupy the hardware completely, so it makes sense to launch fewer of them… right?

So how about smaller matrices? Like, ridiculously small matrices. We reasoned that, for a matrix of say N = 10 (10^2 = 100 elements), every thread should be able to run truly in parallel in hardware. Every element can then be calculated at the same time, and there should be no merit in going through the trouble of the symmetric version (with its more complicated mapping). However, this is not what we find:
[Plot: benchmark results for N = 10]

Even for such small matrices (this is N = 10, but the same holds even for N = 3), the symmetric version remains faster. How is this possible?

Below are what I think are the relevant pieces of code; I have tried to keep them as short as possible.

Naive kernel:

__global__ void constructAdjacency(int *matrix, Line const *tracks, size_t n, double thresh)
{
    size_t nThreads = blockDim.x * gridDim.x;
    size_t tid = threadIdx.x + blockIdx.x * blockDim.x;

    size_t const N = n * n;
    size_t idx = tid;
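    // grid-stride loop: thread tid handles elements tid, tid + nThreads, tid + 2 * nThreads, ...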
    while (idx < N)
    {
        size_t row = idx / n;
        size_t col = idx % n;

        matrix[idx] = (row != col)
            ? (distance(tracks[row], tracks[col])  < thresh)
            : 0;

        idx += nThreads;
    }
}

Symmetric version (spot the differences):

__global__ void Symm::constructAdjacency(int *matrix, Line const *tracks, size_t n, double thresh)
{
    size_t nThreads = blockDim.x * gridDim.x;
    size_t tid = threadIdx.x + blockIdx.x * blockDim.x;

    size_t N = n * (n - 1) / 2;
    size_t idx = tid;
    while (idx < N)
    {
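        // invert idx = row * (row - 1) / 2 + col to recover (row, col) in the strict lower triangle (row > col)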
        size_t row = (sqrt(1.0 + 8 * idx) + 1) / 2;
        size_t col = idx - row * (row - 1) / 2;

        matrix[row * n + col] = matrix[col * n + row] =
            (row != col)
            ? (distance(tracks[row], tracks[col])  < thresh)
            : 0;

        idx += nThreads;
    }
}
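
One caveat about the float-based inversion (we haven't observed it at these sizes, but for large idx the sqrt can round across an integer boundary): a cheap guard that nudges row back into range would look like this (sketch, not in our benchmark):

size_t row = (sqrt(1.0 + 8.0 * idx) + 1) / 2;
while (row * (row - 1) / 2 > idx)      // sqrt rounded up: first index of this row lies past idx
    --row;
while ((row + 1) * row / 2 <= idx)     // sqrt rounded down: idx belongs to a later row
    ++row;
size_t col = idx - row * (row - 1) / 2;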

Benchmarking code (after allocation etc.):

size_t const N = n * n;
vector<tuple<int, double, double>> timings;
int const maxThreads = atoi(argv[4]); // number of threads per block                                       

for (int totalThreads = 1; totalThreads <= N; ++totalThreads)
{
    cout << totalThreads << " / " << N << '\n';

    // calculate the number of blocks needed to house all threads
    int threads = std::min(totalThreads, maxThreads);
    int blocks = (totalThreads + threads - 1) / threads; 

    // time and repetitions
    double t1 = 0;
    double t2 = 0;
    size_t rep = 1000;

    for (size_t count = 0; count != rep; ++count)
    {
        // Naive calculation
        cudaMemset(dev_matrix, 0, n * n * sizeof(int));
        t1 += measure("Construct Matrix on the device", [&]() {
                constructAdjacency<<< blocks, threads >>>(dev_matrix, dev_tracks, n, thresh);
            }, true);
        if (count == 0)
            cudaMemcpy(&host_matrix1[0], dev_matrix, n * n * sizeof(int), cudaMemcpyDeviceToHost);

        // Symmetric calculation
        cudaMemset(dev_matrix, 0, n * n * sizeof(int));
        t2 += measure("Construct Symmetrical Matrix on the device", [&]() {
                Symm::constructAdjacency<<< blocks, threads >>>(dev_matrix, dev_tracks, n, thresh);
            }, true);
        if (count == 0)
            cudaMemcpy(&host_matrix2[0], dev_matrix, n * n * sizeof(int), cudaMemcpyDeviceToHost);

        // Compare results (once)
        if (count == 0)
        {
            auto cmp = std::mismatch(host_matrix1.begin(), host_matrix1.end(), host_matrix2.begin());
            if (cmp.first != host_matrix1.end()) // equal ranges return end(); dereferencing that is UB
            {
                std::cerr << "Result mismatch...\n";
                return 1;
            }
        }
    }

    timings.push_back(std::make_tuple(totalThreads, t1 / rep , t2 / rep));
}

The measure function used above:

template <typename Function>
double measure(std::string const &msg, Function&& fun, bool silent = false)
{
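    // assumes 'start' and 'done' are cudaEvent_t objects created elsewhere with cudaEventCreate()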
    float elapsed = 0;

    if (!silent)
        cout << msg << " ... " << flush; // inform what's being done                                           
    cudaEventRecord(start, 0);
    fun();
    cudaEventRecord(done, 0);
    cudaEventSynchronize(done);
    cudaEventElapsedTime(&elapsed, start, done);

    if (!silent)
        std::cout << "done (" << elapsed << " ms)\n"
                  << "Error message: " << cudaGetErrorString(cudaGetLastError()) << "\n\n" << std::flush;

    return elapsed;
}

Without having looked at your code:
The triangular/symmetric version still only needs half the memory accesses. Even if all computation can run truly in parallel at small sizes, the memory traffic is competing for the same few memory channels (at most 6, I believe, depending on which GPU you use).

Thanks for your response. However, the symmetric version writes to just as many memory locations:

matrix[row * n + col] = matrix[col * n + row] = /* etc */;

What does the distance() function look like? If it’s very memory-intensive, tera’s explanation still applies.

Hm, interesting… I hadn’t thought of that. I’m passing two elements of the tracks array to the distance() function. The function itself isn’t memory-intensive, but of course the symmetric version needs only half the accesses to this array. This might be it… I’ll see what happens when I replace this function with a dummy that doesn’t need to go to global memory.
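
Something along these lines, perhaps (a hypothetical compute-only stand-in, deriving its inputs from the indices instead of reading global memory):

__device__ double dummyDistance(size_t row, size_t col)
{
    // same flavor of arithmetic as distance(), but no loads from the tracks array
    double d = static_cast<double>(row) - static_cast<double>(col);
    return fabs(d * rsqrt(d * d + 1.0));
}

If the gap between the two kernels closes with this in place of distance(), the extra loads from tracks would indeed be the culprit.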

Thanks!

PS. This is what distance() looks like:

__device__ double distance(Line const &line1, Line const &line2)
{
    // (b1 - b2)
    double x[3] = 
    {
        line1.b[0] - line2.b[0],
        line1.b[1] - line2.b[1],
        line1.b[2] - line2.b[2]
    };
                   
    // (a1 x a2)
    double y[3] = 
    {
        line1.a[1] * line2.a[2] - line1.a[2] * line2.a[1],
        line1.a[2] * line2.a[0] - line1.a[0] * line2.a[2],
        line1.a[0] * line2.a[1] - line1.a[1] * line2.a[0]
    };

    // (b1 - b2) . (a1 x a2)
    double dotxy = x[0] * y[0] + x[1] * y[1] + x[2] * y[2];
    double normy = sqrt(y[0] * y[0] + y[1] * y[1] + y[2] * y[2]);
    
    // d_{12}
    return fabs(dotxy / normy); // fabs, not abs: the integer abs() would truncate the double
}

Side remark: Consider using rnorm3d() in your distance calculation; it computes the reciprocal 3D norm 1/sqrt(y[0]^2 + y[1]^2 + y[2]^2) in one call, replacing the explicit sqrt() and division:

return fabs(dotxy * rnorm3d(y[0], y[1], y[2]));