Code takes 6x longer with twice as many elements

On my Titan V, with 512 blocks of 128 threads (count = 512 × 128 = 65536) this code takes about 480 ms; with 1024 blocks of 128 (count = 131072) it takes about 2880 ms. That’s 6 times longer!
Any ideas why it scales so poorly?

__global__ void ComputeClosest(float2* p0, int* n, int count){
    // one thread per point; n stores 8 candidate-neighbor indices per point
    const auto i = blockIdx.x * blockDim.x + threadIdx.x;
    for(auto j = 0; j < count; ++j){
        // squared distance from point i to candidate j (no sqrt needed)
        auto dx = p0[j].x - p0[i].x;
        auto dy = p0[j].y - p0[i].y;
        const auto ds = dx * dx + dy * dy;
#pragma unroll
        for(auto k = 0; k < 8; ++k){
            // replace the first stored neighbor that is farther away than j
            const auto point = p0[n[i*8 + k]];
            dx = point.x - p0[i].x;
            dy = point.y - p0[i].y;
            if(ds < dx * dx + dy * dy){
                n[i*8 + k] = j;
                break;
            }
        }
    }
}
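
For reference, a host-side sketch of the launch described above (d_p0 and d_n are assumed device buffers of count float2s and count*8 ints):

const int blocks = 512;                          // or 1024 for the slow case
const int threads = 128;
const int count = blocks * threads;              // 65536 points at 512 blocks
ComputeClosest<<<blocks, threads>>>(d_p0, d_n, count);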

You’re not specifying whether you’re also scaling the count variable.

Assuming i and j both go up to the total number of threads (= count), your problem scales with quadratic complexity, O(n²). Doubling the problem size should hence quadruple the run time. You’ve observed 6x the run time, which is 50% above that expectation.

I recommend profiling this kernel with both launch grid sizes, and looking at cache-related metrics in particular. It is to be expected that the L2 cache efficiency is lower with the bigger data set. Check the L2 cache size on your GPU model; bigger is better (e.g. 2.75 MB on the GTX 1080 Ti vs. 4 MB on the RTX 2080 vs. 5.5 MB on the RTX 2080 Ti).

Use of shared memory might be a way to save memory accesses to the p0 array. Try to tile the problem and load each tile into shared memory, as in the sketch below. This is very similar to optimizing the n-body problem; look at the related CUDA code samples.
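
Here is a minimal sketch of that tiling idea, assuming blockDim.x equals the tile size and count is a multiple of it (both hold for your launch configurations); your 8-slot neighbor update is kept unchanged:

constexpr int TILE = 128;   // assumed equal to blockDim.x

__global__ void ComputeClosestTiled(const float2* p0, int* n, int count){
    __shared__ float2 tile[TILE];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const float2 pi = p0[i];                         // read own point once

    for(int base = 0; base < count; base += TILE){
        tile[threadIdx.x] = p0[base + threadIdx.x];  // coalesced staging load
        __syncthreads();                             // tile fully in shared memory

        for(int t = 0; t < TILE; ++t){
            const int j = base + t;
            float dx = tile[t].x - pi.x;             // served from shared memory
            float dy = tile[t].y - pi.y;
            const float ds = dx * dx + dy * dy;
            for(int k = 0; k < 8; ++k){              // same 8-slot update as before
                const float2 point = p0[n[i*8 + k]];
                dx = point.x - pi.x;
                dy = point.y - pi.y;
                if(ds < dx * dx + dy * dy){
                    n[i*8 + k] = j;
                    break;
                }
            }
        }
        __syncthreads();                             // before overwriting the tile
    }
}

Note that the p0[n[i*8 + k]] lookups still go to global memory; keeping the eight current neighbor distances in per-thread registers instead of re-reading the points would remove that traffic as well.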

It computes against the entire array of count points, and you are right about the cache: at 1024 blocks, p0 + n together grow past the 4.5 MB of L2 cache the Titan V has.
256 blocks of 128 takes 100 ms and 128 blocks takes 28 ms, so the smaller sizes scale much better, but I want far more elements without it taking all day.
I noticed the tensor cores do matrix multiplies like I’m doing here. Could I use the tensor cores on my Titan V to speed it up?

The tensor cores only compute half precision.

Besides, you are limited by global memory throughput, not compute speed. There are still options to achieve near-optimal scaling; blockwise loading into shared memory, as sketched above, may help.

Actually, I can live with half precision in this particular application, and I don’t need the square root.

It’s not at all obvious to me that you are doing a matrix multiply (either in the linear algebra sense of the term, or in the sense that applies to tensor core usage, which is the linear algebra sense).

Is it not dx * dx + dy * dy?

It is not.

https://en.wikipedia.org/wiki/Matrix_multiplication

I don’t want to seem ungrateful, but so far this forum has got me nowhere in this project. You say things like “Try to tile the problem” with no example of how to do so. Do I have to pay for code samples?

The CUDA code samples are a part of the CUDA toolkit installer. It’s an optional install.

The n-body code sample is very similar in that it interacts every particle with every other particle, and it shows how to do that memory-efficiently. It does a bit more than you need by computing forces based on distance, while you’re mostly interested in minimum distance.

Also, googling for “cuda n-body sample” would have provided relevant information.

Like this article (old but still relevant): https://developer.nvidia.com/gpugems/gpugems3/part-v-physics-simulation/chapter-31-fast-n-body-simulation-cuda

I can’t use code that I can’t understand.

Be willing to learn, in particular how to apply tiling schemes to memory-bandwidth-constrained problems.

Matrix multiplication has very similar constraints, and the solution is to load tiles of data into shared memory before accessing the elements in the tile multiple times.

Maybe get an introductory book about CUDA; you really need to understand the memory hierarchy of the device. The use of shared memory is often taught in the context of matrix multiplication, as in the sketch below. That knowledge then transfers well to your problem.
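
For reference, the textbook tiled pattern looks roughly like this for C = A × B with square N×N matrices (N assumed to be a multiple of TILE, launched with dim3(TILE, TILE) thread blocks; this illustrates the general technique, not your specific problem):

constexpr int TILE = 16;

__global__ void MatMulTiled(const float* A, const float* B, float* C, int N){
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for(int base = 0; base < N; base += TILE){
        // each thread loads one element of each tile from global memory, once
        As[threadIdx.y][threadIdx.x] = A[row * N + base + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(base + threadIdx.y) * N + col];
        __syncthreads();

        // every element loaded above is reused TILE times from shared memory
        for(int t = 0; t < TILE; ++t)
            acc += As[threadIdx.y][t] * Bs[t][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

Each element fetched from global memory is reused TILE times out of shared memory; that reuse is exactly what your p0 accesses would benefit from.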

My Asperger’s makes reading near impossible; my brain gets stuck in loops. I just can’t learn from looking at complex algorithms and lengthy explanations. I really want to make use of CUDA, but I’m losing the will to live.

If you’re lucky, you will find online videos and webinars covering this or related topics.