# Code takes 6x longer with twice the elements

On my Titan V, launching 512 blocks of 128 threads (with `count` equal to their product), this code takes about 480 ms. With 1024 blocks of 128 it takes about 2880 ms, which is 6 times longer!
Any ideas why it scales so poorly?

```cuda
__global__ void ComputeClosest(float2* p0, int* n, int count)
{
    const auto i = blockIdx.x * blockDim.x + threadIdx.x;

    for (auto j = 0; j < count; ++j) {
        // squared distance from point i to candidate j
        auto dx = p0[j].x - p0[i].x;
        auto dy = p0[j].y - p0[i].y;
        const auto ds = dx * dx + dy * dy;

        // replace the first of the 8 stored neighbours that is farther away
        #pragma unroll
        for (auto k = 0; k < 8; ++k) {
            const auto point = p0[n[i * 8 + k]];
            dx = point.x - p0[i].x;
            dy = point.y - p0[i].y;
            if (ds < dx * dx + dy * dy) {
                n[i * 8 + k] = j;
                break;
            }
        }
    }
}
```

You don’t say whether you’re also scaling the `count` variable.

Assuming i and j both go up to the total number of threads (= `count`), your problem scales with quadratic complexity, O(n²). Doubling the problem size should therefore quadruple the run time. You’ve observed 6x the run time, which is 50% above expectation.

I recommend profiling this kernel with both launch grid sizes, and looking at cache-related metrics in particular. It is to be expected that L2 cache efficiency is lower with the bigger data set. Check the cache size on your GPU model; bigger is better (e.g. 2.75 MB on the 1080 Ti vs. 4 MB on the RTX 2080 vs. 5.5 MB on the RTX 2080 Ti).

Use of shared memory might save memory accesses to the p0 array. Try tiling the problem and loading each tile into shared memory. This is very similar to optimizing the n-body problem; look at the related CUDA code samples.
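To illustrate what tiling means here, below is a minimal, untested sketch of such a kernel. It assumes `count` is a multiple of the block size and keeps your 8-slot neighbour array unchanged; `TILE` and `ComputeClosestTiled` are hypothetical names, not from any sample:

```cuda
// Sketch only: each block stages TILE points of p0 into shared memory,
// so the inner j-loop reads shared memory instead of going through L2/DRAM.
#define TILE 128  // assumed equal to blockDim.x and to divide count evenly

__global__ void ComputeClosestTiled(const float2* p0, int* n, int count)
{
    __shared__ float2 tile[TILE];

    const int    i  = blockIdx.x * blockDim.x + threadIdx.x;
    const float2 pi = p0[i];  // each thread's own point, read once

    for (int base = 0; base < count; base += TILE) {
        // cooperatively load one tile of candidate points
        tile[threadIdx.x] = p0[base + threadIdx.x];
        __syncthreads();

        for (int t = 0; t < TILE; ++t) {
            const int   j  = base + t;
            const float dx = tile[t].x - pi.x;
            const float dy = tile[t].y - pi.y;
            const float ds = dx * dx + dy * dy;

            // same 8-slot update as the original kernel
            for (int k = 0; k < 8; ++k) {
                const float2 pk = p0[n[i * 8 + k]];
                const float  ex = pk.x - pi.x;
                const float  ey = pk.y - pi.y;
                if (ds < ex * ex + ey * ey) {
                    n[i * 8 + k] = j;
                    break;
                }
            }
        }
        __syncthreads();  // don't overwrite the tile while others still read it
    }
}
```

Note that the inner k-loop still does scattered global reads through `p0[n[...]]`; only the sequential j-traversal is served from shared memory. Keeping each thread’s 8 current neighbour distances in registers instead would remove those reads as well.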

It computes over the entire array of `count` elements. You’re right about the cache: at 1024 blocks, p0 plus n together exceed the 4.5 MB of L2 cache the Titan V has.
256 blocks of 128 takes 100 ms and 128 blocks takes 28 ms, so the smaller sizes scale much better, but I want many more elements without it taking all day.
I noticed the tensor cores do matrix multiplies like I’m doing here. Could I use the tensor cores on my Titan V to speed it up?

The tensor cores only compute in half precision.

Besides, you are limited by global memory throughput, not compute speed. There are still options for achieving near-optimal scaling; blockwise loading into shared memory may help.

Actually, I can live with half precision in this particular application, and I don’t need the square root.

It’s not at all obvious to me that you are doing a matrix multiply (either in the linear algebra sense of the term, or in the sense that applies to tensor core usage, which is the linear algebra sense).

Is it not dx * dx + dy * dy?
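As an editorial aside, a standard identity (not from this thread) may clarify the disagreement: computing `dx * dx + dy * dy` one pair at a time is not a matrix multiply, but all pairwise squared distances at once can be arranged so that the dominant term is one:

```latex
\|p_i - p_j\|^2 = \|p_i\|^2 + \|p_j\|^2 - 2\,p_i \cdot p_j
```

Stacking the points as rows of a matrix P with n rows and 2 columns, the cross terms p_i · p_j for every pair form the Gram matrix P Pᵀ, which is a matrix multiply; the squared norms are cheap per-row corrections. With only 2 columns, however, that multiply is far too skinny to use tensor cores efficiently.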

I don’t want to seem ungrateful, but so far this forum has gotten me nowhere with this project. You say things like “try to tile the problem” with no example of how to do so. Do I have to pay for code samples?

The CUDA code samples are part of the CUDA Toolkit installer. It’s an optional install.

The n-body code sample is very similar in that it interacts every particle with every other particle, and it shows how to do that memory-efficiently. It does a bit more than you need by computing forces based on distance, while you’re mostly interested in minimum distance.

Also, googling for “cuda n-body sample” would have provided relevant information.