I have some pretty efficient code for pairwise distance matrix computation (using PyCUDA):
__global__ void mb_kern(double *res, double *m_in, int tpb_x, int tpb_y, int *ts)
{
    const uint tx = threadIdx.x;
    const uint ty = threadIdx.y;

    const uint bx = blockIdx.x;
    const uint by = blockIdx.y;

    int i, j, k;
    int t = by;

    // split the 200 i-points across the tpb_x threads; the last thread takes the remainder
    int i_tn = int(200 / tpb_x);
    int i_ts = ts[bx * 20 + t] + i_tn * tx;
    if (tx == tpb_x - 1) i_tn += int(200 % tpb_x);

    // likewise for the 300 j-points across the tpb_y threads
    int j_tn = int(300 / tpb_y);
    int j_ts = ts[bx * 20 + t] + j_tn * ty;
    if (ty == tpb_y - 1) j_tn += int(300 % tpb_y);

    double aux = 0.0;
    double d = 0.0;
    for (i = i_ts; i < (i_ts + i_tn); i++) {
        for (j = j_ts; j < (j_ts + j_tn); j++) {
            for (k = 0; k < 10; k++) {
                d = m_in[i * 10 + k] - m_in[j * 10 + k];
                aux += __expf(-d * d); // __expf is the fast single-precision exponential
            }
        }
    }

    res[((bx * 20 + by) * tpb_x + tx) * tpb_y + ty] = aux;
}
The code performs best when I set the tuning parameters to tpb_x = 8 and tpb_y = 8.
Unfortunately, even with those parameters it is still too slow for my purposes.
Is there a way to improve the speed of the above code using e.g. shared memory?
You could make the code more readable for people reading the question by adding indentation.
As for possible improvements, you have nested for loops in your code. I’m not sure how much the compiler already does to help out here, but if you look at [url]https://www.amazon.com/Professional-CUDA-Programming-John-Cheng/dp/1118739329[/url] on page 114, they explain how you can “unroll” a for loop by simply writing out the iterations yourself instead of paying for a loop-condition check on every pass.
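For example, the innermost loop in your kernel has a fixed trip count of 10, so you can either ask the compiler to unroll it with a pragma or write the unrolling out by hand. Here is a minimal, untested sketch of both variants, reusing the variables from your kernel:

    // variant 1: let the compiler unroll the fixed-trip-count loop
    #pragma unroll
    for (k = 0; k < 10; k++) {
        d = m_in[i * 10 + k] - m_in[j * 10 + k];
        aux += __expf(-d * d);
    }

    // variant 2: manual unrolling by a factor of 2 (same arithmetic, half the loop overhead)
    for (k = 0; k < 10; k += 2) {
        double d0 = m_in[i * 10 + k]     - m_in[j * 10 + k];
        double d1 = m_in[i * 10 + k + 1] - m_in[j * 10 + k + 1];
        aux += __expf(-d0 * d0) + __expf(-d1 * d1);
    }

Either way, most of the loop-condition checks and induction-variable updates disappear from the generated code.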
There’s also the issue of launch configuration, since a lot is determined by your grid and block sizes.
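For reference, a launch matching your kernel’s indexing would look like this in CUDA C syntax (n_sets, d_res, d_m_in and d_ts are placeholder names for your grid extent and device pointers; in PyCUDA you would pass the equivalent block and grid tuples):

    dim3 block(8, 8);          // tpb_x = tpb_y = 8, i.e. 64 threads per block
    dim3 grid(n_sets, 20);     // by runs over the 20 tiles indexed via ts[bx * 20 + by]
    mb_kern<<<grid, block>>>(d_res, d_m_in, 8, 8, d_ts);

Note that 64 threads per block is on the small side, so it is worth experimenting with larger blocks for occupancy reasons.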
Since the aux variable is a sum of sums, you could also parallelize the summation itself by designating threads specifically to adding up partial results, a technique known as parallel reduction. More on this can be read in the same book, page 104.
If you’re not interested in buying the book, simply Googling “loop unrolling” or “parallel reduction” should give you some answers; a rough sketch of a reduction follows below.
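For completeness, here is an untested sketch of the classic shared-memory reduction (the kernel name and the 256-thread block size are just assumptions for illustration):

    __global__ void block_sum(const double *in, double *out, int n)
    {
        __shared__ double sdata[256];        // one partial-sum slot per thread
        uint tid = threadIdx.x;
        uint i = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (i < n) ? in[i] : 0.0;  // load one element, or 0 past the end
        __syncthreads();

        // tree reduction: halve the number of active threads each step
        for (uint s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            out[blockIdx.x] = sdata[0];      // one partial sum per block
    }

You then either launch it again on the per-block partial sums or finish the small remainder on the host.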
Below is a piece of code that has helped me a lot. In this case I’ve assumed the length of the vectors and the width and height of the thread block to all be 16; you will need some modifications if that is not the case. Note that tx is used for two purposes: first to index the elements during the load into shared memory, and then as an index into point set 1. Also note that I’ve added a dummy column in the shared memory to avoid bank conflicts when the data is loaded. I’ve wrapped the snippet in a skeleton kernel here (the kernel name and signature are just placeholders), and the pointer arithmetic for p1 and p2 assumes row-major storage of the points; adapt it to however your data is actually laid out.
__global__ void tile_kern(float *res, const float *set1, const float *set2)
{
    const uint tx = threadIdx.x;
    const uint ty = threadIdx.y;

    __shared__ float points1[17 * 16]; // points stored in columns
    __shared__ float points2[17 * 16]; // one extra column to avoid shared memory bank conflicts

    // one point per thread row, assuming row-major storage of 16-element points
    const float *p1 = set1 + (blockIdx.x * 16 + ty) * 16; // point in set 1 given by bx and ty
    const float *p2 = set2 + (blockIdx.y * 16 + ty) * 16; // point in set 2 given by by and ty

    points1[17 * tx + ty] = p1[tx]; // load and transpose, tx used as element index
    points2[17 * tx + ty] = p2[tx];
    __syncthreads();

    float sum = 0.0f;
    for (int j = 0; j < 16; j++)
        sum += points1[17 * j + tx] * points2[17 * j + ty]; // tx used as a point set 1 index
    res[...] = sum; // output index left elided, as the layout is up to you
}