Hello!

I am currently trying to design an efficient solution in CUDA to the following problem, it is, taking in account memory coalescing and shared memory bank conflicts, this kind of details you need to keep in mind to achieve a good speed in GPU. The simplified problem is the following, given several pairs of vectors (of float3), compute a function on the pairs of positions i and j of these pair of vectors. As confusing it sounds here is an idea of implementation in C (ish):

```
int CPUKernel(float3** first_vector, int* first_size, float3** second_vector, int* second_size, int total_vectors)
{
float res = 0;
for (int i = 0; i < total_vectors; ++i)
{
int sa = first_size[i];
int sb = second_size[i];
for (int j = 0; j < sa; ++j)
{
for (int k = 0; k < sb; ++k)
{
res += Function(first_vector[i][j], second_vector[i][k]);
}
}
}
return res;
}
```

In advance, I know that the size of each vector won’t be more than 1024, it is, the variable sa and sb will not be more than 2^10, and in general sa is much smaller than sb. Next is a sketch of what I have tried, I load the pairs in shared memory and do the computation there, I am using each block to load one pair of vectors, but I could not think on how can I maximize the usage of the threads in that block and also avoid bank conflicts, maybe I should design something different? :

```
__global__ void Kernel(float3** first_vector, int* first_size, float3** second_vector, int* second_size, int total_vectors)
{
__shared__ float res[THREADS];
__shared__ float3 shared_first[MAX_SHARED_SIZE]; // it is, 1024
__shared__ float3 shared_second[MAX_SHARED_SIZE]; // it is, 1024
res[threadIdx.x] = 0;
for (int b = blockIdx.x; b < total_vectors; b += BLOCKS)
{
int sa = first_size[b];
int sb = second_size[b];
// copy the cell content to shared memory
if (threadIdx.x < sa) shared_first[threadIdx.x] = first_vector[b][threadIdx.x];
if (threadIdx.x < sb) shared_second[threadIdx.x] = second_vector[b][threadIdx.x];
__syncthreads();
int total = sa*sb;
// bank conflicts everywhere here :/
for (int a = threadIdx.x; a < total; a += THREADS)
{
int i = a/sb;
int j = a%sb;
res[threadIdx.x] += Function(shared_first[i], shared_second[j]);
}
}
// reduce the res..
}
```

I really appreciate the help, I need different perspectives on this problem! Is there any know name for it that I could search for more answers? Sorry for the ignorance ;)

By the way, how can I use nvprof to check for global memory coalescing? (for shared memory bank conflicts I am using l1_shared_bank_conflict event).

Thanks!