I have ~1e7 threads that all work with data from the same array, which itself can be very large (~1e8 elements). So I'm wondering whether the following approach, staging the array through shared memory in chunks, is sensible?
// g_xs - data sites (n elements, n ~ 1e8)
// g_ys - function values (n elements)
// g_xi - the points where the function should be interpolated (p elements, p ~ 1e7)
// g_fi - the interpolated function values (p elements)
__global__ void interpolate(const float *g_xs, const float *g_ys,
                            const float *g_xi, float *g_fi,
                            unsigned int n, unsigned int p)
{
    // one tile of the data sites and function values, staged in shared memory
    __shared__ float s_xs[BATCH_SIZE];
    __shared__ float s_ys[BATCH_SIZE];

    // integer ceiling of n / BATCH_SIZE (ceil(n / BATCH_SIZE) would be wrong,
    // since n / BATCH_SIZE is integer division and truncates first)
    unsigned int numBatches = (n + BATCH_SIZE - 1) / BATCH_SIZE;
    for (unsigned int i = 0; i < numBatches; i++)
    {
        // cooperatively copy one chunk from global memory into shared memory
        __syncthreads();   // wait until the whole tile is loaded
        // accumulate this tile's contribution to the partial sums
        __syncthreads();   // everyone is done with the tile before it gets overwritten
    }
    // write the result to global memory
}
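For concreteness, here is a fuller sketch of what I have in mind. The weighting scheme is just a placeholder (plain inverse-distance weighting); the real interpolation would go where the weight w is computed. I'm also assuming BATCH_SIZE equals the block size, so each thread loads exactly one element of the tile per iteration:

#define BATCH_SIZE 256   // assumed equal to blockDim.x

__global__ void interpolate(const float *g_xs, const float *g_ys,
                            const float *g_xi, float *g_fi,
                            unsigned int n, unsigned int p)
{
    __shared__ float s_xs[BATCH_SIZE];
    __shared__ float s_ys[BATCH_SIZE];

    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float xi  = (tid < p) ? g_xi[tid] : 0.0f;
    float num = 0.0f, den = 0.0f;   // running sums for the weighted average

    unsigned int numBatches = (n + BATCH_SIZE - 1) / BATCH_SIZE;
    for (unsigned int i = 0; i < numBatches; i++)
    {
        // cooperative load: each thread fetches one element of the tile
        unsigned int j = i * BATCH_SIZE + threadIdx.x;
        if (j < n) {
            s_xs[threadIdx.x] = g_xs[j];
            s_ys[threadIdx.x] = g_ys[j];
        }
        __syncthreads();   // tile fully loaded before anyone reads it

        // the last tile may be partial, so clamp its size
        unsigned int tileSize = min((unsigned int)BATCH_SIZE, n - i * BATCH_SIZE);
        if (tid < p) {
            for (unsigned int k = 0; k < tileSize; k++) {
                float d = fabsf(xi - s_xs[k]) + 1e-8f;   // avoid division by zero
                float w = 1.0f / d;                       // placeholder weight
                num += w * s_ys[k];
                den += w;
            }
        }
        __syncthreads();   // tile no longer needed before it gets overwritten
    }

    if (tid < p)
        g_fi[tid] = num / den;
}

// launched with one thread per interpolation point, e.g.:
// interpolate<<<(p + BATCH_SIZE - 1) / BATCH_SIZE, BATCH_SIZE>>>(g_xs, g_ys, g_xi, g_fi, n, p);

Note that out-of-range threads (tid >= p) still have to participate in the loads and hit both __syncthreads() calls, which is why only the accumulation and the final store are guarded by tid < p.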
Thank you!