I have ~1e7 threads that all work with data from the same array, which can itself be very large (~1e8 elements). So I'm wondering: is the following approach sensible?
// s_xs - data sites (n elements, n ~ 1e8)
// s_ys - function values (n elements)
// g_xi - the points where the function should be interpolated (p elements, p ~ 1e7)
// g_fi - the interpolated function values (p elements)
__global__ void interpolate(float *g_xi, float *g_fi, unsigned int n, unsigned int p)
{
    // data sites and the function values, staged one chunk at a time
    __shared__ float s_xs[BATCH_SIZE];
    __shared__ float s_ys[BATCH_SIZE];

    for (int i = 0; i < (n + BATCH_SIZE - 1) / BATCH_SIZE; i++) {
        // copy a chunk from the global memory into shared memory
        // start collecting partial sums
    }
    // write the result to global memory
}
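For completeness, I was planning to launch one thread per interpolation point, roughly like this (the block size is just a placeholder, nothing tuned yet):

unsigned int blockSize = 256;
unsigned int gridSize  = (p + blockSize - 1) / blockSize;   // ~1e7 points / 256 threads per block
interpolate<<<gridSize, blockSize>>>(g_xi, g_fi, n, p);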
From the comments you have given, what I understood is that you are expecting s_xs and s_ys (shared memory) to be of size n.
But in your CUDA kernel you have only given:
__global__ void interpolate(float *g_xi, float *g_fi, unsigned int n, unsigned int p)
There are some mismatches, right?
__shared__ float s_xs[BATCH_SIZE]; is not allocating memory for each thread. It allocates shared memory for the whole block. For copying the values from global memory into shared memory, you have to distribute the work among the threads of the block, as in the sketch below.
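A minimal sketch of that cooperative copy inside your batch loop, assuming the block size equals BATCH_SIZE and d_xs/d_ys are the global-memory arrays (illustrative names, not from your post):

__shared__ float s_xs[BATCH_SIZE];
__shared__ float s_ys[BATCH_SIZE];

unsigned int base = i * BATCH_SIZE;        // start of the current chunk
unsigned int idx  = base + threadIdx.x;    // each thread copies one element
if (idx < n) {
    s_xs[threadIdx.x] = d_xs[idx];         // d_xs/d_ys: hypothetical global arrays of data sites / values
    s_ys[threadIdx.x] = d_ys[idx];
}
__syncthreads();    // wait until the whole chunk is staged in shared memory
// ... every thread in the block can now read this chunk ...
__syncthreads();    // before the next iteration overwrites it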
Also, if you need shared memory of size n (in your case 1e8) for each block, that is far beyond the shared memory limit of the GPU (16 KB per multiprocessor, i.e. per block): 1e8 floats is roughly 400 MB, versus 16 KB of shared memory.
Depending on how your points are distributed, you might be better off using a texture (and thus the texture cache) instead of shared memory.
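For example, with the texture reference API of that CUDA generation (later toolkits use texture objects instead), a rough sketch might look like the following; tex_xs, d_xs and the per-thread loop body are illustrative, not from the original post:

// 1D texture bound to the data sites; reads go through the texture cache
texture<float, 1, cudaReadModeElementType> tex_xs;

__global__ void interpolate_tex(float *g_xi, float *g_fi, unsigned int n, unsigned int p)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= p) return;

    float xi  = g_xi[tid];
    float acc = 0.0f;
    for (unsigned int j = 0; j < n; ++j) {
        float x = tex1Dfetch(tex_xs, j);   // cached read of data site j
        acc += x * xi;                     // placeholder; the real interpolation weight goes here
    }
    g_fi[tid] = acc;
}

// Host side, before the launch: bind the linear device array to the texture, e.g.
// cudaBindTexture(0, tex_xs, d_xs, n * sizeof(float));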