Copy to shared memory

I have ~1e7 threads that all work with data from the same array, which can itself be very large (~1e8 elements). So I'm wondering whether the following approach is sensible:

// s_xs - data sites (n elements, n ~ 1e8)
// s_ys - function values (n elements)
// g_xi - the points where the function should be interpolated (p elements, p ~ 1e7)
// g_fi - the interpolated function values (p elements)

__global__ void interpolate(float *g_xi, float *g_fi, unsigned int n, unsigned int p)
{
	// data sites and the function values
	__shared__ float s_xs[BATCH_SIZE];
	__shared__ float s_fs[BATCH_SIZE];

	for (int i = 0; i < (n + BATCH_SIZE - 1) / BATCH_SIZE; i++)
	{
		// copy a chunk from global memory into shared memory

		// start collecting partial sums
	}

	// write the result to global memory
}

Thank you!

From your comments, what I understand is that you expect s_xs and s_ys (the shared memory arrays) to hold all n elements.
But in your CUDA kernel you have:
__shared__ s_xs[BATCH_SIZE];
__shared__ s_fs[BATCH_SIZE];
There is a mismatch, right?
__shared__ s_xs[BATCH_SIZE]; does not allocate memory per thread; it allocates shared memory once per block. To copy the values from global memory into shared memory, you have to distribute the task among the threads of the block.
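A cooperative copy is usually written so that each thread of the block loads one element of the current batch. A minimal sketch, assuming blockDim.x equals BATCH_SIZE and that the data arrays are passed to the kernel as extra parameters g_xs and g_ys (hypothetical names, not in your current signature):

```cuda
// Inside the batch loop: each thread copies one element of the chunk.
// g_xs/g_ys are assumed extra kernel parameters holding the data sites
// and function values in global memory.
for (unsigned int batch = 0; batch < (n + BATCH_SIZE - 1) / BATCH_SIZE; batch++)
{
	unsigned int idx = batch * BATCH_SIZE + threadIdx.x;

	// guard against reading past the end of the array in the last batch
	if (idx < n) {
		s_xs[threadIdx.x] = g_xs[idx];
		s_fs[threadIdx.x] = g_ys[idx];
	}
	__syncthreads();   // wait until the whole chunk is in shared memory

	// ... accumulate partial sums from s_xs/s_fs here ...

	__syncthreads();   // make sure all threads are done before the next chunk
}
```

The two __syncthreads() calls matter: the first one ensures no thread reads shared memory before it is filled, and the second prevents a fast thread from overwriting a chunk that slower threads are still using.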

Also, if you need shared memory of size n (in your case 1e8 elements) for each block, that is far beyond the shared memory limit of the GPU (16 KB per multiprocessor, i.e. per block).
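Rather than hard-coding a limit, you can query what your device actually provides (newer GPUs offer more than 16 KB per block); a small host-side check:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
	// Query the per-block shared memory limit for device 0.
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);
	printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
	return 0;
}
```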

Depending on how your points are distributed you might be better off using a texture (and thus the texture cache) instead of shared memory.
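One way to do that on hardware that supports texture objects (compute capability 3.0+, which may be newer than the GPU discussed here) is to bind the data-site array to a 1D texture so reads go through the texture cache. A sketch, where d_xs and n are assumed names for the device array and its length:

```cuda
#include <cuda_runtime.h>

// Wrap a linear device array in a 1D texture object so kernel reads
// of it are served through the texture cache.
cudaTextureObject_t makeDataTexture(float *d_xs, unsigned int n)
{
	cudaResourceDesc res = {};
	res.resType = cudaResourceTypeLinear;
	res.res.linear.devPtr = d_xs;
	res.res.linear.desc = cudaCreateChannelDesc<float>();
	res.res.linear.sizeInBytes = n * sizeof(float);

	cudaTextureDesc td = {};
	td.readMode = cudaReadModeElementType;  // return raw float values

	cudaTextureObject_t tex;
	cudaCreateTextureObject(&tex, &res, &td, nullptr);
	return tex;
}

// Inside the kernel, pass the texture object as a parameter and fetch:
//   float x = tex1Dfetch<float>(tex, i);
```

Whether this beats shared memory depends on the access pattern: if many threads of a block touch the same small window of data sites, shared memory staging wins; if accesses are scattered but have some locality, the texture cache tends to do better.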