Cache data in shared memory for subsequent calls

I have an app for which I believe it's best to cache data in shared memory across many, many subsequent calls. The scenario is:

  • the cached data never changes

  • the app then runs a long-lived process (looping forever) that repeatedly invokes the compute function on user input

Could someone verify whether the code below will work? I'd like to know whether s_data will hold the right data across the first kernel call and the subsequent kernel calls…

Thanks in advance. Ben

The code follows:

extern __shared__ float s_data[];

__global__ void kernelLoadData(float* datapool)
{
	// copy data into s_data

	....

	threadIdx.x ...

}

__global__ void kernelCompute(float* signal)
{
	// access s_data and compute it with signal...

	....

}

int main(){

	// load data from disk:

	float* datapool;

	...

	// compute blockDim and thread count:

	dim3 n_block, block_size;

	// load once to each block:

	kernelLoadData<<<n_block, block_size>>>(datapool);

	float *signal;

	while(true){

		// wait for input

		....

		signal = ....	// some input func

		// compute from the input:

		kernelCompute<<<n_block, block_size>>>(signal);

		// read back from computed data...

	}

}

The contents of shared memory at the beginning of any kernel call are undefined, so no, that will not work.

Perhaps you would want to copy the data into constant memory instead… If all threads of a warp access the same element, this could be a good option for you.

In your main function you just do a cudaMemcpyToSymbol(…) to a __constant__ variable that has been declared at global scope.

__constant__ float read_only_array[length];
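A minimal sketch of that approach (LENGTH and the symbol name are placeholders; error checking omitted):

	#define LENGTH 1024               // placeholder size, must fit in 64 KB of constant memory

	__constant__ float read_only_array[LENGTH];

	// host side, after loading datapool from disk:
	cudaMemcpyToSymbol(read_only_array, datapool, LENGTH * sizeof(float));

	// device code can then read read_only_array[i] directly in any kernel;
	// the contents persist across kernel launches.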

Thanks, tmurray and Jimmy.

I have finished the work ;). Anyway, what I did was: load the data into global memory and let it stay cached there. The data is then read into shared memory for faster access. Both shared and constant memory are too small, since I have at least 20 MB of float values. I'd need 16 GTX 470s to fit all of them in shared or constant memory ;).

The bad news is I cannot usefully cache the 20 MB of data in shared memory, since each float is only used once.

Anyhow, it works, and it's still way faster than the CPU ;), roughly 20 times (40 s → 2 s).
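For anyone landing on this thread later, a rough sketch of that pattern (names, sizes, and the actual computation are illustrative, not my real code):

	__global__ void kernelCompute(const float* datapool, const float* signal, float* out)
	{
		__shared__ float tile[256];                 // one tile per block; size illustrative

		int i = blockIdx.x * blockDim.x + threadIdx.x;
		tile[threadIdx.x] = datapool[i];            // stage from global into shared memory
		__syncthreads();

		out[i] = tile[threadIdx.x] * signal[i];     // placeholder computation
	}

	// Host side: datapool is allocated once with cudaMalloc, copied once with
	// cudaMemcpy, and stays resident in device global memory; the same pointer
	// is passed to every kernelCompute launch inside the input loop.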
