Moving a function from global to shared memory

Hi, the code below works out of global memory. My problem is that every global memory access carries a latency of 400-600 clock cycles, which greatly slows down my program. I’m not sure how to go about changing my functions to use shared memory instead of global memory. Can someone show me how, or point me in the right direction?

__global__ void csourced(double*, double, double, double,
                         double, double, double,
                         double, double, double,
                         double, double,
                         double*, double*, double*);

csourced<<<dimGrid, dimBlock>>>(thd, *crp, *cphip, *ctp,
                                *cdrdt, *cd2rdt2, *cd3rdt3,
                                *cdthdt, *cd2thdt2, *cd3thdt3,
                                *cdphidt, *cd2phidt2,
                                rd, tred, timd);

cudaMemcpy( tre, tred, sizex*sizey, cudaMemcpyDeviceToHost );
cudaMemcpy( tim, timd, sizex*sizey, cudaMemcpyDeviceToHost );
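
For context, "moving to shared memory" usually means staging a block's working set from global memory into a `__shared__` array once, synchronizing, and then doing the repeated reads from that array. The body of csourced is not shown, so the following is only a minimal sketch of that pattern on a stand-in computation (a 3-point average with clamped edges). The names `smooth3`, `smooth3_host`, and `TILE` are hypothetical and do not come from the code above. Note that the scalar arguments passed by value to a kernel are not read from global memory, so only array accesses (such as those through thd, rd, tred, and timd) could benefit, and only if threads in a block reuse each other's elements.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 256  // threads per block (hypothetical choice; must match the launch)

// Hypothetical stand-in kernel, NOT csourced: out[i] is the average of
// in[i-1], in[i], in[i+1], with indices clamped at the array ends.
__global__ void smooth3(const double* in, double* out, int n)
{
    // One tile of input plus one halo cell on each side, shared by the block.
    __shared__ double tile[TILE + 2];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // local index past the left halo

    // Each thread copies one element from global to shared memory
    // (out-of-range threads copy a clamped element so every slot is filled).
    tile[l] = in[min(g, n - 1)];

    // The first and last thread of the block also load the halo cells.
    if (threadIdx.x == 0)
        tile[0] = in[max(g - 1, 0)];
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = in[min(g + 1, n - 1)];

    // Wait until the whole tile is loaded before any thread reads from it.
    __syncthreads();

    // All three reads now hit shared memory instead of global memory.
    if (g < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0;
}

// Hypothetical host wrapper: copies the input to the device, launches the
// kernel with TILE threads per block, and copies the result back.
std::vector<double> smooth3_host(const std::vector<double>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<double> h_out(n);
    if (n == 0) return h_out;

    size_t bytes = n * sizeof(double);  // byte count, not element count
    double* d_in = nullptr;
    double* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;  // round up to cover all n elements
    smooth3<<<blocks, TILE>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

In this sketch each input element is read from global memory about once per block and then reused by up to three threads from shared memory. If each thread of a kernel touches its array elements only once, staging them through shared memory would not be expected to help.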