Shared Memory question


I have a question about shared memory: can we copy a block of global memory to shared memory in one operation, or can we only do it by copying each element like this:

__global__ void kernel(short *data, int size)
{
    int tid = threadIdx.x;

    extern __shared__ short block[];

    block[tid] = data[tid];
}


Let's suppose we want to do an IDCT transform: each block, using a single thread, performs the transform on one 8x8 block of IDCT coefficients.


extern "C" void runTest(short *d_data, int width, int height, int block_size)
{
    int blocks_x = width  / block_size;
    int blocks_y = height / block_size;

    dim3 grid(blocks_x, blocks_y, 1);
    dim3 threads(1, 1, 1);

    idct_kernel<<< grid, threads >>>(d_data, width);
}


__global__ void idct_kernel(short *data, int width)
{
    int x = blockIdx.x;
    int y = blockIdx.y;

    idct(data + y*8*width + x*8, width);
}


If we want to do the transform using shared memory, we need to copy the 8x8 block from global to shared memory. Is that possible?

Because threads run in parallel, the code you have in your first block does exactly what you are asking for.

There are lots of examples of using shared memory in the CUDA SDK.
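To make this concrete for the IDCT case: the per-element copy can be done cooperatively by launching 64 threads per block and letting each thread fetch one coefficient of the 8x8 tile. A minimal sketch (the kernel name and the assumption that `width` and `height` are multiples of 8 are mine, not from the original post):

```cuda
// Launch with dim3 threads(8, 8) so each block has 64 threads,
// one per element of the 8x8 tile.
__global__ void idct_kernel_shared(short *data, int width)
{
    __shared__ short tile[8][8];

    int tx = threadIdx.x;           // column within the tile (0..7)
    int ty = threadIdx.y;           // row within the tile (0..7)
    int x  = blockIdx.x * 8 + tx;   // global column
    int y  = blockIdx.y * 8 + ty;   // global row

    // All 64 threads copy one element each; together they move the
    // whole 8x8 block into shared memory in parallel.
    tile[ty][tx] = data[y * width + x];
    __syncthreads();                // make the tile visible to all threads

    // ... perform the IDCT on tile[][] here ...

    data[y * width + x] = tile[ty][tx];
}
```

Because the eight threads of each row read consecutive shorts, the hardware can combine their loads into a small number of memory transactions.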


I mean, can we do something like this:

__global__ void kernel(short *data, int size)
{
    extern __shared__ short block[];

    // we suppose that the data size is less than 16 KB
    cudaMemcpy(block, data, size, cudaMemcpyDeviceToDevice);

    // do something...
}


When I try to compile this code, I get the following compiler error:

Error: Calls are not supported (found non-inlined call to cudaMemcpy)

If I understand correctly, element-by-element copying from global to shared memory will be much slower than copying one large block.

cudaMemcpy() is part of the host runtime component, which can only be used by host functions, not device functions (see section 4.5 of the programming guide).
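Inside a kernel the copy is written as ordinary assignments rather than a cudaMemcpy() call. A hedged sketch of the second kernel above with the call replaced by a strided per-thread loop (assuming `size` counts elements and the array fits in shared memory):

```cuda
__global__ void kernel(short *data, int size)
{
    extern __shared__ short block[];

    // Each thread copies elements size/blockDim.x apart; consecutive
    // threads touch consecutive addresses, so the loads coalesce.
    for (int i = threadIdx.x; i < size; i += blockDim.x)
        block[i] = data[i];
    __syncthreads();   // wait until the whole block is loaded

    // do something...
}
```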

The most efficient way to load data from global memory is to have each thread load separate data in a way that allows the hardware to coalesce each separate load into bigger ones to get maximum memory bandwidth. The simplest way to do so is to have each thread of a warp load successive data elements in global memory, that way each half of the warp does one memory access.

The hardware is also capable of reading 64-bit or 128-bit words from global memory in a single instruction, given proper data alignment.
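As an illustration of those wider loads (a sketch, assuming the pointers are 8-byte aligned and the element count is a multiple of 4): reinterpreting the short array as short4 lets each thread move four shorts in one 64-bit instruction.

```cuda
__global__ void copy_wide(short *in, short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // View the arrays as vectors of four shorts (64 bits each).
    short4 *in4  = reinterpret_cast<short4 *>(in);
    short4 *out4 = reinterpret_cast<short4 *>(out);

    // One 64-bit load and one 64-bit store per thread; consecutive
    // threads access consecutive short4 elements, so accesses coalesce.
    if (i < n / 4)
        out4[i] = in4[i];
}
```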

Please look at the programming guide for more details, and at the SDK for practical examples, as suggested by Mark.

You also have to understand that the idea behind CUDA’s massively multi-threaded architecture is to have enough active and math-intensive threads to hide memory latency: While some threads are waiting for data from memory, other threads that process previously loaded data keep the machine busy. See section 6.2, for example, to get more details on how to make sure this happens.


Mark, Cyril, thanks for your answers.

I am doing a join-calculation project using the GPU.
I want to read a char array of 20 characters (a string of 20 chars) using a single thread.
I have written code for that, but it's not working and I don't get any errors. Can you please help me?

Code of kernel:

__global__ void joinCalculation(char GPU_columnFromFirstFile[], char GPU_columnFromSecondFile[], int GPU_resultArray[], int totalLinesInFirstFile, int totalLinesInSecondFile, int columnsNeedToCopiedFromFirstFile, int columnsNeedToCopiedFromSecondFile)
{
    int threadIDx = (blockIdx.x * blockDim.x) + threadIdx.x;
    int threadIDy = (blockIdx.y * blockDim.y) + threadIdx.y;
    extern __shared__ char Array[][20];
    int index1 = 0, index2 = 0, i;
    bool check = false;
    int counter = 0;
    // location of the 0th character of the string to be copied:
    // index1 = string from the first file, index2 = string from the second file