Shared Memory question

FlyingSquirrel · March 9, 2007, 8:20am

Hi,

I have a question about shared memory: can we copy a whole block of global memory to shared memory in one operation, or can we only do it by copying each element, like this:

__global__ void kernel(short *data, int size)
{
    int tid = threadIdx.x;
    extern __shared__ short block[];
    block[tid] = data[tid];
}

Let’s suppose that we want to do an IDCT: each thread block, running a single thread, transforms one 8x8 block of IDCT coefficients.

extern "C" void runTest(short* data, int width, int height, int block_size)
{
    int blocks_x = width  / block_size;
    int blocks_y = height / block_size;
    dim3 grid(blocks_x, blocks_y, 1);
    dim3 threads(1, 1, 1);   // one thread per block
    // d_data: device copy of data (allocation and copy not shown in this post)
    idct_kernel<<< grid, threads >>>(d_data, width);
}

__global__ void idct_kernel(short *data, int width)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    // each block handles the 8x8 tile whose top-left corner is (x*8, y*8)
    idct(data + y*8*width + x*8, width);
}

If we want to do the transform in shared memory, we have to copy the 8x8 block from global memory to shared memory. Is that possible?

Mark_Harris · March 9, 2007, 10:08am

Because threads run in parallel, the code you have in your first block does exactly what you are asking for.

There are lots of examples of using shared memory in the CUDA SDK.

Mark
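For the 8x8 case from the question, the per-thread copy could be sketched roughly as follows. This is only an illustration under stated assumptions: it launches 64 threads per block (not the single thread used in the post), the names `tile_kernel` and `run_tile_check` are hypothetical, and the in-tile transpose merely stands in for the real idct() step, which is not shown in the thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 8  // 8x8 block of coefficients, as in the question

// One thread block per 8x8 tile, one thread per coefficient.
// Hypothetical kernel: the transpose is a placeholder for the real transform.
__global__ void tile_kernel(short *data, int width)
{
    __shared__ short tile[TILE][TILE];

    int gx = blockIdx.x * TILE + threadIdx.x;  // column in the image
    int gy = blockIdx.y * TILE + threadIdx.y;  // row in the image

    // Each thread copies one element; the 64 copies run in parallel.
    tile[threadIdx.y][threadIdx.x] = data[gy * width + gx];
    __syncthreads();  // the whole tile is now visible to every thread

    // Read an element loaded by another thread and write it back.
    data[gy * width + gx] = tile[threadIdx.x][threadIdx.y];
}

// Host helper (hypothetical): width and height must be multiples of 8.
// Returns true if every tile came back transposed.
bool run_tile_check(int width, int height)
{
    std::vector<short> h(width * height), out(width * height);
    for (int i = 0; i < width * height; ++i) h[i] = (short)(i % 30000);

    short *d = nullptr;
    size_t bytes = h.size() * sizeof(short);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    dim3 grid(width / TILE, height / TILE, 1);
    dim3 threads(TILE, TILE, 1);
    tile_kernel<<<grid, threads>>>(d, width);

    cudaError_t err = cudaMemcpy(out.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            int by = y / TILE, ty = y % TILE, bx = x / TILE, tx = x % TILE;
            short expected = h[(by * TILE + tx) * width + (bx * TILE + ty)];
            if (out[y * width + x] != expected) return false;
        }
    return true;
}
```

Calling `run_tile_check(16, 24)` from a host `main` would exercise six tiles. The `__syncthreads()` is what makes it safe for a thread to read an element that a different thread loaded.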

FlyingSquirrel · March 9, 2007, 10:30am

I mean, can we do something like this:

__global__ void kernel(short *data, int size)
{
    extern __shared__ short block[];
    // we suppose that data size is less than 16KB
    cudaMemcpy(block, data, size, cudaMemcpyDeviceToDevice);
    // do something...
}

When I try to compile this code, I get the following compiler error:

Error: Calls are not supported (found non-inlined call to cudaMemcpy)

If I understand correctly, copying element by element from global to shared memory will be much slower than copying one large block.

Cyril_Zeller · March 9, 2007, 12:20pm

cudaMemcpy() is part of the host runtime component, which can only be used by host functions, not device functions (see section 4.5 of the programming guide).

The most efficient way to load data from global memory is to have each thread load separate data in a way that allows the hardware to coalesce the separate loads into bigger ones and reach maximum memory bandwidth. The simplest way to do so is to have the threads of a warp load successive data elements from global memory, so that each half-warp performs a single memory access.
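A rough sketch of such a cooperative copy, written as a stand-in for the cudaMemcpy() call attempted above, might look like this. The helper `block_copy`, the kernel `reverse_chunks`, the host function `run_copy_check` and the block size of 128 are all hypothetical choices made for illustration; none of them is a CUDA API.

```cuda
#include <cuda_runtime.h>
#include <vector>

// All threads of the block cooperate: consecutive threads read consecutive
// elements, which is the access pattern the hardware can coalesce.
__device__ void block_copy(short *dst, const short *src, int count)
{
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        dst[i] = src[i];
    __syncthreads();  // must be reached by every thread of the block
}

// Each block stages its chunk in shared memory, then writes it out reversed,
// so every thread consumes an element that another thread loaded.
__global__ void reverse_chunks(const short *in, short *out, int chunk)
{
    extern __shared__ short block[];  // size comes from the launch configuration

    block_copy(block, in + blockIdx.x * chunk, chunk);

    for (int i = threadIdx.x; i < chunk; i += blockDim.x)
        out[blockIdx.x * chunk + i] = block[chunk - 1 - i];
}

// Host helper: returns true if every chunk came back reversed.
bool run_copy_check(int num_chunks, int chunk)
{
    int n = num_chunks * chunk;
    std::vector<short> h(n), res(n);
    for (int i = 0; i < n; ++i) h[i] = (short)(i % 30000);

    short *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(short);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h.data(), bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    reverse_chunks<<<num_chunks, 128, chunk * sizeof(short)>>>(d_in, d_out, chunk);

    cudaError_t err = cudaMemcpy(res.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int c = 0; c < num_chunks; ++c)
        for (int i = 0; i < chunk; ++i)
            if (res[c * chunk + i] != h[c * chunk + chunk - 1 - i]) return false;
    return true;
}
```

The chunk must fit in the shared memory available to one block; `run_copy_check(4, 1000)` uses 2000 bytes per block.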

The hardware is also capable of reading 64-bit or 128-bit words from global memory in a single instruction, provided the data is properly aligned.

Please look at section 6.1.2.1 of the programming guide for more details and at the SDK for practical examples, as suggested by Mark.
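As an illustration of the wider loads, the built-in `short4` type lets one thread fetch four shorts (64 bits) at once. The sketch below assumes the pointer is 8-byte aligned (pointers returned by cudaMalloc() are) and that the element count is a multiple of 4; `sum4_kernel` and `run_vec_check` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread reads four consecutive shorts as one short4 value, which the
// compiler can turn into a single 64-bit load, and stores their sum.
__global__ void sum4_kernel(const short *in, int *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        short4 v = reinterpret_cast<const short4 *>(in)[i];
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// Host helper: n4 is the number of short4 groups (4 * n4 shorts in total).
// Returns true if every group sum matches the CPU result.
bool run_vec_check(int n4)
{
    std::vector<short> h(4 * n4);
    std::vector<int> res(n4);
    for (int i = 0; i < 4 * n4; ++i) h[i] = (short)(i % 50);

    short *d_in = nullptr;
    int *d_out = nullptr;
    if (cudaMalloc(&d_in, h.size() * sizeof(short)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n4 * sizeof(int)) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h.data(), h.size() * sizeof(short), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n4 + threads - 1) / threads;
    sum4_kernel<<<blocks, threads>>>(d_in, d_out, n4);

    cudaError_t err = cudaMemcpy(res.data(), d_out, n4 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n4; ++i) {
        int expected = h[4 * i] + h[4 * i + 1] + h[4 * i + 2] + h[4 * i + 3];
        if (res[i] != expected) return false;
    }
    return true;
}
```

The cast is only valid when the alignment assumption holds; a pointer offset by an odd number of shorts would need the scalar path instead.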

You also have to understand that the idea behind CUDA’s massively multi-threaded architecture is to have enough active and math-intensive threads to hide memory latency: While some threads are waiting for data from memory, other threads that process previously loaded data keep the machine busy. See section 6.2, for example, to get more details on how to make sure this happens.

Cyril

FlyingSquirrel · March 9, 2007, 5:43pm

Mark, Cyril, thanks for your answers.

SaurabhSatardekar · November 25, 2016, 4:54pm

Hi,
I am working on a join calculation project on the GPU.
I want to read a char array of 20 characters (a 20-character string) using a single thread.
I have written code for that, but it is not working and I don’t get any errors. Can you please help me?

Kernel code:

__global__ void joinCalculation(char GPU_columnFromFirstFile[],
                                char GPU_columnFromSecondFile[],
                                int GPU_resultArray[],
                                int totalLinesInFirstFile,
                                int totalLinesInSecondFile,
                                int columnsNeedToCopiedFromFirstFile,
                                int columnsNeedToCopiedFromSecondFile)
{
	// blockDim (not blockDimx) is the built-in block size variable
	int threadIDx=(blockIdx.x*blockDim.x)+threadIdx.x;
	// one shared row per thread of the block: 40 chars = two 20-char strings,
	// so the launch must pass 40 bytes of dynamic shared memory per thread
	int row=(threadIdx.y*blockDim.x)+threadIdx.x;
	extern __shared__ char Array[][40];

	int index1=0,index2=0,i;
	int counter=0;
	// location of the 0th character of each string to be copied:
	// index1 = string from first file, index2 = string from second file
	// (index arithmetic kept as posted, with threadID replaced by threadIDx)
	index1=(int)((((threadIDx%totalLinesInSecondFile)%blockDim.x)/columnsNeedToCopiedFromSecondFile)+((threadIDx/totalLinesInSecondFile)*columnsNeedToCopiedFromSecondFile)*20);
	index2=(int)((((threadIDx%totalLinesInSecondFile)%blockDim.x)/columnsNeedToCopiedFromSecondFile)*20);

	for(i=0;i<20;i++)
	{
		// the undefined "index" is replaced by index1/index2, and the second
		// string is assumed to come from the second column
		Array[row][i]=GPU_columnFromFirstFile[index1+i];
		Array[row][i+20]=GPU_columnFromSecondFile[index2+i];
	}

	__syncthreads();

	for(i=0;i<20;i++)
	{
		if(Array[row][i]==Array[row][i+20])
			counter++;
	}

	__syncthreads();

	// GPU_resultArray is only ever set to 1, so it must be zeroed before launch
	if(counter==20)
		GPU_resultArray[threadIDx]=1;
}
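Since the post mentions that no errors are reported, it may help to note that a kernel launch does not return an error code by itself; the runtime has to be queried. The sketch below shows a much simplified comparison kernel together with such a check. It is an assumption about what the join needs, not a fix of the code above: it compares every 20-character key of one column with every key of the other, reads directly from global memory, and all names (`compare_keys`, `launch_ok`, `run_join_check`) are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>
#include <vector>

#define KEY_LEN 20  // fixed-width keys of 20 characters, as in the post

// One thread per pair (row r1 of the first column, row r2 of the second).
// result[r1 * n2 + r2] is 1 when both 20-character keys are equal, else 0.
__global__ void compare_keys(const char *col1, const char *col2,
                             int *result, int n1, int n2)
{
    int pair = blockIdx.x * blockDim.x + threadIdx.x;
    if (pair >= n1 * n2) return;

    const char *a = col1 + (pair / n2) * KEY_LEN;
    const char *b = col2 + (pair % n2) * KEY_LEN;

    int equal = 1;
    for (int i = 0; i < KEY_LEN; ++i)
        if (a[i] != b[i]) equal = 0;
    result[pair] = equal;
}

// Query the runtime after a launch: first for launch errors, then for
// errors raised while the kernel was running.
bool launch_ok(const char *what)
{
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return err == cudaSuccess;
}

// Host helper with made-up data: keys "aaa...", "bbb...", "ccc..." in the
// first column and "bbb...", "xxx..." in the second.
bool run_join_check()
{
    const int n1 = 3, n2 = 2;
    const char fill1[n1] = {'a', 'b', 'c'};
    const char fill2[n2] = {'b', 'x'};
    std::vector<char> c1(n1 * KEY_LEN), c2(n2 * KEY_LEN);
    for (int r = 0; r < n1; ++r) memset(&c1[r * KEY_LEN], fill1[r], KEY_LEN);
    for (int r = 0; r < n2; ++r) memset(&c2[r * KEY_LEN], fill2[r], KEY_LEN);

    char *d1 = nullptr, *d2 = nullptr;
    int *dres = nullptr;
    cudaMalloc(&d1, c1.size());
    cudaMalloc(&d2, c2.size());
    cudaMalloc(&dres, n1 * n2 * sizeof(int));
    cudaMemcpy(d1, c1.data(), c1.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(d2, c2.data(), c2.size(), cudaMemcpyHostToDevice);

    int threads = 64;
    int blocks = (n1 * n2 + threads - 1) / threads;
    compare_keys<<<blocks, threads>>>(d1, d2, dres, n1, n2);
    bool ok = launch_ok("compare_keys");

    int res[n1 * n2] = {0};
    if (ok)
        ok = cudaMemcpy(res, dres, sizeof(res), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dres);

    // Only pair 2 (row 1 of the first column, row 0 of the second) matches.
    for (int p = 0; p < n1 * n2; ++p)
        if (res[p] != (p == 2 ? 1 : 0)) ok = false;
    return ok;
}
```

Writing both 0 and 1 into the result avoids having to zero the array before the launch, and the bounds check at the top keeps surplus threads in the last block from touching memory outside the columns.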
