Memory access efficiency with sequential read/write.

Hello.
I have code that (with some details omitted) looks like this:

__global__ void do_stuff(const char *input, char *output /*, other params omitted */)
{
	State local_state = initial_state; // per-thread state, details omitted
	for (int i = 0; i < some_length; i++)
	{
		// each step consumes one input byte, emits one output byte,
		// and updates local_state in place
		output[i] = some_actions(input[i], &local_state);
	}
}

I am interested in the following question: is there a way to increase the execution speed of this kernel by reading sequential blocks of memory at once (with memcpy, or some other function that I don’t know of)?

__global__ void do_stuff_sequential_access(const char *input, char *output /*, other params omitted */)
{
	char input_output[SIZE]; // a local "cache" to reduce the number of accesses to global memory
	State local_state = initial_state;

	for (int i = 0; i < some_length; i += SIZE) // assume that some_length is divisible by SIZE
	{
		memcpy(input_output, input + i, SIZE * sizeof(char));
		for (int j = 0; j < SIZE; j++)
		{
			input_output[j] = some_actions(input_output[j], &local_state);
		}

		memcpy(output + i, input_output, SIZE * sizeof(char));
	}
}

I.e., is it possible to read/write “a lot” of memory with one “big” operation instead of many “small” operations?

Try staging what is in global memory in shared memory, do the computations on it, and then finally write the results back to global memory. Using memcpy isn’t as good as it would seem, at least in my experience. Be wary though - with shared memory, bank conflicts may occur.
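Roughly, the staging pattern I mean looks like the sketch below. TILE, transform() and the launch shape are placeholders of mine, not from your code, and your carried local_state is left out:

#define TILE 256                      // assumed threads per block

__device__ char transform(char c)     // stand-in for the real per-byte work
{
	return c ^ 0x5A;
}

__global__ void do_stuff_staged(const char *input, char *output, int n)
{
	__shared__ char tile[TILE];

	int idx = blockIdx.x * TILE + threadIdx.x;

	if (idx < n)
		tile[threadIdx.x] = input[idx];  // coalesced load: global -> shared
	__syncthreads();                     // the barrier matters once threads read each other’s bytes

	if (idx < n)
		output[idx] = transform(tile[threadIdx.x]); // compute, then coalesced store
}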

MK

Thank you for your reply, cmaster.matso!
The problem is that I can only use local memory (or shared, but only if there is one thread per block), because the threads do not have any work to collaborate on. If memcpy isn’t efficient, then I have another question: will prefetching global memory (by that I mean: read a sequential block of memory, perform the calculation, then output the resulting block) help the SM schedule reads and writes?

  1. Interleave your data so that, for the input, bytes 0-3 are processed by thread 0, bytes 4-7 by thread 1, bytes 8-11 by thread 2, etc.
  2. In your processing loop, read in 4 bytes at a time per thread, but use a 4-byte quantity to do the read, such as unsigned int. One approach is to compute the appropriate byte offset into your input array, cast that pointer to an unsigned int pointer, then do the read.
  3. Each thread will then have 4 bytes to process (cast the unsigned int value read from input back to a sequence of 4 chars).
  4. Update the input pointer to point to the next interleaved block, and repeat the process (see the sketch after this list).

This should run faster.
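A minimal sketch of steps 2-4, assuming the interleaving of step 1 has been done, some_length is a multiple of 4 and the pointers are 4-byte aligned; transform() stands in for the per-byte processing and any carried state is omitted:

__device__ char transform(char c)        // placeholder for the real per-byte work
{
	return c ^ 0x5A;
}

__global__ void do_stuff_word_reads(const char *input, char *output, int some_length)
{
	// view the byte arrays as arrays of 4-byte words
	const unsigned int *in4 = reinterpret_cast<const unsigned int *>(input);
	unsigned int *out4 = reinterpret_cast<unsigned int *>(output);
	int nwords = some_length / 4;

	for (int w = blockIdx.x * blockDim.x + threadIdx.x;
	     w < nwords;
	     w += gridDim.x * blockDim.x)       // step 4: move on to the next interleaved block
	{
		unsigned int word = in4[w];     // step 2: one 4-byte read per thread
		char bytes[4];
		memcpy(bytes, &word, 4);        // step 3: unpack the word into 4 chars

		for (int j = 0; j < 4; j++)
			bytes[j] = transform(bytes[j]);

		memcpy(&word, bytes, 4);        // repack
		out4[w] = word;                 // one 4-byte write per thread
	}
}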

If you are unable to interleave the input data, you can still derive some benefit by having each thread read a 4-, 8-, or possibly 16-byte quantity (a vector type) instead of doing single-byte reads. The approach is similar: cast your pointer back and forth between char * and int4 *, for example. This should give you a small benefit over reading individual bytes at a time.
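For instance, a rough sketch of the int4 variant, assuming the pointers are 16-byte aligned, some_length is a multiple of 16, and transform() again stands in for the real per-byte work:

__device__ char transform(char c) { return c ^ 0x5A; }  // placeholder work

__global__ void do_stuff_int4(const char *input, char *output, int some_length)
{
	const int4 *in16 = reinterpret_cast<const int4 *>(input);
	int4 *out16 = reinterpret_cast<int4 *>(output);

	int v = blockIdx.x * blockDim.x + threadIdx.x;
	if (v < some_length / 16)
	{
		int4 chunk = in16[v];                           // one 16-byte read
		char *bytes = reinterpret_cast<char *>(&chunk); // view the vector as 16 chars
		for (int j = 0; j < 16; j++)
			bytes[j] = transform(bytes[j]);
		out16[v] = chunk;                               // one 16-byte write
	}
}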

I don’t know much about SM scheduling, but shared memory has much lower latency than global memory, so using it properly can make things faster, especially when lots of read/write operations were originally performed on global memory. Here’s a link to a Stack Overflow topic about the latency of the different memory spaces - hope that will be helpful.

MK