Memory access efficiency with sequential read/write.

Hello.
I have code that (with some details omitted) looks like this:

__global__ void do_stuff(const char *input, char *output /*, other params omitted */)
{
	State local_state = initial_state; // per-thread state, details omitted
	for (int i = 0; i < some_length; i++)
	{
		// each step consumes one input byte, emits one output byte,
		// and updates local_state in place
		output[i] = some_actions(input[i], &local_state);
	}
}

I am interested in the following question: is there a way to increase the execution speed of this kernel by reading sequential blocks of memory at once (with memcpy, or some other function that I don’t know of)?

__global__ void do_stuff_sequential_access(const char *input, char *output /*, other params omitted */)
{
	char input_output[SIZE]; // a local "cache" to reduce the number of accesses to global memory
	State local_state = initial_state;

	for (int i = 0; i < some_length; i += SIZE) // assume that some_length is divisible by SIZE
	{
		memcpy(input_output, input + i, SIZE * sizeof(char));
		for (int j = 0; j < SIZE; j++)
		{
			input_output[j] = some_actions(input_output[j], &local_state);
		}

		memcpy(output + i, input_output, SIZE * sizeof(char));
	}
}

I.e., is it possible to read/write “a lot” of memory with one “big” operation instead of many “small” operations?

Try staging what is in global memory in shared memory, do the computations on it, and then finally write the results back to global memory. Using memcpy isn’t as good as it would seem, at least in my experience. Be wary though - with shared memory, bank conflicts may occur.
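Roughly, the staging pattern I mean looks like the sketch below. TILE, transform() and the launch shape are placeholders of mine, not from your code, and your carried local_state is left out:

#define TILE 256                      // assumed threads per block

__device__ char transform(char c)     // stand-in for the real per-byte work
{
	return c ^ 0x5A;
}

__global__ void do_stuff_staged(const char *input, char *output, int n)
{
	__shared__ char tile[TILE];

	int idx = blockIdx.x * TILE + threadIdx.x;

	if (idx < n)
		tile[threadIdx.x] = input[idx];  // coalesced load: global -> shared
	__syncthreads();                     // the barrier matters once threads read each other’s bytes

	if (idx < n)
		output[idx] = transform(tile[threadIdx.x]); // compute, then coalesced store
}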

MK

Thank you for your reply, cmaster.matso!
The problem is that I can only use local memory (or shared, but only if there is one thread per block), because the threads do not have any work to collaborate on. If memcpy isn’t efficient, then I have another question: will prefetching global memory (by that I mean: read a sequential block of memory, perform the calculation, then output the resulting block) help the SM schedule reads and writes?

  1. Interleave your data so that, for the input, bytes 0-3 are processed by thread 0, bytes 4-7 by thread 1, bytes 8-11 by thread 2, etc.
  2. In your processing loop, read in 4 bytes at a time per thread, but use a 4-byte quantity to do the read, such as unsigned int. One approach is to compute the appropriate byte offset into your input array, cast that pointer to an unsigned int pointer, then do the read.
  3. Each thread will then have 4 bytes to process (cast the unsigned int value read from input back to a sequence of 4 chars).
  4. Update the input pointer to point to the next interleaved block, and repeat the process (see the sketch after this list).

This should run faster.
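A minimal sketch of steps 2-4, assuming the interleaving of step 1 has been done, some_length is a multiple of 4 and the pointers are 4-byte aligned; transform() stands in for the per-byte processing and any carried state is omitted:

__device__ char transform(char c)        // placeholder for the real per-byte work
{
	return c ^ 0x5A;
}

__global__ void do_stuff_word_reads(const char *input, char *output, int some_length)
{
	// view the byte arrays as arrays of 4-byte words
	const unsigned int *in4 = reinterpret_cast<const unsigned int *>(input);
	unsigned int *out4 = reinterpret_cast<unsigned int *>(output);
	int nwords = some_length / 4;

	for (int w = blockIdx.x * blockDim.x + threadIdx.x;
	     w < nwords;
	     w += gridDim.x * blockDim.x)       // step 4: move on to the next interleaved block
	{
		unsigned int word = in4[w];     // step 2: one 4-byte read per thread
		char bytes[4];
		memcpy(bytes, &word, 4);        // step 3: unpack the word into 4 chars

		for (int j = 0; j < 4; j++)
			bytes[j] = transform(bytes[j]);

		memcpy(&word, bytes, 4);        // repack
		out4[w] = word;                 // one 4-byte write per thread
	}
}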

If you are unable to interleave the input data, you can still derive some benefit by having each thread read a 4-, 8-, or possibly 16-byte quantity (a vector type) instead of doing single-byte reads. The approach is similar: cast your pointer back and forth between char * and int4 *, for example. This should give you a small benefit over reading individual bytes at a time.
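For instance, a rough sketch of the int4 variant, assuming the pointers are 16-byte aligned, some_length is a multiple of 16, and transform() again stands in for the real per-byte work:

__device__ char transform(char c) { return c ^ 0x5A; }  // placeholder work

__global__ void do_stuff_int4(const char *input, char *output, int some_length)
{
	const int4 *in16 = reinterpret_cast<const int4 *>(input);
	int4 *out16 = reinterpret_cast<int4 *>(output);

	int v = blockIdx.x * blockDim.x + threadIdx.x;
	if (v < some_length / 16)
	{
		int4 chunk = in16[v];                           // one 16-byte read
		char *bytes = reinterpret_cast<char *>(&chunk); // view the vector as 16 chars
		for (int j = 0; j < 16; j++)
			bytes[j] = transform(bytes[j]);
		out16[v] = chunk;                               // one 16-byte write
	}
}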

I don’t know much about SM scheduling, but shared memory has much lower latency than global memory, so using it properly can make things faster, especially when lots of read/write operations were originally performed on global memory. Here’s a link to a Stack Overflow topic about the latency of the different memory spaces - hope that will be helpful.

MK