Why has beginner's experiment to demonstrate L1 cache incoherency failed.

RobertE · March 23, 2015, 11:57am

L1 Cache coherency of CUDA blocks

Device global memory is cached in the same L1 cache for all threads in a block, but we are warned that the L1 caches for different blocks can become incoherent.

Putting this together with the fact that data is transferred in 128 Byte “chunks”, I wondered what would
happen if two blocks wrote to different parts of the same chunk of global memory. I expected that
the two L1 caches would become incoherent and the results eventually written to RAM would be indeterminate.

However, my first experiment shows no evidence of this behaviour:

////////////////////// EXPERIMENTS KERNELS START ///////////////////////////////

//assume dimBlock(WARPSIZE, 4), dimGrid(32)
__global__ void Experiment_On_4KBytes_Kernel_A(uint8_t * __restrict__ pBytes)
{
	unsigned offset = 128 * threadIdx.x + 4 * blockIdx.x + threadIdx.y;
	*(pBytes + offset) = offset % 43; //prime number hash
}

//assume 1 thread
__global__ void Experiment_On_4KBytes_Kernel_B(uint8_t * __restrict__ pBytes)
{
	unsigned errs = 0;

	for (unsigned chunk = 0; chunk < 32; ++chunk)
		for (unsigned block = 0; block < 32; ++block)
			for (unsigned word = 0; word < 4; ++word)
			{
				unsigned offset = 128 * chunk + 4 * block + word;
				errs += ( (unsigned) pBytes[offset] != offset % 43);
			}

	printf("EXPERIMENT RESULT: %u\n",errs); //the result was 0, even though 32 different blocks wrote in each chunk of 128 bytes! 
}

////////////////////// EXPERIMENTS KERNELS END  ///////////////////////////////

.
.
.
////////////////////// EXPERIMENTS START ///////////////////////////////
	{
		dim3 dimBlock(WARPSIZE, 4), dimGrid(32);

		Experiment_On_4KBytes_Kernel_A<<<dimGrid, dimBlock, 0, stream>>>(pBytestream);

		Experiment_On_4KBytes_Kernel_B<<<1, 1, 0, stream>>>(pBytestream);
	}
////////////////////// EXPERIMENTS END  ///////////////////////////////

This suggests to me that either my experiment is not valid or there is a feature in Invidia GPUs to prevent
WAW hazards, perhaps using the following idea I found on the internet:

Write-back big trick: keep track of whether other caches also contain a cached line. If not, a cache has an “exclusive” on the line, and can read and write the line as if it were the only CPU.

reference: inst.eecs.berkeley.edu/~cs194-6/fa08/ppt/lec10.ppt

Any ideas on the experiment or on the GPU operation would be very helpful.

Robert_Crovella · March 23, 2015, 2:39pm

What sort of GPU are you running on?

RobertE · March 24, 2015, 9:54am

The gpu is a GeForce GTX 7800 Ti.

Robert_Crovella · March 24, 2015, 1:58pm

I don’t know what GeForce GTX 7800 Ti is.
GTX 780 Ti is a Kepler GPU that has L1 disabled (for global loads)

[url]https://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#l1-cache[/url]

RobertE · March 25, 2015, 1:23pm

Thank you. I did mean 780 Ti, so your posting and the link about L1 being disabled have answered my question.

Topic		Replies	Views
GPU cache coherence problem CUDA Programming and Performance	7	5447	October 31, 2019
Fermi L1 Cache coherent? CUDA Programming and Performance	5	14911	May 20, 2010
changing L1 cache configuration using â€œcudaFuncSetCacheConfig" not working CUDA Programming and Performance	6	4485	February 3, 2012
Get a 100% L1 Cache Hit Rate CUDA Programming and Performance	3	3271	October 12, 2021
GPU coherence problem CUDA Programming and Performance	1	1162	October 31, 2019
Cache L1 and L2 Architecture Kepler CUDA Programming and Performance	2	3175	December 30, 2019
variable cache line width ? CUDA Programming and Performance	4	2006	January 13, 2015
Question on the L1 caching of the GK 110 CUDA Programming and Performance	17	7150	April 17, 2013
Switch off L1 cache CUDA Programming and Performance	2	3393	March 24, 2015
Is L1 global memory caching on GK110 possible?? CUDA Programming and Performance	0	603	September 6, 2015

Why has beginner's experiment to demonstrate L1 cache incoherency failed.

Related topics