Why has beginner's experiment to demonstrate L1 cache incoherency failed.

L1 Cache coherency of CUDA blocks

Device global memory is cached in the same L1 cache for all threads in a block, but we are warned that the L1 caches for different blocks can become incoherent.

Putting this together with the fact that data is transferred in 128 Byte “chunks”, I wondered what would
happen if two blocks wrote to different parts of the same chunk of global memory. I expected that
the two L1 caches would become incoherent and the results eventually written to RAM would be indeterminate.

However, my first experiment shows no evidence of this behaviour:

////////////////////// EXPERIMENTS KERNELS START ///////////////////////////////

//assume dimBlock(WARPSIZE, 4), dimGrid(32)
__global__ void Experiment_On_4KBytes_Kernel_A(uint8_t * __restrict__ pBytes)
{
	unsigned offset = 128 * threadIdx.x + 4 * blockIdx.x + threadIdx.y;
	*(pBytes + offset) = offset % 43; //prime number hash
}

//assume 1 thread
__global__ void Experiment_On_4KBytes_Kernel_B(uint8_t * __restrict__ pBytes)
{
	unsigned errs = 0;

	for (unsigned chunk = 0; chunk < 32; ++chunk)
		for (unsigned block = 0; block < 32; ++block)
			for (unsigned word = 0; word < 4; ++word)
			{
				unsigned offset = 128 * chunk + 4 * block + word;
				errs += ( (unsigned) pBytes[offset] != offset % 43);
			}

	printf("EXPERIMENT RESULT: %u\n",errs); //the result was 0, even though 32 different blocks wrote in each chunk of 128 bytes! 
}

////////////////////// EXPERIMENTS KERNELS END  ///////////////////////////////

.
.
.
////////////////////// EXPERIMENTS START ///////////////////////////////
	{
		dim3 dimBlock(WARPSIZE, 4), dimGrid(32);

		Experiment_On_4KBytes_Kernel_A<<<dimGrid, dimBlock, 0, stream>>>(pBytestream);

		Experiment_On_4KBytes_Kernel_B<<<1, 1, 0, stream>>>(pBytestream);
	}
////////////////////// EXPERIMENTS END  ///////////////////////////////

This suggests to me that either my experiment is not valid or there is a feature in Invidia GPUs to prevent
WAW hazards, perhaps using the following idea I found on the internet:

  • Write-back big trick: keep track of whether other caches also contain a cached line. If not, a cache has an “exclusive” on the line, and can read and write the line as if it were the only CPU.
  • reference: inst.eecs.berkeley.edu/~cs194-6/fa08/ppt/lec10.ppt

    Any ideas on the experiment or on the GPU operation would be very helpful.

    What sort of GPU are you running on?

    The gpu is a GeForce GTX 7800 Ti.

    I don’t know what GeForce GTX 7800 Ti is.
    GTX 780 Ti is a Kepler GPU that has L1 disabled (for global loads)

    https://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#l1-cache

    Thank you. I did mean 780 Ti, so your posting and the link about L1 being disabled have answered my question.