I am trying to understand how a GPU manages its L2 cache. I ran an experiment to check when the cache becomes coherent with memory on the GPU.
Two kernels are launched on two GPUs (GPU0 and GPU1). Each kernel starts a single thread. The thread on GPU0 reads a value from memory and loops, checking whether the value has been modified; once it changes, the thread exits the loop. The thread on GPU1 writes a different value to the same element. To make sure the value really does get changed, that thread performs the write 10000000 times in a for loop. src is allocated on GPU0 with cudaMalloc, and peer access is enabled on both GPU0 and GPU1. The kernel code is shown below:
__global__ void SimpleKernel(float *src, int gpuid)
{
    float a = src[0];
    if(gpuid==0) {
        while(src[0]==a);
        printf("data is changed to %f\n", src[0]);
    } else {
        for(int i=0; i<10000000; i++) {
            src[0] = 1111111;
        }
        printf("modify remote data %f\n", src[0]);
    }
}
Although there is a data race, I expected GPU0 to see the changed value. But in my test, GPU0 never received the new value. It seems that GPU0 reads the data from its L2 cache and that, when the value is modified, the cache is not invalidated. When would the L2 cache be flushed? Is there any hardware coherence mechanism that guarantees correctness?
You don't seem to be considering the GPU L1 cache. The L1 cache is not necessarily coherent with the L2 cache. Perhaps more to the point, in the general case, the compiler is free to optimize loads into registers.
Try marking your pointer with volatile:
$ cat t1508.cu
#include <stdio.h>
__global__ void SimpleKernel(volatile float *src, int gpuid)
{
    float a = src[0];
    if(gpuid==0) {
        while(src[0]==a);
        printf("data is changed to %f\n", src[0]);
    } else {
        for(int i=0; i<1; i++) {
            src[0] = 1111111;
        }
        printf("modify remote data %f\n", src[0]);
    }
}
int main(){
    float *data;
    cudaSetDevice(0);
    cudaMalloc(&data, sizeof(float));
    cudaMemset(data, 0, sizeof(float));
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaSetDevice(0);
    SimpleKernel<<<1,1>>>(data, 0);
    cudaSetDevice(1);
    SimpleKernel<<<1,1>>>(data, 1);
    cudaDeviceSynchronize();
    cudaSetDevice(0);
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -arch=sm_35 -o t1508 t1508.cu
$ CUDA_VISIBLE_DEVICES="2,3" ./t1508
modify remote data 1111111.000000
data is changed to 1111111.000000
$
Without volatile, the above code hangs after the first line of printout, according to my testing on my test setup.
CUDA 10.1, CentOS7, dual K20 in peer-able relationship
Note that I'm not doing any proper CUDA error checking in this code. When you run this code on your machine, if you have trouble, my first recommendation would be to run it with cuda-memcheck.
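If you do want error checking, a minimal wrapper macro along these lines can be placed around each runtime call (the name checkCuda is just for illustration, it is not part of the code above):

#define checkCuda(call)                                                     \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            printf("CUDA error: %s at %s:%d\n",                             \
                   cudaGetErrorString(err_), __FILE__, __LINE__);           \
        }                                                                   \
    } while (0)

// usage, e.g.:
// checkCuda(cudaMalloc(&data, sizeof(float)));
// checkCuda(cudaDeviceEnablePeerAccess(1, 0));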
I'm not aware of any method to fix that with hardware, of any kind. I've never heard any definition of "cache coherence" that means that a change in memory (or L2) contents also affects processor register contents. I'm not aware of any processor that works that way. Register contents are only modified when that register is loaded via an instruction (leaving aside things like status and flag registers).
If your expectation is that "NVLink coherence" implies that the above situation will be resolved, that is incorrect. In short, NVLink coherence is not a replacement for volatile, and cannot be used as a replacement for volatile.
On GPUs, volatile doesn't mean that the L2 is bypassed. It means that the L1 is bypassed, and it also prevents the kind of compiler optimizations/hazards indicated above. The L2 is never bypassed, so the fact that the code works means that the L2 eventually received the update that was written by the external processor to the local memory.
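For what it's worth, that "L1 is bypassed, L2 is not" behavior corresponds to the ld.global.cg / st.global.cg cache operators, which can also be requested per access with the __ldcg()/__stcg() intrinsics on newer GPUs (check the programming guide for the minimum compute capability). A rough sketch of the polling loop using them follows; this is my illustration, not the kernel from this thread, and unlike volatile these are cache hints rather than a guarantee against compiler load elimination, so volatile remains the simpler tool for this pattern:

__global__ void PollWithCacheHints(float *src)   // illustrative only
{
    float a = __ldcg(src);                       // load cached at the L2 ("global") level, bypassing L1
    while (__ldcg(src) == a)                     // re-read at the L2 level each iteration
        ;                                        // spin until the peer GPU's store becomes visible
    printf("data is changed to %f\n", __ldcg(src));
}

// the writer side could similarly use __stcg(src, 1111111.0f);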
hello!!
The compiler is free to optimize reads and writes to global or shared memory (for example, by caching global reads into registers or L1 cache). These optimizations can be disabled using the volatile keyword. This is from the "cuda-c-programming-guide".
So does it mean the compiler will not cache X and Y in a private cache (the L1 cache) when I mark the variables (X, Y) with volatile?
If my understanding is wrong, where can I learn more about these compiler optimizations?
To the best of my knowledge, "volatile" in CUDA has the same semantics it has in C/C++: a data object so annotated may be modified by an agent outside the scope of the code in which it was declared. As a corollary, any use of it must result in an access to the memory location underlying that data object.
So the loop in the example of post #4 would have to be structured similarly to this:
LOAD R0, src[0]              // R1 already holds 'a' from the earlier load
S100:
COMPARE R0, R1
BRANCH_IF_NOT_EQUAL S200     // exit the loop once src[0] differs from 'a'
[...]
LOAD R0, src[0]              // retrieve contents of src[0] on every loop iteration
BRANCH S100
S200:
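For contrast (my illustration, not part of the original reply), without volatile the compiler is allowed to keep the loaded value in a register, so the polling loop can effectively degenerate into something like:

float a = src[0];          // src[0] is loaded from memory exactly once
float cached = a;          // the "re-read" is satisfied from the register copy
while (cached == a)        // the condition can therefore never become false
    ;                      // the thread spins forever; no further loads are issued

which matches the hang observed above when volatile is omitted.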
Classical examples of volatile data objects in a PC environment are memory-mapped hardware status registers and memory locations updated by an interrupt service routine.
Applying the "volatile" modifier to a data object is typically a necessary, but not sufficient, condition in situations where multiple agents (e.g. threads, processors) operate on the data object.
Specifically on processors with a cache hierarchy, an access to the memory location underlying a data object may or may not retrieve the latest data stored in physical memory. It depends on what coherency mechanisms are in place between levels of the memory hierarchy. Additional explicit cache-bypassing or cache-invalidating measures may need to be taken to retrieve the latest data.
Where data objects are shared by multiple agents, access to them may also need to be explicitly coordinated between them, often by means of a lock or mutex.
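As an illustration of that last point (a sketch I am adding here, not code from this thread), a minimal device-side spin lock can be built from atomicCAS plus memory fences; for coordination across peer GPUs, system-scope fencing (__threadfence_system) and system-scope atomics would be the relevant variants on hardware that supports them:

__device__ int lock = 0;                       // 0 = free, 1 = held (hypothetical name)

__device__ void acquire_lock()
{
    while (atomicCAS(&lock, 0, 1) != 0)        // spin until we atomically swap 0 -> 1
        ;
    __threadfence();                           // order our subsequent accesses after acquisition
}

__device__ void release_lock()
{
    __threadfence();                           // make our writes visible before releasing
    atomicExch(&lock, 0);                      // hand the lock back
}

// Note: this naive lock assumes at most one contending thread per warp
// (as in the one-thread kernels above); intra-warp spinning can deadlock
// on pre-Volta architectures.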
hello!!
If the variable X is marked volatile as in the following, does the thread get X directly from shared memory or global memory?
Or does the thread get X from the L2 cache?
Example: __device__ volatile int X = 1;
or: __shared__ volatile int X = 1;