I have formulated a simple example. In this case the results are not the same as those I mentioned in my original post, but the theoretical number of cache misses still does not tally with the output from the profiler (nvprof).
The sample kernel performs vector addition.
Kernel code:

__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    int blockStartIndex  = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex   = threadStartIndex + (N * blockDim.x);
    int i;

    for (i = threadStartIndex; i < threadEndIndex; i += blockDim.x)
    {
        C[i] = A[i] + B[i];
    }
}
It seems that when I use cudaMemcpy to copy the arrays from host to device, all of the arrays end up resident in the L2 cache, so all of the reads become cache hits.
My next question: I also copy array C from host to device, yet all of the writes to C are counted as cache misses. Is this the expected behavior?
In which situations can I expect write cache hits?