What does gld_transaction mean in nvprof metrics?

jiazhe · July 22, 2016, 6:53am

Hi, everyone.

I wanted to test the texture cache line size on a Pascal GPU, so I wrote some simple code that reads a 1D array into registers and writes it to another array.

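// Each thread copies one float; consecutive threads are `stride` elements apart.
__global__ void cacheLineTest(const float* src_, float* des_, unsigned int stride){
  // Note: only threadIdx.x is scaled by stride, so blocks overlap when gridDim.x > 1.
  int tid = blockIdx.x*blockDim.x+threadIdx.x*stride;
  des_[tid] = src_[tid];
}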

I got the following results:
External Media

I have three questions:

1. From these results, can I conclude that the 1D texture cache line on Pascal is 32 bytes?
2. What does gld_transactions really mean? Why does gld_transactions differ from l2_tex_read_transactions when the stride is 1 and 3?
3. I use neither __restrict__ nor __ldg(), so why do my load requests still go through the texture + L2 path?
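
For reference, the access pattern of one warp can be reasoned about with a small host-side sketch. It assumes 4-byte floats, a base address aligned to the segment size, and a segment size passed in by the caller (32 bytes is only a guess to test against the measurements). The helper name countSegments is hypothetical, and this is not a statement of how nvprof computes gld_transactions; it only counts how many distinct aligned segments the threads touch.

```cpp
#include <cassert>
#include <set>

// Hypothetical helper: counts the distinct aligned segments of `segmentBytes`
// bytes touched when `threads` threads each read one float at element index
// t * stride. Assumes the array base address is aligned to `segmentBytes`.
unsigned int countSegments(unsigned int threads, unsigned int stride,
                           unsigned int segmentBytes) {
    assert(segmentBytes > 0);
    std::set<unsigned long long> segments;
    for (unsigned int t = 0; t < threads; ++t) {
        // Byte offset of the float read by thread t.
        unsigned long long byteOffset =
            static_cast<unsigned long long>(t) * stride * sizeof(float);
        // Index of the aligned segment that contains this offset.
        segments.insert(byteOffset / segmentBytes);
    }
    return static_cast<unsigned int>(segments.size());
}
```

Under these assumptions, countSegments(32, 1, 32) gives 4, countSegments(32, 3, 32) gives 12, and countSegments(32, 9, 32) gives 32. Comparing such counts with the measured metrics for several segment sizes is one way to check a guessed line size.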

Thanks

jiazhe · July 25, 2016, 6:15am

Could anyone from NVIDIA explain how gld_transactions is calculated?

Thanks

harryz · July 25, 2016, 10:01am

Hello jiazhe,

Sorry for the delay. I'm also not quite sure about the details of each metric. Could you tell me which CUDA version you are using, so that I can file a bug for the developers to answer you?

Best Regards
Harryz

jiazhe · July 26, 2016, 4:27am

Hi Harryz!
Thanks for your reply! I am using the newest CUDA 8.0, thanks!

Best,
Zhe

harryz · July 28, 2016, 9:28am

Hello jiazhe,

Could you attach your full source code? As you said, src_ should be a sampler1D and des_ should be global memory, right?

Best Regards

jiazhe · July 31, 2016, 2:37am

Hi harryz_,
Sorry for the late reply. The code is pretty simple.
It just copies elements from one array in global memory to another array in global memory.
If something below is wrong, please correct me.

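#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>

// cacheLineTest is the kernel from the first post and must be defined above main().
int main(){
  std::cout<<"*********************Cache line Test*********************"<<std::endl;

  int blockSize = 32;  // one warp
  int gridSize = 1;
  int stride = 9;      // distance in elements between consecutive threads

  unsigned int size = blockSize*gridSize*stride;
  float * A_cpu = (float*)malloc(size*sizeof(float));
  float * B_cpu = (float*)malloc(size*sizeof(float));
  float * A_gpu,*B_gpu;

  cudaMalloc(&A_gpu,size*sizeof(float));
  cudaMalloc(&B_gpu,size*sizeof(float));

  // Kernel argument order is (src_, des_): reads A_gpu, writes B_gpu.
  cacheLineTest<<<gridSize,blockSize,0,0>>> (A_gpu,B_gpu,stride);
  cudaDeviceSynchronize();  // wait for the kernel to finish before cleanup

  free(A_cpu);
  free(B_cpu);
  cudaFree(A_gpu);
  cudaFree(B_gpu);
  return 0;
}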

To run the code, I use the command below.

nvprof -m gld_transactions,dram_read_transactions,l2_read_transactions,l2_tex_read_transactions,local_load_transactions ./cacheLineTest

Thank you.

harryz · August 1, 2016, 3:20am

Hello jiazhe,

This looks like the same issue described in https://devtalk.nvidia.com/default/topic/941880/visual-profiler/global-load-transaction-count-when-in-coalesced-memory-access/ . I have already filed a bug for the developers to check, and your request will be added to the existing bug.

Best Regards

jiazhe · August 3, 2016, 3:11am

Hi harryz_,
Could you please let us know if you have any update from NVIDIA?
Thanks.

harryz · August 4, 2016, 3:08am

Sure thing.