problem with non-caching read (-dlcm=cg)

Hi all!
Recently I looked through the presentation "GPU Performance Analysis and Optimization" from the GPU Technology Conference 2012 (http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf). There I found a way to read data in smaller portions, which should increase bus utilization, and it would really improve performance of the main algorithm in my sparse matrix algebra code. I created a test (main.cu attached to this post, plus the compile script compil). It reads a vector with stride 64 into a smaller vector. I compiled it with -dlcm=cg (the cg option of the compile script) and without this option (-dlcm=ca by default) on a GeForce GTX 690, and I get 11 Gb/s in both cases:

-dlcm=ca version
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPiS_' for 'sm_30'
ptxas info    : Function properties for _Z6kernelPiS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 336 bytes cmem[0]
the copy cycle runtime: 1.40 ms, bandwidth: 11.17 Gb/s
-dlcm=cg version
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPiS_' for 'sm_30'
ptxas info    : Function properties for _Z6kernelPiS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 336 bytes cmem[0]
the copy cycle runtime: 1.40 ms, bandwidth: 11.18 Gb/s

The SASS from cuobjdump --dump-sass differs in just one line, as expected:

23c23
<       /*0060*/     /*0x00601c8584000000*/     LD.E R0, [R6];
---
>       /*0060*/     /*0x00601d8584000000*/     LD.E.CG R0, [R6];
47c47

With SHIFT=1 (see main.cu) the code gives ~133 Gb/s, so the test itself seems to be working okay. But for the strided access I see no difference when enabling non-caching reads.

What could be the problem, folks? Maybe I have misunderstood the presentation?

code:
main.cu:

#include <math.h>
#include <stdio.h>
#include <cuda_runtime.h>
#define NTH 128
#define N_BLOCK (8192*2) 
#define SHIFT 1 // set to 64 for the strided-access test described above

// each thread copies one int, reading the input with stride SHIFT
__global__ void kernel(int *in, int *out)
{
	int ind;
	ind=threadIdx.x+blockDim.x*blockIdx.x;
	out[ind]=in[SHIFT*ind];
}


int main ( int argc, char *  argv [] )
{
	cudaEvent_t start,stop;
	float del_ti;
	int peri1;
	int *cu_in;
	int *cu_out;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaMalloc((void **)&cu_in,(size_t)(NTH*N_BLOCK*SHIFT)*sizeof(cu_in[0]));
	cudaMalloc((void **)&cu_out,(size_t)(NTH*N_BLOCK)*sizeof(cu_out[0]));
	cudaDeviceSynchronize();
	cudaEventRecord(start,0);
	for(peri1=0;peri1<100;peri1++)	kernel<<<N_BLOCK,NTH>>>(cu_in,cu_out);
	cudaDeviceSynchronize();
	cudaEventRecord(stop,0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&del_ti,start,stop);
	printf("the copy cycle runtime: %.2f ms, bandwidth: %.2f Gb/s\n",del_ti/100.0,double(sizeof(int)*(2*NTH*N_BLOCK))/(del_ti/100.0)/double(1024*1024*1024)*1000.0);
	if(cu_in)cudaFree(cu_in);
	if(cu_out)cudaFree(cu_out);
	return 0;  
}

compil:

#!/bin/bash
rm -f test
rm -f main.o
if [ "$1" == "cg" ]
then
	echo "-dlcm=cg version"
	cg="-Xptxas -dlcm=cg"
	name=code_cg
else
	name=code_ca
	echo "-dlcm=ca version"
fi
nvcc -w -Xptxas -v -arch=sm_30 -c -O2 $cg main.cu -o main.o
gcc main.o -L/usr/local/cuda-5.0/lib64 -lcusparse -lcudart -o test
cuobjdump --dump-sass test > $name.asm
./test

A vector copy is an entirely memory-bound task with no data re-use, in which caching effects would play a minor role, if any. I haven't checked the presentation, but I assume it talks about the granularity of access in cases where access is (mostly) contiguous across threads.

Contiguous access uses the wide memory interface of GPUs to full advantage. With strided access, only a portion of the wide interface is utilized in each clock cycle. I think the profiler provides a metric called something like "global load efficiency" that measures this effect.

I have not performed this experiment in a number of years, but I think you will find that when you measure the throughput for strides of 1, 2, …, N elements (where each element is a 32-bit int in this case), you will see a continuous drop from the maximum until a minimum is reached at some stride beyond which the throughput does not drop further. Your strided case would appear to be close to (or at) the minimum already.
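
For what it's worth, here is a minimal sketch of the stride sweep I have in mind (my own names and sizes, not taken from your main.cu, and untested):

#include <stdio.h>
#include <cuda_runtime.h>

#define NTH 128
#define N_BLOCK (8192*2)
#define MAX_STRIDE 64

// copy one int per thread, reading the input with a runtime stride
__global__ void stride_copy(const int *in, int *out, int stride)
{
	int ind = threadIdx.x + blockDim.x * blockIdx.x;
	out[ind] = in[(size_t)ind * stride];
}

int main(void)
{
	int *cu_in, *cu_out;
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaMalloc((void **)&cu_in,  (size_t)NTH * N_BLOCK * MAX_STRIDE * sizeof(int));
	cudaMalloc((void **)&cu_out, (size_t)NTH * N_BLOCK * sizeof(int));
	for (int stride = 1; stride <= MAX_STRIDE; stride *= 2) {
		cudaDeviceSynchronize();
		cudaEventRecord(start, 0);
		for (int rep = 0; rep < 100; rep++)
			stride_copy<<<N_BLOCK, NTH>>>(cu_in, cu_out, stride);
		cudaEventRecord(stop, 0);
		cudaEventSynchronize(stop);
		float ms;
		cudaEventElapsedTime(&ms, start, stop);
		// count only the useful bytes: one int read + one int written per thread
		double gib = (double)sizeof(int) * 2 * NTH * N_BLOCK * 100 / (1024.0 * 1024.0 * 1024.0);
		printf("stride %2d: %.2f GB/s\n", stride, gib / (ms / 1000.0));
	}
	cudaFree(cu_in);
	cudaFree(cu_out);
	return 0;
}

At MAX_STRIDE=64 the input buffer is 128*16384*64*4 bytes = 512 MB, so it should still fit in the 2 GB per GPU of a GTX 690.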

I mean something different, njuffa. The presentation says that instead of fetching a whole aligned, contiguous 128-byte line for each of my strided values, I can fetch smaller 32-byte portions, so the utilization should be about 4/32 for an int. See page 31 and compare pages 44 and 45; what you are discussing is on p. 44. Basically the latency is always there and I will always pull unused values from global memory, but with -dlcm=cg the consequences are expected to be much less dramatic.
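
To spell out my back-of-the-envelope numbers (assuming 128-byte lines for caching loads and 32-byte segments for non-caching loads, as on the slides):

caching load (-dlcm=ca):      4 useful bytes per 128 bytes fetched  ->  ~3% bus utilization
non-caching load (-dlcm=cg):  4 useful bytes per  32 bytes fetched  ->  12.5% bus utilization

So for the stride-64 case I expected up to ~4x (128/32) improvement from -dlcm=cg, which is why the identical 11 Gb/s surprised me.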

fadeyda,

On compute capability 3.0 and 3.5 devices global memory loads are not cached in L1; therefore, LD.CA == LD.CG on Kepler. If you use the profiling tools you will see the optimization that you have described above. Specifically, a global load to L1 will request the minimum number of 32B sectors needed to fulfill the load. On compute capability 2.x, LD.CA (the default) will request four 32B sectors for every cache-line miss.
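
If you want to experiment per load rather than recompiling the whole file with -Xptxas -dlcm=cg, here is a minimal, untested sketch of forcing a single non-caching load with inline PTX (the ld.global.cg form; assumes a 64-bit build):

// helper: non-caching 32-bit load via the PTX .cg qualifier
__device__ __forceinline__ int load_cg(const int *p)
{
	int v;
	// the "l" constraint assumes 64-bit device pointers (default for 64-bit host builds)
	asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(v) : "l"(p));
	return v;
}

__global__ void kernel(int *in, int *out)
{
	int ind = threadIdx.x + blockDim.x * blockIdx.x;
	out[ind] = load_cg(&in[SHIFT * ind]);
}

This is meant as a drop-in replacement for the kernel in the main.cu above (it relies on the SHIFT define from that file) and works independently of the -dlcm compile flag.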

Thanks Greg. Can I see this effect on a GTX 285 (compute capability 1.3), or do I need Fermi for that? The presentation said the trick works on Kepler, which is why I tried it only on Kepler. Unfortunately we have no proper Fermi card except a crappy mobile 520 with DDR3.

By the way, Greg, how do you explain that on the GTX 285 the code above gives 19 Gb/s!