problem with non-caching read (-dlcm=cg)

Hi all!
Recently I looked through the presentation "GPU Performance Analysis and Optimization" from the GPU Technology Conference 2012 (http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf). There I found a way to read data in smaller portions, which should increase bus utilization, and it would really improve performance of the main algorithm in my sparse matrix algebra code. I created a test (main.cu attached to this post, plus the compile script compil). It reads a vector with stride 64 into a smaller vector. I compiled it with -dlcm=cg (the cg option of the compile script) and without this option (-dlcm=ca by default) on a GeForce GTX 690, and I get 11 Gb/s in both cases:

-dlcm=ca version
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPiS_' for 'sm_30'
ptxas info    : Function properties for _Z6kernelPiS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 336 bytes cmem[0]
the copy cycle runtime: 1.40 ms, bandwidth: 11.17 Gb/s
-dlcm=cg version
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPiS_' for 'sm_30'
ptxas info    : Function properties for _Z6kernelPiS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 336 bytes cmem[0]
the copy cycle runtime: 1.40 ms, bandwidth: 11.18 Gb/s

The SASS from cuobjdump --dump-sass differs in just one line, as expected:

23c23
<       /*0060*/     /*0x00601c8584000000*/     LD.E R0, [R6];
---
>       /*0060*/     /*0x00601d8584000000*/     LD.E.CG R0, [R6];
47c47

With SHIFT=1 (see main.cu) the code gives ~133 Gb/s, so the test itself seems to be working okay. But for the strided access I see no difference when enabling non-caching reads.

What could be the problem, folks? Maybe I have misunderstood the presentation?

code:
main.cu:

#include <math.h>
#include <stdio.h>
#include <cuda_runtime.h>
#define NTH 128
#define N_BLOCK (8192*2) 
#define SHIFT 1 // set to 64 for the strided-access test described above

// each thread copies one int, reading the input with stride SHIFT
__global__ void kernel(int *in, int *out)
{
	int ind;
	ind=threadIdx.x+blockDim.x*blockIdx.x;
	out[ind]=in[SHIFT*ind];
}


int main ( int argc, char *  argv [] )
{
	cudaEvent_t start,stop;
	float del_ti;
	int peri1;
	int *cu_in;
	int *cu_out;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaMalloc((void **)&cu_in,(size_t)(NTH*N_BLOCK*SHIFT)*sizeof(cu_in[0]));
	cudaMalloc((void **)&cu_out,(size_t)(NTH*N_BLOCK)*sizeof(cu_out[0]));
	cudaDeviceSynchronize();
	cudaEventRecord(start,0);
	for(peri1=0;peri1<100;peri1++)	kernel<<<N_BLOCK,NTH>>>(cu_in,cu_out);
	cudaDeviceSynchronize();
	cudaEventRecord(stop,0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&del_ti,start,stop);
	printf("the copy cycle runtime: %.2f ms, bandwidth: %.2f Gb/s\n",del_ti/100.0,double(sizeof(int)*(2*NTH*N_BLOCK))/(del_ti/100.0)/double(1024*1024*1024)*1000.0);
	if(cu_in)cudaFree(cu_in);
	if(cu_out)cudaFree(cu_out);
	return 0;  
}

compil:

#!/bin/bash
rm -f test
rm -f main.o
if [ "$1" == "cg" ]
then
	echo "-dlcm=cg version"
	cg="-Xptxas -dlcm=cg"
	name=code_cg
else
	name=code_ca
	echo "-dlcm=ca version"
fi
nvcc -w -Xptxas -v -arch=sm_30 -c -O2 $cg main.cu -o main.o
gcc main.o -L/usr/local/cuda-5.0/lib64 -lcusparse -lcudart -o test
cuobjdump --dump-sass test > $name.asm
./test

A vector copy is an entirely memory-bound task with no data re-use, in which caching effects would play a minor role, if any. I haven't checked the presentation, but I assume it talks about the granularity of access in cases where access is (mostly) contiguous across threads.

Contiguous access uses the wide memory interface of GPUs to full advantage. With strided access, only a portion of the wide interface is utilized in each clock cycle. I think the profiler provides a metric called something like "global load efficiency" that measures this effect.

I have not performed this experiment in a number of years, but I think you will find that when you measure the throughput for strides of 1, 2, …, N elements (where each element is a 32-bit int in this case), you will see a continuous drop from the maximum until a minimum is reached at some stride beyond which the throughput does not drop further. Your strided case would appear to be close to (or at) the minimum already.
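
For what it's worth, here is a minimal sketch of the stride sweep I have in mind (my own names and sizes, not taken from your main.cu, and untested):

#include <stdio.h>
#include <cuda_runtime.h>

#define NTH 128
#define N_BLOCK (8192*2)
#define MAX_STRIDE 64

// copy one int per thread, reading the input with a runtime stride
__global__ void stride_copy(const int *in, int *out, int stride)
{
	int ind = threadIdx.x + blockDim.x * blockIdx.x;
	out[ind] = in[(size_t)ind * stride];
}

int main(void)
{
	int *cu_in, *cu_out;
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaMalloc((void **)&cu_in,  (size_t)NTH * N_BLOCK * MAX_STRIDE * sizeof(int));
	cudaMalloc((void **)&cu_out, (size_t)NTH * N_BLOCK * sizeof(int));
	for (int stride = 1; stride <= MAX_STRIDE; stride *= 2) {
		cudaDeviceSynchronize();
		cudaEventRecord(start, 0);
		for (int rep = 0; rep < 100; rep++)
			stride_copy<<<N_BLOCK, NTH>>>(cu_in, cu_out, stride);
		cudaEventRecord(stop, 0);
		cudaEventSynchronize(stop);
		float ms;
		cudaEventElapsedTime(&ms, start, stop);
		// count only the useful bytes: one int read + one int written per thread
		double gib = (double)sizeof(int) * 2 * NTH * N_BLOCK * 100 / (1024.0 * 1024.0 * 1024.0);
		printf("stride %2d: %.2f GB/s\n", stride, gib / (ms / 1000.0));
	}
	cudaFree(cu_in);
	cudaFree(cu_out);
	return 0;
}

At MAX_STRIDE=64 the input buffer is 128*16384*64*4 bytes = 512 MB, so it should still fit in the 2 GB per GPU of a GTX 690.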

I mean something different, njuffa. The presentation says that instead of fetching a whole aligned, contiguous 128-byte line for each of my strided values, I can fetch smaller 32-byte portions, so the utilization should be about 4/32 for an int. See page 31 and compare pages 44 and 45; what you are discussing is on p. 44. Basically the latency is always there and I will always pull unused values from global memory, but with -dlcm=cg the consequences are expected to be much less dramatic.
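
To spell out my back-of-the-envelope numbers (assuming 128-byte lines for caching loads and 32-byte segments for non-caching loads, as on the slides):

caching load (-dlcm=ca):      4 useful bytes per 128 bytes fetched  ->  ~3% bus utilization
non-caching load (-dlcm=cg):  4 useful bytes per  32 bytes fetched  ->  12.5% bus utilization

So for the stride-64 case I expected up to ~4x (128/32) improvement from -dlcm=cg, which is why the identical 11 Gb/s surprised me.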

fadeyda,

On compute capability 3.0 and 3.5 devices global memory loads are not cached in L1; therefore, LD.CA == LD.CG on Kepler. If you use the profiling tools you will see the optimization that you have described above. Specifically, a global load to L1 will request the minimum number of 32B sectors needed to fulfill the load. On compute capability 2.x, LD.CA (the default) will request four 32B sectors for every cache-line miss.
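
If you want to experiment per load rather than recompiling the whole file with -Xptxas -dlcm=cg, here is a minimal, untested sketch of forcing a single non-caching load with inline PTX (the ld.global.cg form; assumes a 64-bit build):

// helper: non-caching 32-bit load via the PTX .cg qualifier
__device__ __forceinline__ int load_cg(const int *p)
{
	int v;
	// the "l" constraint assumes 64-bit device pointers (default for 64-bit host builds)
	asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(v) : "l"(p));
	return v;
}

__global__ void kernel(int *in, int *out)
{
	int ind = threadIdx.x + blockDim.x * blockIdx.x;
	out[ind] = load_cg(&in[SHIFT * ind]);
}

This is meant as a drop-in replacement for the kernel in the main.cu above (it relies on the SHIFT define from that file) and works independently of the -dlcm compile flag.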

Thanks Greg. Can I see this effect on a GTX 285 (compute capability 1.3), or do I need Fermi for that? The presentation said the trick works on Kepler, which is why I tried it only on Kepler. Unfortunately we have no proper Fermi card except a crappy mobile 520 with DDR3.

By the way, Greg, how do you explain that on the GTX 285 the code above gives 19 Gb/s!