Hi all!

Recently I looked through the presentation "GPU Performance Analysis and Optimization" from the GPU Technology Conference 2012 (http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf). There I found a way to read data in small portions (32-byte transactions instead of 128-byte ones) that should increase bus utilization for scattered accesses. That would really improve performance of my main algorithm for sparse matrix algebra.

I created a test (main.cu, attached to this post together with the compilation script compil). It reads a vector with stride 64 into a smaller vector. I compiled it with -Xptxas -dlcm=cg (the "cg" option of the compile script) and without this option (-dlcm=ca is the default). On a GeForce GTX 690 I got about 11 GB/s in both cases:
-dlcm=ca version
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6kernelPiS_' for 'sm_30'
ptxas info : Function properties for _Z6kernelPiS_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 336 bytes cmem[0]
the copy cycle runtime: 1.40 ms, bandwidth: 11.17 GB/s
-dlcm=cg version
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6kernelPiS_' for 'sm_30'
ptxas info : Function properties for _Z6kernelPiS_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 336 bytes cmem[0]
the copy cycle runtime: 1.40 ms, bandwidth: 11.18 GB/s
The output of cuobjdump --dump-sass differs in just the load instruction, as expected:
23c23
< /*0060*/ /*0x00601c8584000000*/ LD.E R0, [R6];
---
> /*0060*/ /*0x00601d8584000000*/ LD.E.CG R0, [R6];
47c47
With SHIFT=1 (see main.cu) the code gives ~133 GB/s, so the code itself seems to be working fine. But for the strided access I see no difference when enabling non-caching reads.
What could be the problem, folks? Maybe I misunderstand the presentation?
code:
main.cu:
#include <stdio.h>
#include <cuda_runtime.h>

#define NTH 128
#define N_BLOCK (8192*2)
#define SHIFT 1 /* 1 = contiguous baseline, 64 = strided test */

__global__ void kernel(int *in, int *out)
{
    int ind = threadIdx.x + blockDim.x * blockIdx.x;
    out[ind] = in[SHIFT * ind]; /* strided read, contiguous write */
}

int main(int argc, char *argv[])
{
    cudaEvent_t start, stop;
    float del_ti;
    int peri1;
    int *cu_in;
    int *cu_out;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaMalloc((void **)&cu_in, (size_t)(NTH * N_BLOCK * SHIFT) * sizeof(cu_in[0]));
    cudaMalloc((void **)&cu_out, (size_t)(NTH * N_BLOCK) * sizeof(cu_out[0]));

    cudaDeviceSynchronize();
    cudaEventRecord(start, 0);
    for (peri1 = 0; peri1 < 100; peri1++)
        kernel<<<N_BLOCK, NTH>>>(cu_in, cu_out);
    cudaDeviceSynchronize();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&del_ti, start, stop);

    /* count only useful traffic: one read + one write per output element */
    printf("the copy cycle runtime: %.2f ms, bandwidth: %.2f GB/s\n",
           del_ti / 100.0,
           double(sizeof(int) * (2 * NTH * N_BLOCK)) / (del_ti / 100.0)
               / double(1024 * 1024 * 1024) * 1000.0);

    if (cu_in) cudaFree(cu_in);
    if (cu_out) cudaFree(cu_out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
compil:
#!/bin/bash
rm -f test
rm -f main.o

if [ "$1" == "cg" ]; then
    echo "-dlcm=cg version"
    cg="-Xptxas -dlcm=cg"
    name=code_cg
else
    echo "-dlcm=ca version"
    name=code_ca
fi

nvcc -w -Xptxas -v -arch=sm_30 -c -O2 $cg main.cu -o main.o
gcc main.o -L/usr/local/cuda-5.0/lib64 -lcusparse -lcudart -o test
cuobjdump --dump-sass test > "$name".asm
./test