Hi, all
I have the following problem: I tried to measure bandwidth for a strided int read. I used stride 64 (256 bytes) and got ~10 GB/s on a GF285. According to nvprof all accesses are 32-byte:
======== Profiling result:
Invocations Avg Min Max Event Name
Device 0
Kernel: kernel(int*, int*)
100 0 0 0 gld_128b
100 0 0 0 gld_64b
100 104960 104960 104960 gld_32b
======== Profiling result:
Invocations Avg Min Max Event Name
Device 0
Kernel: kernel(int*, int*)
100 0 0 0 gst_128b
100 0 0 0 gst_64b
100 104960 104960 104960 gst_32b
Can't understand why I get 104960 transactions, since there should be 1 transaction per int. With 128 threads per block and 8192 blocks I expected 1,048,576 transactions. But let's continue…
I switched to another access pattern. To verify that I compute it correctly, I even printed it out, so you can see the pattern (imagine that i is a tid = threadIdx.x + blockDim.x*blockIdx.x and the number is the actual address):
i= 0 i= 1 i= 2 i= 3 i= 4 i= 5 i= 6 i= 7
i=512 i=513 i=514 i=515 i=516 i=517 i=518 i=519
i=1024 i=1025 i=1026 i=1027 i=1028 i=1029 i=1030 i=1031
i=1536 i=1537 i=1538 i=1539 i=1540 i=1541 i=1542 i=1543
i=2048 i=2049 i=2050 i=2051 i=2052 i=2053 i=2054 i=2055
i=2560 i=2561 i=2562 i=2563 i=2564 i=2565 i=2566 i=2567
...
For this exact pattern bus utilization should be 100% (8 items * 4 bytes (sizeof(int)) = 32 bytes), but I see the same ~10 GB/s. nvprof reports:
======== Profiling result:
Invocations Avg Min Max Event Name
Device 0
Kernel: kernel(int*, int*)
100 0 0 0 gld_128b
100 0 0 0 gld_64b
100 13120 13120 13120 gld_32b
======== Profiling result:
Invocations Avg Min Max Event Name
Device 0
Kernel: kernel(int*, int*)
100 0 0 0 gst_128b
100 0 0 0 gst_64b
100 13120 13120 13120 gst_32b
I'm still confused by the absolute value, but as you can see, 104960/13120 = 8. So the number of memory transactions really did decrease by a factor of 8, yet the runtime stays the same!!!
I don't know how to attach the code, so I'm pasting it directly into the body of the message:
main.cu:
#include <math.h>
#include <stdio.h>
#include <cuda_runtime.h>

#define NTH     128          /* threads per block */
#define N_BLOCK (8192)       /* blocks in the grid */
#define UTIL    8            /* ints in one contiguous portion */
#define SHIFT   (64*UTIL)    /* distance (in ints) between portions */

__global__ void kernel(int *in, int *out)
{
    int ind;
    ind = (threadIdx.x + blockDim.x * blockIdx.x);
    ind = (ind / UTIL) * SHIFT + ind % UTIL;   /* the strided access pattern */
    out[ind] = in[ind];
}

int main(int argc, char *argv[])
{
    int i;
    /* print the first 128 indices so the pattern can be checked by eye */
    for (i = 0; i < 128; i++)
        printf("i=%3d%c", (i / UTIL) * SHIFT + i % UTIL, (((i + 1) % 8) == 0) ? '\n' : '\t');

    cudaEvent_t start, stop;
    float del_ti;
    int peri1;
    int *cu_in;
    int *cu_out;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaMalloc((void **)&cu_in, (size_t)(NTH * N_BLOCK * SHIFT / UTIL) * sizeof(cu_in[0]));
    cudaMalloc((void **)&cu_out, (size_t)(NTH * N_BLOCK * SHIFT / UTIL) * sizeof(cu_out[0]));

    cudaThreadSynchronize();
    cudaEventRecord(start, 0);
    for (peri1 = 0; peri1 < 100; peri1++)
        kernel<<<N_BLOCK, NTH>>>(cu_in, cu_out);
    cudaThreadSynchronize();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&del_ti, start, stop);

    /* each launch reads and writes NTH*N_BLOCK ints */
    printf("the copy cycle runtime: %.2f ms, bandwidth: %.2f GB/s\n",
           del_ti / 100.0,
           double(sizeof(int) * (2 * NTH * N_BLOCK)) / (del_ti / 100.0)
               / double(1024 * 1024 * 1024) * 1000.0);

    if (cu_in)  cudaFree(cu_in);
    if (cu_out) cudaFree(cu_out);
    return 0;
}
The access pattern is controlled by UTIL and SHIFT: UTIL is the size of one contiguous portion (in ints), SHIFT is the distance between portions. I use the following compile-and-test-run script (it should work if you have the standard CUDA install path):
#!/bin/bash
rm -f test
rm -f main.o
if [ "$1" == "cg" ]
then
echo "-dlcm=cg version"
cg="-Xptxas -dlcm=cg"
name=code_cg
else
name=code_ca
echo "-dlcm=ca version"
fi
nvcc -w -Xptxas -v -arch=sm_13 -c -O2 $cg main.cu -o main.o
gcc main.o -L/usr/local/cuda-5.0/lib64 -lcusparse -lcudart -o test
cuobjdump --dump-sass test > $name.asm
./test
nvprof --events gld_32b,gld_64b,gld_128b ./test
nvprof --events gst_32b,gst_64b,gst_128b ./test