block-strided access problem

Hi all,
I have the following problem. I tried to measure the bandwidth of strided int reads. With a stride of 64 ints (256 bytes) I get ~10 GB/s on a GF285, and according to nvprof every access is a 32-byte transaction:

======== Profiling result:
                  Invocations       Avg       Min       Max  Event Name
Device 0
        Kernel: kernel(int*, int*)
                          100         0         0         0  gld_128b
                          100         0         0         0  gld_64b
                          100    104960    104960    104960  gld_32b
======== Profiling result:
                  Invocations       Avg       Min       Max  Event Name
Device 0
        Kernel: kernel(int*, int*)
                          100         0         0         0  gst_128b
                          100         0         0         0  gst_64b
                          100    104960    104960    104960  gst_32b

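For reference, the plain stride-64 kernel is just the UTIL=1 case of the full code posted further down; a minimal sketch of what it boils down to (the kernel name here is mine):

/* stride-64 read/write: each thread touches one int, 256 bytes away from its neighbour's */
__global__ void kernel_stride64(int *in, int *out)
{
        int ind = (threadIdx.x + blockDim.x * blockIdx.x) * 64;
        out[ind] = in[ind];
}
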
I can't understand why I get 104960 transactions, since there should be one transaction per int: with 128 threads per block and 8192 blocks I expected 1048576 transactions. But let's continue…
I switched to another access pattern. To check that I compute it correctly I even printed it out, so you can see the pattern (here i is the tid = threadIdx.x+blockDim.x*blockIdx.x and the number is the actual element index):

i=  0   i=  1   i=  2   i=  3   i=  4   i=  5   i=  6   i=  7
i=512   i=513   i=514   i=515   i=516   i=517   i=518   i=519
i=1024  i=1025  i=1026  i=1027  i=1028  i=1029  i=1030  i=1031
i=1536  i=1537  i=1538  i=1539  i=1540  i=1541  i=1542  i=1543
i=2048  i=2049  i=2050  i=2051  i=2052  i=2053  i=2054  i=2055
i=2560  i=2561  i=2562  i=2563  i=2564  i=2565  i=2566  i=2567
...

For this pattern bus utilization should be 100%: each group of 8 threads reads 8 consecutive ints, i.e. 8 (items) * 4 bytes (sizeof(int)) = 32 bytes, exactly one fully used 32-byte segment per group. But I see the same ~10 GB/s. nvprof reports:

======== Profiling result:
                  Invocations       Avg       Min       Max  Event Name
Device 0
        Kernel: kernel(int*, int*)
                          100         0         0         0  gld_128b
                          100         0         0         0  gld_64b
                          100     13120     13120     13120  gld_32b
======== Profiling result:
                  Invocations       Avg       Min       Max  Event Name
Device 0
        Kernel: kernel(int*, int*)
                          100         0         0         0  gst_128b
                          100         0         0         0  gst_64b
                          100     13120     13120     13120  gst_32b

I'm still confused by the absolute numbers, but as you can see 104960/13120 = 8. So the number of memory transactions really did drop by a factor of 8, yet the runtime stays the same!
I don't know how to attach files here, so I'm pasting the code directly into the body of the message:
main.cu:

#include <math.h>
#include <stdio.h>
#include <cuda_runtime.h>
#define NTH 128                 // threads per block
#define N_BLOCK (8192)          // number of blocks
#define UTIL 8                  // ints accessed contiguously by each group of threads
#define SHIFT (64*UTIL)         // distance (in ints) between the starts of consecutive groups
// Each group of UTIL consecutive threads reads/writes UTIL consecutive ints;
// the groups are SHIFT ints apart.
__global__ void kernel(int *in, int *out)
{
        int ind;
        ind=(threadIdx.x+blockDim.x*blockIdx.x);
        ind=(ind/UTIL)*SHIFT+ind%UTIL;
        out[ind]=in[ind];
}
int main ( int argc, char *  argv [] )
{
        int i;
        for(i=0;i<128;i++) printf("i=%3d%c",(i/UTIL)*SHIFT+i%UTIL,(((i+1)%8)==0)?'\n':'\t');
        cudaEvent_t start,stop;
        float del_ti;
        int peri1;
        int *cu_in;
        int *cu_out;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        // SHIFT/UTIL = 64, so each buffer holds 64 ints per thread (256 MiB here)
        cudaMalloc((void **)&cu_in,(size_t)(NTH*N_BLOCK*SHIFT/UTIL)*sizeof(cu_in[0]));
        cudaMalloc((void **)&cu_out,(size_t)(NTH*N_BLOCK*SHIFT/UTIL)*sizeof(cu_out[0]));
        cudaThreadSynchronize();
        cudaEventRecord(start,0);
        for(peri1=0;peri1<100;peri1++)  kernel<<<N_BLOCK,NTH>>>(cu_in,cu_out);
        cudaThreadSynchronize();
        cudaEventRecord(stop,0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&del_ti,start,stop);
        printf("the copy cycle runtime: %.2f ms, bandwidth: %.2f Gb/s\n",del_ti/100.0,double(sizeof(int)*(2*NTH*N_BLOCK))/(del_ti/100.0)/double(1024*1024*1024)*1000.0);
        if(cu_in)cudaFree(cu_in);
        if(cu_out)cudaFree(cu_out);
        return 0;  
}

The access pattern is controlled by UTIL and SHIFT: UTIL is the size of a contiguous portion, SHIFT is the distance (in ints) between the starts of consecutive portions, so UTIL=8 with SHIFT=512 gives the pattern printed above. I use the following compile and test-run script (it should work if you have a standard CUDA install path):

#!/bin/bash
rm -f test
rm -f main.o
if [ "$1" == "cg" ]
then
        echo "-dlcm=cg version"
        cg="-Xptxas -dlcm=cg"
        name=code_cg
else
        name=code_ca
        echo "-dlcm=ca version"
fi
nvcc -w -Xptxas -v -arch=sm_13 -c -O2 $cg main.cu -o main.o
gcc main.o -L/usr/local/cuda-5.0/lib64 -lcusparse -lcudart -o test
cuobjdump --dump-sass test > $name.asm
./test
nvprof --events gld_32b,gld_64b,gld_128b ./test
nvprof --events gst_32b,gst_64b,gst_128b ./test
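
As a cross-check of the transaction counts, the number of distinct 32-byte segments the whole grid touches for a given UTIL can be computed on the host. Here is a small standalone sketch of that check (my own illustration, not part of the test; it reuses the index mapping and grid size from main.cu):

#include <stdio.h>
#include <stdlib.h>
#define NTH 128
#define N_BLOCK 8192
/* count distinct 32-byte segments touched, using the same index mapping as the kernel */
static long count_segments(int util)
{
        int shift=64*util;
        long n_elems=(long)NTH*N_BLOCK*shift/util;      /* buffer size in ints (same as main.cu) */
        long n_segs=n_elems*(long)sizeof(int)/32;       /* 32-byte segments covering the buffer  */
        char *touched=(char *)calloc(n_segs,1);
        long tid,distinct=0;
        for(tid=0;tid<(long)NTH*N_BLOCK;tid++)
        {
                long ind=(tid/util)*shift+tid%util;     /* mapping from the kernel */
                long seg=ind*(long)sizeof(int)/32;
                if(!touched[seg]){touched[seg]=1;distinct++;}
        }
        free(touched);
        return distinct;
}
int main(void)
{
        printf("UTIL=1: %ld segments\n",count_segments(1));  /* 1048576 */
        printf("UTIL=8: %ld segments\n",count_segments(8));  /* 131072  */
        return 0;
}

It gives 1048576 segments for UTIL=1 and 131072 for UTIL=8, i.e. exactly the factor of 8 seen in the profiles above (and, as shown below, exactly the uncached transaction counts reported on Kepler).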

I'd like to update my post with results for a GTX 690 (Kepler GK104).
The same test gave me 16 GB/s for reading blocks of 8 ints (UTIL=8, SHIFT=64*UTIL). nvprof says:

Kernel: kernel(int*, int*)
                          100         0         0         0  gld_inst_8bit
                          100         0         0         0  gld_inst_16bit
                          100   1048576   1048576   1048576  gld_inst_32bit
                          100         0         0         0  gld_inst_64bit
                          100         0         0         0  gld_inst_128bit
                          100         0         0         0  local_load_transactions
                          100         0         0         0  l1_local_load_hit
                          100         0         0         0  l1_local_load_miss
                          100    131072    131072    131072  uncached_global_load_transaction

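These Kepler counters come from the same kind of nvprof call as in the script above, something along the lines of:

nvprof --events gld_inst_8bit,gld_inst_16bit,gld_inst_32bit,gld_inst_64bit,gld_inst_128bit,local_load_transactions,l1_local_load_hit,l1_local_load_miss,uncached_global_load_transaction ./test
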
I'm happy with the 131072 uncached transactions, since 131072 * 8 (UTIL) = 8192 (blocks) * 128 (threads) = 1048576.
For UTIL=1 I get 5 GB/s and the same profile, except for 1048576 uncached global load transactions, which is also as expected (1048576 = 128 * 8192).
But again I can't understand why I see a 16/5 ≈ 3x speed-up and not 8x!
You might suggest launch overhead, but SHIFT=1, UTIL=1 (fully contiguous access) gives 130 GB/s rather than the ~190 GB/s peak, which is quite okay; the previous runs were much slower, so launch overhead was not the critical factor there.
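
For completeness, the back-of-envelope numbers behind that last point (my own check, not part of the test): each launch moves 2*NTH*N_BLOCK*sizeof(int) = 8 MiB, so the implied time per launch at the measured bandwidths is far above a typical kernel-launch overhead of a few microseconds:

#include <stdio.h>
int main(void)
{
        double bytes=2.0*128*8192*sizeof(int);            /* in + out, 8 MiB per launch */
        double gbps[]={5.0,10.0,16.0,130.0};              /* measured bandwidths, GB/s  */
        int i;
        for(i=0;i<4;i++)
                printf("%6.1f GB/s -> %.3f ms per launch\n",
                        gbps[i],bytes/(gbps[i]*1024.0*1024.0*1024.0)*1000.0);
        return 0;
}

Even the fastest case (130 GB/s) corresponds to about 0.06 ms per launch, and the slow runs sit in the 0.5-1.6 ms range, so launch overhead cannot explain the missing factor.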