Strange CUDA profiler results

Hello CUDA guys,
I have a simple question related to CUDA profiler (I use CUDA visual profiler in fact). I’m testing a simple SAXPY operation. I wanted to perform simple test of explicit 128-bit vs 32-bit memory read difference. I’m using two simple kernels:

  1. (coalesced 32-bit global memory reads)
    extern “C”
    global void SaxpyKern(float* target, float* vec1, float* vec2, float alpha) {
    int id = (blockIdx.x * blockDim.x) + threadIdx.x;
    float result;
    result = vec1[id] + alpha*vec2[id];
    target[id] = result;
    }

  2. (coalesced explicit 128-bit memory read)
    extern “C”
    global void SaxpyKern4(float4* target, float4* vec1, float4* vec2, float alpha) {
    int id = (blockIdx.x * blockDim.x) + threadIdx.x;
    float4 result;
    result.x = vec1[id].x + alphavec2[id].x;
    result.y = vec1[id].y + alpha
    vec2[id].y;
    result.z = vec1[id].z + alphavec2[id].z;
    result.w = vec1[id].w + alpha
    vec2[id].w;
    target[id] = result;
    }

In both cases, I was trying to perform Saxpy operation on vectors with 32 float elements, which means:
BLOCK_SIZE = 32 (warp size), GRID_SIZE=1 for SaxpyKern
BLOCK_SIZE = 8 (warp size), GRID_SIZE=1 for SaxpyKern4

CUDA visual profiler gives me strange numbers:
For SaxpyKern (32-bit):
gld_coherent = 4
gst_coherent = 8
For SaxpyKern4 (128-bit):
gld_coherent = 4
gst_coherent = 16

Those numbers seems strange to me:
32*2 = 64 float numbers should be read in case of Saxpy, even if they were coalesced for 128-bit read operations, there should be 64/4 = 16 global read operations, why CUDA profiler says only “4”? And fore global memory store, it says only “8”, which is not enough too. Measured numbers for SaxpyKern4 are also strange, why 4 read operations from global memory? I don’t understand that, could anyone help please?

And another question, I performed time measurements on much more lengthy vectors (4194304 elements), mentioned operations took:
745 microseconds - SaxpyKern (BLOCK_SIZE=256, GRID_SiZE = 16384)
1359 microseconds - SaxpyKern4 (BLOCK_SIZE=256, GRID_SiZE = 4096)

Why does 128-bit version SaxpyKern4 took twice as time as simple SaxpyKernel?

Any help is appreciated, thank you…

128-bit reads are slow. See my testing in this post
[url=“The Official NVIDIA Forums | NVIDIA”]http://forums.nvidia.com/index.php?showtop...41&#entry290441[/url]

The profiler doesn’t count the number of float reads, but the number of warp reads I think. Check the profiler documentation to verify. It might make more sense to track gld_coherent and gld_incoherent to check that there are no uncoalesced reads.

Thank for your informations, I will check mentioned post and see your testing. Is there any detailed profiler documentation you spoke about? The only info I have comes from documentation contained in CUDA toolkit. There is only one short txt file speaking about cuda profiler.