Kernel modification for math/memonly and profiler results Understanding values of dram_reads and gld

Hi everyone,

I’m looking at how the performance counters change if we vary amount of math/mem accesses in the kernel. However, dram_reads somehow doesn’t match and I’d like to know why…

The source code looks like this:

__global__ void test_kernel(int *gin, int aPitch, int *gout, int kernelId){

    int offset = kernelId * aPitch;

    int gIdx = aPitch * (blockDim.y * blockIdx.y + threadIdx.y) + blockDim.x * blockIdx.x + threadIdx.x;

int in0 = gin[offset + gIdx];

for(int i=0; i<WORK; i++) {

        in0 = (in0+i)*in0 + i;

    }

    gout[offset + gIdx] = in0;

}

Following the GTC10 presentation on analysis-driven optimization by Paulius, I modified the kernel into math-only and mem-only version as follows:

__global__ void test_kernel_memonly(int *gin, int aPitch, int *gout, int kernelId){

    int offset = kernelId * aPitch;

    int gIdx = aPitch * (blockDim.y * blockIdx.y + threadIdx.y) + blockDim.x * blockIdx.x + threadIdx.x;

    int in0 = gin[offset + gIdx];

    gout[offset + gIdx] = in0;

}

__global__ void test_kernel_mathonly(int *gin,int aPitch, int *gout, int kernelId, int flag){ //flag==0

    int in0 = 1; 

    for(int i=0; i<WORK; i++) {

        in0 = (in0+i)*in0 + i;

    }

    if (1 == flag*in0) {

        gout[offset + gIdx] = in0;

    }

}

The kernel is launched with a single TB of 256 threads, WORK=1000. Here’s what we get on Tesla C2050, CUDA4.0RC, and computeprof 3.2:

full vs mem-only:

block size (256, 1, 1 ), occupancy: 0.167 (both, registers=5 (full) -> 4 (memonly),

gld_request=8 (both), l1_global_load_miss=8 (both), global_store_transaction=8 (both), dram_reads: ~4294(full kernel) -> ~2204 (memonly), dram_writes=52 (both).

#instructions issued ~ 40160 (full) -> 144 (memonly)

full vs math-only:

block size (256, 1, 1 ), occupancy: 0.167 (both, registers=5 (both),

gld_request=0 (both), l1_global_load_miss=0 (both), global_store_transaction=0 (both), dram_reads: ~4294(full kernel) -> ~4552 (mathonly), dram_writes=52 (full) -> 13(mathonly).

#instructions issued ~ 40160 (full) -> 40064 (mathonly)

How is dram_reads related to gld_request and l1_global_load_miss counters? What’s the formula to estimate dram_reads?

How come there’s a 50% drop in dram_reads for memonly kernel? On the other hand, for mathonly the value is the same like for the original, while I’d expect 0, and gld counters are 0…there’s also a large difference between dram_reads and dram_writes values, while each thread is reading 1 data element and writing one data element. What could be the reason for this?

As a next step, I wanted to transform the original kernel into a bandwidth-bound kernel in order to make some further experiments.

I’ve been wondering if there’s a simple trick to vary the character of a kernel from bandwidth-bound to instruction-bound with a single parameter (or two)?

Looking forward to ideas! :)

RoofTopG,

Which GPU are you using? Is the GPU executing the above experiment connected to the display?

In case the GPU is connected to the display, then you may see incorrect values for the dram counters.

Hi alandge, we’ve got a Tesla C2050 GPU, and the machine is accessed remotely (ssh -X) to start the visual profiler.

Do you have any idea what the right value of “dram_reads” should be?

C2050 has an L2, so a kernel can potentially hit its data in L2, avoiding the trip to DRAM. So, if you launch the same kernel twice back-to-back, with the same arguments, it’s possible that the second kernel will have fewer DRAM reads (but you should see approximately the same number of L2 read requests).

Now, your DRAM signal counts are strangely high. I wrote a tiny test-case, and on my C2050 I’m getting the expected signal counts (CUDA 4.0 RC2 toolkit, driver, and profiler). Would you mind running the code and reporting what you get for the following counters:

  • gld/gst requests
  • L1 gmem hits and misses
  • L2 read and write requests and misses
  • DRAM reads and writes

Also, the profiler team have added a number of new features to the profiler since GTC 10. A more updated version of the presentation you’re referring was given at SC10, slides are here:
http://www.nvidia.com/object/sc10_cuda_tutorial.html
There have been further improvements added in the profiler since then. For example, counting the number the number of gmem access instructions by access size (8,16,32,64,128 bits). See the documentation for the 4.0 for more details.
test_signals.cu (2.15 KB)

Hi paulius, thanks for the test code. Here are the results with small dram read values for C2050, CUDA 4.0RC2, computeprof 4, monitor disconnected (thanks alandge!).

I couldn’t attach a .csv file, so here’s a picture:

Do we have one DRAM read for each L2 miss? The values I get look a bit different, but then maybe there’s some scaling factor.

It’d help a lot if you can explain dram reads calculation.

We’ve got 256 threads on SM requesting 4B each, that’s 8 gld_requests per SM (128B transaction size), l1_global_load_miss is also 8 (fine),

l2_read_requests = l1_global_load_miss * 4 (because of 32B transaction size) = 32. Now, I’d think that #dram_reads must be less than l2_read_requests,

but for example, in this test run we get 82 L2 read misses, and 97 dram reads for the full kernel.

Yes, L2/DRAM signals count 32B segments, so a perfectly coalesced 4B read by a warp (a single gld request for 128 bytes) will amount to 4 L2 read requests. Your counts are mostly in line with what I see on my system, with the exception of DRAM reads for the math-only version - you get 347, I get 10. Do you consistently get a number that high?

I believe the profiler counts for L2 read/write misses include some bookkeeping accesses, that’s why they’re higher than requests (and that’s why you also see non-0s for the math-only version). Do you have ECC turned on? DRAM counts also include accesses to do ECC calculations as well.

Here are the counters I see, by the way (ECC is on):

No, my numbers vary a lot… Here’s a new set of results with ECC off:

Oh, that’s why you have 8 L2 misses in math only even if you had 0 read requests? Is there some rule of a thumb how much to factor in for these overheads?

I’ve turned the ECC off, but the L2 misses and dram reads still don’t look correlated at all.

For example, have a look at the mem only version:

11 L2 misses, 91 dram reads

39 L2 misses, 3 dram reads,

96 L2 misses, 64 dram reads.

Is this normal? Shouldn’t there be 1 DRAM read per 1 L2 miss (now that ECC is off)? Well, it’d help a lot if you can explain how to calculate # DRAM reads on paper first :)

Thanks!!!