Kernel modification for math/memonly and profiler results Understanding values of dram_reads and gld

RoofTopG · April 6, 2011, 3:26pm

Hi everyone,

I’m looking at how the performance counters change if we vary amount of math/mem accesses in the kernel. However, dram_reads somehow doesn’t match and I’d like to know why…

The source code looks like this:

__global__ void test_kernel(int *gin, int aPitch, int *gout, int kernelId){

    int offset = kernelId * aPitch;

    int gIdx = aPitch * (blockDim.y * blockIdx.y + threadIdx.y) + blockDim.x * blockIdx.x + threadIdx.x;

int in0 = gin[offset + gIdx];

for(int i=0; i<WORK; i++) {

        in0 = (in0+i)*in0 + i;

    }

    gout[offset + gIdx] = in0;

}

Following the GTC10 presentation on analysis-driven optimization by Paulius, I modified the kernel into math-only and mem-only version as follows:

__global__ void test_kernel_memonly(int *gin, int aPitch, int *gout, int kernelId){

    int offset = kernelId * aPitch;

    int gIdx = aPitch * (blockDim.y * blockIdx.y + threadIdx.y) + blockDim.x * blockIdx.x + threadIdx.x;

    int in0 = gin[offset + gIdx];

    gout[offset + gIdx] = in0;

}

__global__ void test_kernel_mathonly(int *gin,int aPitch, int *gout, int kernelId, int flag){ //flag==0

    int in0 = 1; 

    for(int i=0; i<WORK; i++) {

        in0 = (in0+i)*in0 + i;

    }

    if (1 == flag*in0) {

        gout[offset + gIdx] = in0;

    }

}

The kernel is launched with a single TB of 256 threads, WORK=1000. Here’s what we get on Tesla C2050, CUDA4.0RC, and computeprof 3.2:

full vs mem-only:

block size (256, 1, 1 ), occupancy: 0.167 (both, registers=5 (full) → 4 (memonly),

gld_request=8 (both), l1_global_load_miss=8 (both), global_store_transaction=8 (both), dram_reads: ~4294(full kernel) → ~2204 (memonly), dram_writes=52 (both).

#instructions issued ~ 40160 (full) → 144 (memonly)

full vs math-only:

block size (256, 1, 1 ), occupancy: 0.167 (both, registers=5 (both),

gld_request=0 (both), l1_global_load_miss=0 (both), global_store_transaction=0 (both), dram_reads: ~4294(full kernel) → ~4552 (mathonly), dram_writes=52 (full) → 13(mathonly).

#instructions issued ~ 40160 (full) → 40064 (mathonly)

How is dram_reads related to gld_request and l1_global_load_miss counters? What’s the formula to estimate dram_reads?

How come there’s a 50% drop in dram_reads for memonly kernel? On the other hand, for mathonly the value is the same like for the original, while I’d expect 0, and gld counters are 0…there’s also a large difference between dram_reads and dram_writes values, while each thread is reading 1 data element and writing one data element. What could be the reason for this?

As a next step, I wanted to transform the original kernel into a bandwidth-bound kernel in order to make some further experiments.

I’ve been wondering if there’s a simple trick to vary the character of a kernel from bandwidth-bound to instruction-bound with a single parameter (or two)?

Looking forward to ideas! :)

alandge · April 11, 2011, 9:42am

RoofTopG,

Which GPU are you using? Is the GPU executing the above experiment connected to the display?

In case the GPU is connected to the display, then you may see incorrect values for the dram counters.

RoofTopG · April 11, 2011, 10:16am

Hi alandge, we’ve got a Tesla C2050 GPU, and the machine is accessed remotely (ssh -X) to start the visual profiler.

Do you have any idea what the right value of “dram_reads” should be?

paulius · April 11, 2011, 11:55pm

C2050 has an L2, so a kernel can potentially hit its data in L2, avoiding the trip to DRAM. So, if you launch the same kernel twice back-to-back, with the same arguments, it’s possible that the second kernel will have fewer DRAM reads (but you should see approximately the same number of L2 read requests).

Now, your DRAM signal counts are strangely high. I wrote a tiny test-case, and on my C2050 I’m getting the expected signal counts (CUDA 4.0 RC2 toolkit, driver, and profiler). Would you mind running the code and reporting what you get for the following counters:

gld/gst requests
L1 gmem hits and misses
L2 read and write requests and misses
DRAM reads and writes

Also, the profiler team have added a number of new features to the profiler since GTC 10. A more updated version of the presentation you’re referring was given at SC10, slides are here:
[url=“CUDA Zone - Library of Resources | NVIDIA Developer”]http://www.nvidia.com/object/sc10_cuda_tutorial.html[/url]
There have been further improvements added in the profiler since then. For example, counting the number the number of gmem access instructions by access size (8,16,32,64,128 bits). See the documentation for the 4.0 for more details.
test_signals.cu (2.15 KB)

RoofTopG · April 13, 2011, 2:25pm

Hi paulius, thanks for the test code. Here are the results with small dram read values for C2050, CUDA 4.0RC2, computeprof 4, monitor disconnected (thanks alandge!).

I couldn’t attach a .csv file, so here’s a picture:

Do we have one DRAM read for each L2 miss? The values I get look a bit different, but then maybe there’s some scaling factor.

It’d help a lot if you can explain dram reads calculation.

We’ve got 256 threads on SM requesting 4B each, that’s 8 gld_requests per SM (128B transaction size), l1_global_load_miss is also 8 (fine),

l2_read_requests = l1_global_load_miss * 4 (because of 32B transaction size) = 32. Now, I’d think that #dram_reads must be less than l2_read_requests,

but for example, in this test run we get 82 L2 read misses, and 97 dram reads for the full kernel.

paulius · April 13, 2011, 6:53pm

Yes, L2/DRAM signals count 32B segments, so a perfectly coalesced 4B read by a warp (a single gld request for 128 bytes) will amount to 4 L2 read requests. Your counts are mostly in line with what I see on my system, with the exception of DRAM reads for the math-only version - you get 347, I get 10. Do you consistently get a number that high?

I believe the profiler counts for L2 read/write misses include some bookkeeping accesses, that’s why they’re higher than requests (and that’s why you also see non-0s for the math-only version). Do you have ECC turned on? DRAM counts also include accesses to do ECC calculations as well.

Here are the counters I see, by the way (ECC is on):

RoofTopG · April 20, 2011, 1:54pm

No, my numbers vary a lot… Here’s a new set of results with ECC off:

Oh, that’s why you have 8 L2 misses in math only even if you had 0 read requests? Is there some rule of a thumb how much to factor in for these overheads?

I’ve turned the ECC off, but the L2 misses and dram reads still don’t look correlated at all.

For example, have a look at the mem only version:

11 L2 misses, 91 dram reads

39 L2 misses, 3 dram reads,

96 L2 misses, 64 dram reads.

Is this normal? Shouldn’t there be 1 DRAM read per 1 L2 miss (now that ECC is off)? Well, it’d help a lot if you can explain how to calculate # DRAM reads on paper first :)

Thanks!!!

Topic		Replies	Views
L2 read/write misses greater than requests CUDA Programming and Performance	11	3033	May 11, 2011
Squeasing max d2d memory bandwidth (GTX 480) CUDA Programming and Performance	15	6995	November 2, 2010
Profiling my code I need some help to understand the output of the visual profiler CUDA Programming and Performance	5	1862	February 3, 2012
Yet another performance question CUDA Programming and Performance	16	4229	February 12, 2009
From low end GPUs to high end GPUs Moving from 9600GT to Tesla T10 provides no improvement, why ? CUDA Programming and Performance	24	17300	June 8, 2010
Measuring DRAM throughput CUDA Programming and Performance	12	8624	October 11, 2019
Difference between L2 read/write transactions and L2_L1 read/write transactions ? CUDA Programming and Performance	3	1446	August 28, 2019
Device to L2 memory less than L2 to L1 CUDA Programming and Performance gpu	19	1080	October 16, 2023
Bug in basic arithmetic on M2050 CUDA Programming and Performance	9	10609	November 29, 2010
VisualProfiler ver 2.2 CUDA Programming and Performance	13	4842	April 10, 2009

Kernel modification for math/memonly and profiler results Understanding values of dram_reads and gld

Related topics