Arithmetic Intensity & Compute-to-Global-Memory-Access Ratio: How to compute CGMA?

I’m looking for a way to express the arithmetic intensity of my kernels using some standard measure. Does it make sense to use the Compute-to-Global-Memory-Access ratio, and what is the right way to compute the CGMA?

Here’s a simple example:

__global__ void testKernelInt(int *in, int *out)
{
    int tmp;
    int adr;

    adr = blockDim.x * threadIdx.y + threadIdx.x;    // auxiliary arithmetic ops (address calculation)

    tmp = in[adr];                                   // global memory load

    // dummy multiply-add
    out[adr] = tmp * threadIdx.x + threadIdx.y;      // useful arithmetic ops, global memory store
}

What is the CGMA of this kernel?

The number of memory accesses is clear. I’m curious whether the idea is to take the number of arithmetic ops in the CUDA source code, or the actual number of arithmetic instructions in the PTX?

For example, integer mul and add are two instructions in PTX, whereas floating-point mul and add are fused into one.
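(For reference, the PTX a kernel compiles to can be dumped with nvcc -ptx and the instructions counted there.)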

In addition, do you count only ‘useful’ arithmetic ops, or all arithmetic ops (e.g. the address calculation for a global memory look-up)?

Looking forward to your comments! Tnx, Ana

CGMA is usually based on the hardware capabilities, not on the count of operations that you see in the source code. In the case of a GPU, you should count a fused multiply-add as one instruction, since the GPU supports it. You also need to take the address calculation into account to obtain a ‘true’ CGMA.
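To make that concrete, here is the kernel from the question annotated with one possible counting. This is only my reading of what the compiler is likely to emit; the actual PTX/SASS may differ, so treat the numbers as a sketch of the conventions rather than ground truth.

__global__ void testKernelInt(int *in, int *out)
{
    // address calculation: 1 integer mul + 1 integer add,
    // or a single mad.lo if the compiler fuses them
    int adr = blockDim.x * threadIdx.y + threadIdx.x;

    int tmp = in[adr];                               // global memory access #1 (load)

    // dummy multiply-add: again 1 integer mul + 1 integer add,
    // possibly fused into one mad.lo
    out[adr] = tmp * threadIdx.x + threadIdx.y;      // global memory access #2 (store)
}

// Counting every op, unfused:       4 arithmetic ops / 2 accesses -> CGMA = 2.0
// Counting fused MADs:              2 instructions   / 2 accesses -> CGMA = 1.0
// Counting only the 'useful' MAD:   1 instruction    / 2 accesses -> CGMA = 0.5

Which of the three numbers you quote matters less than stating clearly which convention you used.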

It makes sense to me as well! I’d count the actual arithmetic instructions in the PTX code, as I think that is the number of instructions relevant for latency hiding.

However, the book ‘Programming Massively Parallel Processors’ defines the CGMA as:

“…compute to global memory access (CGMA) ratio, defined as the number of floating-point calculations performed for each access to the global memory within a region of a CUDA program.”

and for the matrix multiply kernel, they say:

"The most important part of the kernel in terms of execution time is the for loop that performs dot product calculations. In every iteration of this loop, two global memory accesses are

performed for one floating-point multiplication and one floating-point addition. Thus, the ratio of floating-point calculation to the global memory access operation is 1 to 1, or 1.0."
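For context, this is roughly the kernel the book is analyzing (my reconstruction from memory, so the identifiers and details are mine):

// naive matrix multiply, one output element per thread
__global__ void matrixMulKernel(float *M, float *N, float *P, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < width; ++k) {
        // per iteration: 2 global loads, 1 fp multiply, 1 fp add
        // -> 2 floating-point ops / 2 accesses = CGMA of 1.0
        sum += M[row * width + k] * N[k * width + col];
    }
    P[row * width + col] = sum;
}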

I’m really curious about a few questions:

  • why is the dot product counted as “one floating-point multiplication and one floating-point addition” when the compiler replaces it with a fused floating-point MAD?

  • what to do with integer arithmetic?

  • how do you define/select the region of a CUDA program; is there a reason not to take the auxiliary arithmetic ops (address calculation) into account? (except for making it easier ;)

Of course, in this example there is a for loop in the kernel body, so for a large number of iterations the address arithmetic does not contribute much to the total count… but what if you had a rather small kernel (without a for loop), or had to calculate a new address in every loop iteration?
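To illustrate that last point, here is the same inner loop written two ways, reusing the names from the reconstructed matrix multiply kernel above. Compilers often perform this strength reduction on their own, which is exactly why I’d rather count instructions in the PTX than in the source:

// (a) address recomputed every iteration:
//     roughly one mul and one add of address math per load
for (int k = 0; k < width; ++k)
    sum += M[row * width + k] * N[k * width + col];

// (b) address math hoisted into pointer increments:
//     roughly one add of address math per load
float *m = M + row * width;    // walks along row 'row' of M
float *n = N + col;            // walks down column 'col' of N
for (int k = 0; k < width; ++k, ++m, n += width)
    sum += *m * *n;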

It’d be great if somebody from NVIDIA could help shed some light on this CGMA calculation ;)

I don’t know if it is exactly what you are looking for, but this presentation gives lots of hints and tips for checking whether you can do something to improve performance, including calculating instructions per memory load:

http://www.nvidia.com/content/PDF/sc_2010/CUDA_Tutorial/SC10_Analysis_Driven_Optimization.pdf
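If I remember the gist of those slides correctly, the comparison goes along these lines (my paraphrase, not a formula from the slides):

    kernel ratio:    instructions issued / bytes moved to and from DRAM
    hardware ratio:  peak instruction throughput / peak memory bandwidth

A kernel whose ratio sits well below the hardware’s is likely bandwidth-bound, and one well above it compute-bound, which is close in spirit to the CGMA question.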
