Arithmetic Intensity & Compute-to-Global-Memory-Access Ratio: How to compute CGMA?

I’m looking for a way to express the arithmetic intensity of my kernels using some standard measure. Does it make sense to use the Compute-to-Global-Memory-Access ratio, and what is the right way to compute the CGMA?

Here’s a simple example:

__global__ void testKernelInt(int *in, int *out)
{
    int tmp;
    int adr;

    adr = blockDim.x * threadIdx.y + threadIdx.x;    // auxiliary arithmetic ops (address calculation)

    tmp = in[adr];                                   // global memory load

    // dummy multiply-add
    out[adr] = tmp * threadIdx.x + threadIdx.y;      // useful arithmetic ops, global memory store
}

What is the CGMA of this kernel?

The number of memory accesses is clear. I’m curious whether the idea is to take the number of arithmetic ops in the CUDA source code, or the actual number of arithmetic instructions in the PTX?

For example, integer mul and add are two instructions in PTX, whereas floating-point mul and add are fused into one.
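(For reference, the PTX a kernel compiles to can be dumped with nvcc -ptx and the instructions counted there.)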

In addition, do you count only ‘useful’ arithmetic ops, or all arithmetic ops (e.g. the address calculation for a global memory look-up)?

Looking forward to your comments! Tnx, Ana

CGMA is usually based on the hardware capabilities, not on the count of operations that you see in the source code. In the case of a GPU, you should count a fused multiply-add as one instruction, since the GPU supports it. You also need to take the address calculation into account to obtain a ‘true’ CGMA.
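To make that concrete, here is the kernel from the question annotated with one possible counting. This is only my reading of what the compiler is likely to emit; the actual PTX/SASS may differ, so treat the numbers as a sketch of the conventions rather than ground truth.

__global__ void testKernelInt(int *in, int *out)
{
    // address calculation: 1 integer mul + 1 integer add,
    // or a single mad.lo if the compiler fuses them
    int adr = blockDim.x * threadIdx.y + threadIdx.x;

    int tmp = in[adr];                               // global memory access #1 (load)

    // dummy multiply-add: again 1 integer mul + 1 integer add,
    // possibly fused into one mad.lo
    out[adr] = tmp * threadIdx.x + threadIdx.y;      // global memory access #2 (store)
}

// Counting every op, unfused:       4 arithmetic ops / 2 accesses -> CGMA = 2.0
// Counting fused MADs:              2 instructions   / 2 accesses -> CGMA = 1.0
// Counting only the 'useful' MAD:   1 instruction    / 2 accesses -> CGMA = 0.5

Which of the three numbers you quote matters less than stating clearly which convention you used.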

It makes sense to me as well! I’d count the actual arithmetic instructions in the PTX code, as I think that is the number of instructions relevant for latency hiding.

However, the book ‘Programming Massively Parallel Processors’ defines the CGMA as:

“…compute to global memory access (CGMA) ratio, defined as the number of floating-point calculations performed for each access to the global memory within a region of a CUDA program.”

and for the matrix multiply kernel, they say:

"The most important part of the kernel in terms of execution time is the for loop that performs dot product calculations. In every iteration of this loop, two global memory accesses are

performed for one floating-point multiplication and one floating-point addition. Thus, the ratio of floating-point calculation to the global memory access operation is 1 to 1, or 1.0."
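For context, this is roughly the kernel the book is analyzing (my reconstruction from memory, so the identifiers and details are mine):

// naive matrix multiply, one output element per thread
__global__ void matrixMulKernel(float *M, float *N, float *P, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < width; ++k) {
        // per iteration: 2 global loads, 1 fp multiply, 1 fp add
        // -> 2 floating-point ops / 2 accesses = CGMA of 1.0
        sum += M[row * width + k] * N[k * width + col];
    }
    P[row * width + col] = sum;
}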

I’m really curious about a few questions:

  • why is the dot product counted as “one floating-point multiplication and one floating-point addition” when the compiler replaces it with a fused floating-point MAD?

  • what to do with integer arithmetic?

  • how do you define/select the region of a CUDA program; is there a reason not to take the auxiliary arithmetic ops (address calculation) into account? (except for making it easier ;)

Of course, in this example there is a for loop in the kernel body, so for a large number of iterations the address arithmetic does not contribute much to the total count… but what if you had a rather small kernel (without a for loop), or had to calculate a new address in every loop iteration?
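To illustrate that last point, here is the same inner loop written two ways, reusing the names from the reconstructed matrix multiply kernel above. Compilers often perform this strength reduction on their own, which is exactly why I’d rather count instructions in the PTX than in the source:

// (a) address recomputed every iteration:
//     roughly one mul and one add of address math per load
for (int k = 0; k < width; ++k)
    sum += M[row * width + k] * N[k * width + col];

// (b) address math hoisted into pointer increments:
//     roughly one add of address math per load
float *m = M + row * width;    // walks along row 'row' of M
float *n = N + col;            // walks down column 'col' of N
for (int k = 0; k < width; ++k, ++m, n += width)
    sum += *m * *n;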

It’d be great if somebody from NVIDIA could help shed some light on this CGMA calculation ;)

I don’t know if it is exactly what you are looking for, but this presentation gives lots of hints and tips for checking whether you can do something to improve performance, including calculating instructions per memory load:

http://www.nvidia.com/content/PDF/sc_2010/CUDA_Tutorial/SC10_Analysis_Driven_Optimization.pdf
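If I remember the gist of those slides correctly, the comparison goes along these lines (my paraphrase, not a formula from the slides):

    kernel ratio:    instructions issued / bytes moved to and from DRAM
    hardware ratio:  peak instruction throughput / peak memory bandwidth

A kernel whose ratio sits well below the hardware’s is likely bandwidth-bound, and one well above it compute-bound, which is close in spirit to the CGMA question.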
