I’m looking for a way to express the arithmetic intensity of my kernels using some standard measure. Does it make sense to use the compute-to-global-memory-access (CGMA) ratio, and what is the right way to compute it?

Here’s a simple example:

```
__global__ void testKernelInt(int *in, int *out) {
    int tmp;
    int adr;
    adr = blockDim.x * threadIdx.y + threadIdx.x; // auxiliary arithmetic ops (address calculation)
    tmp = in[adr];                                // global memory load
    // dummy multiply-add
    out[adr] = tmp * threadIdx.x + threadIdx.y;   // useful arithmetic ops + global memory store
}
```

What is the CGMA of this kernel?

The number of memory accesses is clear. What I’m unsure about is the numerator: is it the number of arithmetic operations as written in the CUDA source code, or the actual number of arithmetic instructions in the generated PTX?
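To make the PTX-side counting concrete, here is a rough sketch of what I have in mind. The PTX fragment below is hand-written for illustration only (it is *not* actual nvcc output — the real instruction mix depends on the compiler), but the opcodes (`mul`, `add`, `mad`, `ld.global`, `st.global`) are real PTX opcode families:

```python
# Illustrative only: a hand-written PTX-like fragment, NOT real compiler output.
ptx = """
mul.lo.s32    %r1, %r2, %r3;     // blockDim.x * threadIdx.y
add.s32       %r4, %r1, %r5;     // ... + threadIdx.x
ld.global.u32 %r6, [%rd1];       // tmp = in[adr]
mad.lo.s32    %r7, %r6, %r5, %r8;// tmp * threadIdx.x + threadIdx.y
st.global.u32 [%rd2], %r7;       // out[adr] = ...
"""

# Opcode prefixes treated as arithmetic vs. global-memory instructions.
ARITH_OPCODES = ("mul", "add", "mad", "fma", "sub")
MEM_OPCODES = ("ld.global", "st.global")

def count_ops(ptx_text):
    """Tally arithmetic and global-memory instructions in a PTX listing."""
    arith = mem = 0
    for line in ptx_text.strip().splitlines():
        op = line.split()[0]
        if op.startswith(MEM_OPCODES):
            mem += 1
        elif op.startswith(ARITH_OPCODES):
            arith += 1
    return arith, mem

arith, mem = count_ops(ptx)
print(arith, mem)  # → 3 2 for this hand-written fragment
```

With this kind of tally, the answer clearly depends on whether the compiler emits a fused `mad`/`fma` or separate `mul` + `add` instructions, which is exactly why I’m asking which level of counting is the intended one.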

For example, an integer multiply and add may compile to two separate PTX instructions, whereas a floating-point multiply and add are often fused into a single FMA instruction.

In addition, do you count only ‘useful’ arithmetic ops, or all arithmetic ops (e.g. the address calculation for the global-memory look-up)?

Looking forward to your comments! Thanks, Ana