Hello I am trying to port an application using Cuda.

I decided to try eth profiler to find out if my interpretation of the documentation is correct, but apparently it is not.

This is a very simple function that calculate the inertia tensor of an array of rigid bodies,

I am calling this with 950 blocks of 16 thread each.

```
// multply two 3 x 3 matroices encoded into a 4 x 4 array
// all dat amust be in shared memory
__device__ void MatrixMultiply(float* const out, const float* const A, const float* const B)
{
int row;
int colum;
int rowScale;
row = threadIdx.x >> 2;
colum = threadIdx.x & 3;
rowScale = row << 2;
out[rowScale + colum] = A[rowScale + 0] * B[0 + colum];
out[rowScale + colum] += A[rowScale + 1] * B[4 + colum];
out[rowScale + colum] += A[rowScale + 2] * B[8 + colum];
}
// this funtil calculale teh expresion A~ * Iinv * A
// A is a 4 x 4 matrix, I is a digonal inertia encoded into a 4 x 1 vector
__global__ void InertiaMatrices (
float* const invInertiaMatrix,
const float* const matrix,
const float* const invInertia)
{
int row;
int colum;
int rowScale;
int block;
int thread;
int matrixIndex;
__shared__ float transp [16];
__shared__ float tmpMatrix [16];
__shared__ float accMatrix [16];
block = blockIdx.x;
thread = threadIdx.x;
// calcuale the matrix indices
row = thread >> 2;
colum = thread & 3;
matrixIndex = (block << 4) + thread;
rowScale = row << 2;
// copy the matrix to local array
tmpMatrix[thread] = matrix[matrixIndex];
// calculate the A~ * Iinv
transp[rowScale + colum] = tmpMatrix[(colum << 2) + row] * invInertia[(block << 2) + colum];
// now calculate (A~ * Iinv) * A by mutiplying two 4 x 4 matrices
MatrixMultiply(accMatrix, transp, tmpMatrix);
// copy the return matrix
invInertiaMatrix[matrixIndex] = accMatrix[thread];
}
void CalculeInertiaMatrices (
void* invInertaMatrixPtr,
void* matrixPtr,
void* invInertiaPtr,
int count)
{
float* const matrix = (float*) matrixPtr;
float* const invInertia = (float*) invInertiaPtr;
float* const invInertaMatrix = (float*) invInertaMatrixPtr;
InertiaMatrices <<<count, 16>>> (invInertaMatrix, matrix, invInertia);
}
```

It runned for 10 secund with 930 blocks, and this is part of the profile output.

Method gputime cputime occupancy warp_serialize

_Z15InertiaMatrices 7.776 20.065 0.25 343

_Z15InertiaMatrices 8.192 22.328 0.25 344

_Z15InertiaMatrices 8.224 22.96 0.25 376

_Z15InertiaMatrices 8.256 21.702 0.25 371

_Z15InertiaMatrices 8.224 22.143 0.25 341

_Z15InertiaMatrices 8.192 22.641 0.25 359

_Z15InertiaMatrices 8.256 22.173 0.25 368

_Z15InertiaMatrices 8.16 21.9 0.25 370

_Z15InertiaMatrices 8.224 21.966 0.25 361

_Z15InertiaMatrices 8.288 21.704 0.25 353

_Z15InertiaMatrices 8.224 22.529 0.25 360

_Z15InertiaMatrices 8.16 21.669 0.25 344

_Z15InertiaMatrices 8.16 22.533 0.25 352

_Z15InertiaMatrices 8.192 21.635 0.25 368

Here are some questions I have.

-my method only uses 16 thread per block, I was expecting an occupancy of 50% but the profiler say I am only using 25%

The documentation say the maximum efficiency is achieved when 32 threads are used for each of the multiprocessors.

It is not that is this is not bad all teh contrary, to me it means I still have room for optimization by packing my data and have each multiprocessor calculating more than one full body at time.

It is something I only do if I need the extra performance.

But I would like to know why the profiler say something different than the user guide, or why I am so wrong so that I do not continue making the same mistake going forward?

-I see the warp serialize column say there are about 350 serializations but the way I see it there should be zero.

The code have no branches, and all data is aligned to 256 bytes boundary, so each read and write should be coalesced For all 16 thread. How can I find out what cases the serialization?

-what do the GPU and CPU time mean, I could not find any information on that.

Thank you