Hello, I am trying to port an application to CUDA.
I decided to try the profiler to find out whether my interpretation of the documentation is correct, but apparently it is not.
This is a very simple function that calculates the inertia tensor of an array of rigid bodies.
I am calling it with 950 blocks of 16 threads each.
// multiply two 3 x 3 matrices encoded into 4 x 4 arrays
// all data must be in shared memory
__device__ void MatrixMultiply(float* const out, const float* const A, const float* const B)
{
    int row;
    int colum;
    int rowScale;

    // each of the 16 threads computes one element of the 4 x 4 output
    row = threadIdx.x >> 2;
    colum = threadIdx.x & 3;
    rowScale = row << 2;
    out[rowScale + colum]  = A[rowScale + 0] * B[0 + colum];
    out[rowScale + colum] += A[rowScale + 1] * B[4 + colum];
    out[rowScale + colum] += A[rowScale + 2] * B[8 + colum];
}
// this function calculates the expression A~ * Iinv * A
// A is a 4 x 4 matrix, Iinv is a diagonal inertia tensor encoded into a 4 x 1 vector
__global__ void InertiaMatrices (
    float* const invInertiaMatrix,
    const float* const matrix,
    const float* const invInertia)
{
    int row;
    int colum;
    int rowScale;
    int block;
    int thread;
    int matrixIndex;
    __shared__ float transp [16];
    __shared__ float tmpMatrix [16];
    __shared__ float accMatrix [16];

    block = blockIdx.x;
    thread = threadIdx.x;

    // calculate the matrix indices
    row = thread >> 2;
    colum = thread & 3;
    matrixIndex = (block << 4) + thread;
    rowScale = row << 2;

    // copy the matrix to the shared-memory array
    tmpMatrix[thread] = matrix[matrixIndex];

    // calculate A~ * Iinv
    // note: this reads elements of tmpMatrix written by other threads, so it
    // relies on all 16 threads being in the same warp and running in lock-step
    transp[rowScale + colum] = tmpMatrix[(colum << 2) + row] * invInertia[(block << 2) + colum];

    // now calculate (A~ * Iinv) * A by multiplying two 4 x 4 matrices
    MatrixMultiply(accMatrix, transp, tmpMatrix);

    // copy the result matrix back to global memory
    invInertiaMatrix[matrixIndex] = accMatrix[thread];
}
void CalculeInertiaMatrices (
    void* invInertaMatrixPtr,
    void* matrixPtr,
    void* invInertiaPtr,
    int count)
{
    float* const matrix = (float*) matrixPtr;
    float* const invInertia = (float*) invInertiaPtr;
    float* const invInertaMatrix = (float*) invInertaMatrixPtr;

    InertiaMatrices <<<count, 16>>> (invInertaMatrix, matrix, invInertia);
}
It ran for 10 seconds with 930 blocks, and this is part of the profiler output.
Method                gputime   cputime   occupancy   warp_serialize
_Z15InertiaMatrices     7.776    20.065        0.25              343
_Z15InertiaMatrices     8.192    22.328        0.25              344
_Z15InertiaMatrices     8.224    22.96         0.25              376
_Z15InertiaMatrices     8.256    21.702        0.25              371
_Z15InertiaMatrices     8.224    22.143        0.25              341
_Z15InertiaMatrices     8.192    22.641        0.25              359
_Z15InertiaMatrices     8.256    22.173        0.25              368
_Z15InertiaMatrices     8.16     21.9          0.25              370
_Z15InertiaMatrices     8.224    21.966        0.25              361
_Z15InertiaMatrices     8.288    21.704        0.25              353
_Z15InertiaMatrices     8.224    22.529        0.25              360
_Z15InertiaMatrices     8.16     21.669        0.25              344
_Z15InertiaMatrices     8.16     22.533        0.25              352
_Z15InertiaMatrices     8.192    21.635        0.25              368
Here are some questions I have.
- My method uses only 16 threads per block; I was expecting an occupancy of 50%, but the profiler says I am only using 25%.
The documentation says maximum efficiency is achieved when 32 threads are used on each of the multiprocessors.
Not that this is bad, quite the contrary: to me it means I still have room for optimization, by packing my data and having each multiprocessor calculate more than one body at a time.
That is something I will only do if I need the extra performance, but I would like to know why the profiler says something different from the user guide, or where my reasoning goes wrong, so that I do not keep making the same mistake going forward.
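My current guess, which may itself be the mistake, is that occupancy counts resident warps per multiprocessor rather than how full each warp is, so a 16-thread block still costs a whole warp. This little sketch shows the arithmetic I am assuming; the per-multiprocessor limits in it are guesses on my part and may not match my card:

#include <stdio.h>

int main(void)
{
    /* guessed per-multiprocessor limits, they differ between devices */
    int maxBlocksPerSM = 8;
    int maxWarpsPerSM = 32;

    int threadsPerBlock = 16;
    int warpsPerBlock = (threadsPerBlock + 31) / 32; /* a 16-thread block still occupies one warp */

    float occupancy = (float)(maxBlocksPerSM * warpsPerBlock) / (float)maxWarpsPerSM;
    printf("occupancy = %.2f\n", occupancy); /* prints 0.25 with these limits */
    return 0;
}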
- I see the warp_serialize column says there are about 350 serializations, but the way I see it there should be zero.
The code has no branches, and all data is aligned to a 256-byte boundary, so every read and write should be coalesced for all 16 threads. How can I find out what causes the serialization?
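To convince myself, I also checked by hand which shared-memory word each thread touches in the transpose read; assuming 16 banks of 32-bit words (my reading of the guide), every thread lands in a different bank, which is part of why I expect zero:

#include <stdio.h>

int main(void)
{
    /* assuming 16 shared-memory banks of 32-bit words, word i lives in bank i % 16;
       the transpose step reads tmpMatrix[(colum << 2) + row] */
    int thread;

    for (thread = 0; thread < 16; thread++)
    {
        int row = thread >> 2;
        int colum = thread & 3;
        int word = (colum << 2) + row;
        printf("thread %2d reads word %2d (bank %2d)\n", thread, word, word % 16);
    }
    return 0;
}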
- What do the gputime and cputime columns mean? I could not find any information on that.
Thank you