# Timing comparison (ms) when computing the sum of matrix rows

Good morning,
I will try to explain the question as clearly as possible.

I need to compute the sum of each row of a matrix with the following parameters:

• N rows (with N= blocks*threads)
• X columns
The result must be an array of N elements containing the sum of each row.
The code works (I enclose the project), but what puzzles me is the execution times.
To get more reliable results, I run 10,000 iterations and average the execution times, but the results don't convince me.

How is it possible that there is a difference of almost 10% between the one-dimensional and multidimensional results?

I launch the kernel in both the one-dimensional and multidimensional configurations, and the times come out as follows:

``````
####################

ONE-DIM: 8 - 8 MULTI: 4x2 - 4x2

Time ONE-DIM: 0.117128 (ms) Time Multi: 0.094554 (ms)

ONE-DIM: 16 - 16 MULTI: 4x4 - 4x4

Time ONE-DIM: 0.206620 (ms) Time Multi: 0.193303 (ms)

ONE-DIM: 32 - 16 MULTI: 8x4 - 4x4

Time ONE-DIM: 0.220927 (ms) Time Multi: 0.204934 (ms)

ONE-DIM: 16 - 32 MULTI: 4x4 - 8x4

Time ONE-DIM: 0.300701 (ms) Time Multi: 0.274282 (ms)

ONE-DIM: 16 - 32 MULTI: 4x4 - 4x8

Time ONE-DIM: 0.300746 (ms) Time Multi: 0.274242 (ms)

ONE-DIM: 64 - 32 MULTI: 8x8 - 8x4

Time ONE-DIM: 0.562803 (ms) Time Multi: 0.453662 (ms)

ONE-DIM: 64 - 64 MULTI: 8x8 - 8x8

Time ONE-DIM: 1.346480 (ms) Time Multi: 0.874964 (ms)
``````

I understand that there must be some difference between one-dim and multi, but even in the last run the gap is almost 0.5 ms, which I don't understand.

``````
// Include
#include <stdio.h>
#include <assert.h>
/*
#define N_BLOCK_ONE_DIM 32
#define N_BLOCK_MULTI_X 4
#define N_BLOCK_MULTI_Y 8
*/
#define COLUMNS 10000
#define CICLI 10000
#define verbose 0

int b_one_dim[] = {8, 16, 32, 16, 16, 64, 64};
int t_one_dim[] = {8, 16, 16, 32, 32, 32, 64};
int b_multi_x[] = {4, 4, 8, 4, 4, 8, 8};
int b_multi_y[] = {2, 4, 4, 4, 4, 8, 8};
int t_multi_x[] = {4, 4, 4, 8, 4, 8, 8};
int t_multi_y[] = {2, 4, 4, 4, 8, 4, 8};
/*
PROBLEM
Compute the sum of each row of a matrix with the following parameters:
- N rows, with N = number of blocks * number of threads
- X columns
The result is an array of N elements containing the sum of each row.
*/

// structure of the matrix to process
typedef struct {
int width;
int height;
float* elements;
} Matrix;

// structure of the array of row sums
typedef struct {
int length;
float* elements;
} Sums;

//Kernels
__global__ void MatSumOneDimKernel(const Matrix, Sums);
__global__ void MatSumMultiKernel(const Matrix, Sums);

// Host function that launches the kernels
void MatSum(const Matrix A, Sums S_one_dim, Sums S_multi, int index)
{
// accumulators for the average timings
float mediat_one_dim = 0.0;
float mediat_multi = 0.0;

Matrix d_A;
d_A.width = A.width; d_A.height = A.height;
size_t size = A.width * A.height * sizeof(float);
cudaMalloc((void**) &d_A.elements, size);
cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);

Sums d_S;
d_S.length = S_one_dim.length;
size = S_one_dim.length * sizeof(float);
cudaMalloc((void**) &d_S.elements, size);

// WARMUP
//MatSumOneDimKernel<<<b_one_dim[index], t_one_dim[index]>>>(d_A, d_S);

//timer
cudaEvent_t start;
cudaEvent_t stop;
float msecTotal;

//-------ONE-DIM------------------
for(int j = 0; j<CICLI; ++j)
{
cudaEventCreate(&start);
cudaEventRecord(start, NULL);

// Invoke kernel one-dim
MatSumOneDimKernel<<<b_one_dim[index], t_one_dim[index]>>>(d_A, d_S);

// stop and destroy timer
cudaEventCreate(&stop);
cudaEventRecord(stop, NULL);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&msecTotal, start, stop);

//printf( "\n\nTime ONE-DIM %f (ms)\n\n", msecTotal);
mediat_one_dim += msecTotal;

if(verbose)
{
cudaMemcpy(S_one_dim.elements, d_S.elements, size, cudaMemcpyDeviceToHost);
}
}
//-----------------------------------------

//------TWO-DIMENSIONAL-------------------
dim3 dimGrid(b_multi_x[index],b_multi_y[index]);
dim3 dimBlock(t_multi_x[index], t_multi_y[index]);

for(int j = 0; j<CICLI; ++j)
{
cudaEventCreate(&start);
cudaEventRecord(start, NULL);

// Invoke kernel multi
MatSumMultiKernel<<<dimGrid, dimBlock>>>(d_A, d_S);

// stop and destroy timer
cudaEventCreate(&stop);
cudaEventRecord(stop, NULL);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&msecTotal, start, stop);

//printf( "\n\nTime Multi: %f (ms)\n\n", msecTotal);
mediat_multi += msecTotal;

if(verbose)
{
cudaMemcpy(S_multi.elements, d_S.elements, size, cudaMemcpyDeviceToHost);
}
}
//-----------------------------------------

cudaFree(d_A.elements);
cudaFree(d_S.elements);

printf( "\n\nONE-DIM: %d - %d\t\tMULTI: %dx%d - %dx%d\n\n", b_one_dim[index], t_one_dim[index], b_multi_x[index], b_multi_y[index],
t_multi_x[index], t_multi_y[index]);
printf( "\n\nTime One-dim: %f (ms)\t\tTime Multi: %f (ms)\n\n", mediat_one_dim/(float)CICLI, mediat_multi/(float)CICLI);
}

//Kernel ONE-DIM
__global__ void MatSumOneDimKernel(const Matrix A, Sums S)
{
// each thread computes the sum of one row of A
float Svalue= 0;
int index = blockIdx.x * blockDim.x + threadIdx.x;
int row = (blockIdx.x * blockDim.x + threadIdx.x) * COLUMNS;

for (int i=0; i<COLUMNS; i++)
{
Svalue += A.elements[row + i];
}

S.elements[index] = Svalue;

}

//Kernel TWO-DIMENSIONAL
__global__ void MatSumMultiKernel(const Matrix A, Sums S)
{
// each thread computes the sum of one row of A
float Svalue= 0;
int index = (blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x + blockDim.y) + threadIdx.x + threadIdx.y * blockDim.x;
int row = ((blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x + blockDim.y) + threadIdx.x + threadIdx.y * blockDim.x) * COLUMNS;

for (int i=0; i<COLUMNS; i++)
{
Svalue += A.elements[row + i];
}

S.elements[index] = Svalue;

}

//Main
int main(int argc, char** argv)
{
for(int index = 0; index<7; ++index)
{
// Matrix
Matrix h_A;
h_A.width = COLUMNS; h_A.height = b_one_dim[index]*t_one_dim[index];
size_t size = h_A.width*h_A.height*sizeof(float);
h_A.elements = (float*)malloc(size);

// Sums vector for the ONE-DIM kernel
Sums h_S_one_dim;
h_S_one_dim.length = b_one_dim[index]*t_one_dim[index];
size = b_one_dim[index]*t_one_dim[index]*sizeof(float);
h_S_one_dim.elements = (float*)malloc(size);

// Sums vector for the TWO-DIMENSIONAL kernel
Sums h_S_multi;
h_S_multi.length = b_one_dim[index]*t_one_dim[index];
size = b_one_dim[index]*t_one_dim[index]*sizeof(float);
h_S_multi.elements = (float*)malloc(size);

// Matrix initialization
int i = 0;
for(int y=0; y<h_A.height; y++)
for(int x=0; x<h_A.width; x++) {
h_A.elements[y*h_A.width+x] = i;
i++;
}

// Print A
if(verbose){
printf("\nh_A\n");
for(int y=0; y<h_A.height; y++){
printf("\n");
for(int x=0; x<h_A.width; x++) {
printf("%f ", h_A.elements[y*h_A.width+x]);
}
}
}

// Vector initialization
for(int i=0; i<h_S_one_dim.length; i++)
{
h_S_one_dim.elements[i] = 0.0;
h_S_multi.elements[i] = 0.0;
}

MatSum(h_A, h_S_one_dim, h_S_multi, index);

// Print the vectors
if(verbose)
{
printf("\n\n\nh_S_one_dim");
for(int i = 0; i< h_S_one_dim.length; i++)
{
printf("\n");
printf("%f ", h_S_one_dim.elements[i]);
}

printf("\n\n\nh_S_multi");
for(int i = 0; i< h_S_multi.length; i++)
{
printf("\n");
printf("%f ", h_S_multi.elements[i]);
}

//checker
for(int i = 0; i< h_S_multi.length; i++)
{
if(h_S_multi.elements[i]-h_S_one_dim.elements[i] != 0.0)
{
printf("\n\n\nERROR: MISMATCHED DATA");
}
}
}
}

}
``````

First of all, your kernels have relatively small thread counts. The largest launch has 4096 threads, which is too small to saturate most GPUs, so these kernels make inefficient use of the GPU. On top of that, each thread makes row-wise accesses to GPU memory, which makes inefficient use of memory as well.
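As an aside, one common way to fix the memory pattern for this kind of per-thread row reduction is to store the matrix transposed (column-major), so that on every loop iteration adjacent threads read adjacent addresses. A hedged sketch, reusing the `Matrix`/`Sums` structs from the code above and *assuming* a column-major `elements` layout (which is not how the posted code stores the data):

```cuda
// Hypothetical coalesced variant: assumes A.elements is stored column-major,
// i.e. element (row, col) lives at elements[col * A.height + row].
__global__ void MatSumColMajorKernel(const Matrix A, Sums S)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= A.height) return;

    float Svalue = 0.0f;
    for (int i = 0; i < COLUMNS; i++) {
        // adjacent threads (consecutive row values) read adjacent addresses,
        // so each warp's load coalesces into a few memory transactions
        Svalue += A.elements[i * A.height + row];
    }
    S.elements[row] = Svalue;
}
```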

So we are studying code that is not well optimized.

The main issue, however, is that your indexing in the 2D case is broken, so the kernels are not doing the same work, and this favors the 2D case.

If you did careful data consistency checking, you would discover that the 2D kernel you have written does not compute all rows. I don't wish to argue this with you, nor do I wish to demonstrate it for you. If you do a better job of carefully isolating the results from each kernel, and do not use the same output buffer for both kernels, you can prove it to yourself. The reason your consistency test appeared to pass is that the output buffer for the first kernel (the one-dimensional, "correct" kernel) was filled correctly and then reused for the second, "incorrect" kernel. That confused you. Another trick to trap this sort of mistake is to `cudaMemset` the output buffer each time you use it. If you do that, you can quickly and easily prove to yourself that there is a design problem in your code.
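For example, the consistency check becomes meaningful if each kernel gets its own device buffer and that buffer is cleared before every launch. A sketch against the existing allocations in `MatSum` (names follow your code; this replaces the single shared `d_S`):

```cuda
// Separate device buffers, cleared before each launch, so one kernel's
// output cannot mask rows the other kernel never wrote.
Sums d_S_one_dim, d_S_multi;
d_S_one_dim.length = d_S_multi.length = S_one_dim.length;
size_t sSize = S_one_dim.length * sizeof(float);
cudaMalloc((void**)&d_S_one_dim.elements, sSize);
cudaMalloc((void**)&d_S_multi.elements, sSize);

cudaMemset(d_S_one_dim.elements, 0, sSize);
MatSumOneDimKernel<<<b_one_dim[index], t_one_dim[index]>>>(d_A, d_S_one_dim);

cudaMemset(d_S_multi.elements, 0, sSize);
MatSumMultiKernel<<<dimGrid, dimBlock>>>(d_A, d_S_multi);

// Copy both back and compare on the host; rows the 2D kernel never
// wrote will still be 0 and the mismatch becomes obvious.
cudaMemcpy(S_one_dim.elements, d_S_one_dim.elements, sSize, cudaMemcpyDeviceToHost);
cudaMemcpy(S_multi.elements,   d_S_multi.elements,   sSize, cudaMemcpyDeviceToHost);
```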

Why does the 2D kernel not compute all rows? Because your indexing is broken. I’m not sure where you came up with it.

The defect is here:

``````
int index = (blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x + blockDim.y) + threadIdx.x + threadIdx.y * blockDim.x;
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
``````

The first term in parentheses computes a global block index; so far so good. You then multiply it by `(blockDim.x + blockDim.y)`, but that is not correct. Your intention was to convert the global block index into a global thread offset for the current block. To do that, you must multiply the global block index by the total number of threads per block, which is the product of the block dimensions, not their sum. It should have been `(blockDim.x * blockDim.y)`.

If you fix the indexing then, according to my testing, there is no significant performance difference between the two cases.

(For some reason you compute identical expressions for `index` and `row`. The indexing needs to be fixed in both places, or, perhaps more sensibly, compute `index` once and then define `row` as `int row = index * COLUMNS;`.)
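Putting the two fixes together, the corrected 2D kernel might look like this (a sketch; untested on your exact setup):

```cuda
// Two-dimensional kernel with corrected global indexing
__global__ void MatSumMultiKernel(const Matrix A, Sums S)
{
    // global block index times threads-per-block (the product, not the sum),
    // plus the thread's position within its block
    int index = (blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x * blockDim.y)
              + threadIdx.x + threadIdx.y * blockDim.x;
    int row = index * COLUMNS;

    float Svalue = 0.0f;
    for (int i = 0; i < COLUMNS; i++)
        Svalue += A.elements[row + i];

    S.elements[index] = Svalue;
}
```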