Help needed optimizing a kernel


I have the following kernel:


#define IDX2C(i,j,ld) (((j)*(ld))+(i))

global void kernel_elmcol_m_unrolled(int *num_threads,

                                     float *alpha,

                                     const float *x,

                                     int *incx,

                                     float *y,

                                     int *incy,

                                     int *i,

                                     int *i1,

                                     int *num_row_elements)


const int tid = (blockIdx.x * blockDim.x + threadIdx.x) + (blockIdx.y * gridDim.x);

int k_num_threads = *num_threads;

if (tid < k_num_threads) {

    int k_incx = *incx;

    int k_incy = *incy;

    int k_i = *i;

    int k_i1 = *i1;

    int k_num_row_elements = *num_row_elements;

    int row_offset = tid % k_num_row_elements;

    int col_offset = tid / k_num_row_elements;

    y[IDX2C(k_i, k_i1+col_offset, c_num_equations)+row_offset*k_incy] +=

            alpha[col_offset] * x[IDX2C(k_i, k_i, c_num_equations)+row_offset*k_incx];




Now, this takes a really long time to run. Also, curiously when I use ‘=’ instead of ‘+=’ it runs about 3 times faster.

What possible things could be going wrong with this?



What card are you using?

Also what does IDX2C does, can you post the code?

The ‘=’ instead of the ‘+=’ probably indicates what the problem is: non-coalesced reads.

When you do += you need to first read the data from gmem, if you’re doing it in a non-coalesced way you’ll pay big penalties.

The ‘=’ simply writes the value to the memory where random writes are probably less expensive than random reads.


GTX 280

This just calculates the linear index from a 2D index. The top of the code defines the IDX2C macro.

Thanks for this. I will look into the documentation a bit more and maybe find a way to avoid this…

Why do you pass pointers to the arguments instead of the arguments themselves? This wastes a lot of memory bandwidth.

I wasn’t sure I could do that. Does it really make that much difference? It is still 4 bytes.



Arguments live in shared memory, so accessing them does not use up global memory bandwidth (and has about 1/100th of the latency).