Need advice on optimizing a matrix-vector product

Hello,

I have a “matrix.vector product code” written in C with already one levels of parallelism : OpenMP. I try to add an other second level with OpenACC directives on Tesla k40 GPU for comparison with OpenMP. What should I try to optimize?

Moreover, I have two GPUs on my computer. If I want to use both GPUs, what else should I do?

static void
_mat_vec_p_l_csr(bool                exclude_diag,
                 const cs_matrix_t  *matrix,
                 const cs_real_t    *restrict x,
                 cs_real_t          *restrict y)
{
  const cs_matrix_struct_csr_t  *ms = matrix->structure;
  const cs_matrix_coeff_csr_t  *mc = matrix->coeffs;
  cs_lnum_t  n_rows = ms->n_rows;

  /* Standard case */

  if (!exclude_diag) {

#   pragma omp parallel for
    for (cs_lnum_t ii = 0; ii < n_rows; ii++) {

      cs_lnum_t *restrict col_id = ms->col_id + ms->row_index[ii];
      cs_real_t *restrict m_row = mc->val + ms->row_index[ii];
      cs_lnum_t n_cols = ms->row_index[ii+1] - ms->row_index[ii];
      cs_real_t sii = 0.0;

      for (cs_lnum_t jj = 0; jj < n_cols; jj++)
        sii += (m_row[jj]*x[col_id[jj]]);

      y[ii] = sii;

    }

  }

  /* Exclude diagonal */

  else {

#   pragma omp parallel for
    for (cs_lnum_t ii = 0; ii < n_rows; ii++) {

      cs_lnum_t *restrict col_id = ms->col_id + ms->row_index[ii];
      cs_real_t *restrict m_row = mc->val + ms->row_index[ii];
      cs_lnum_t n_cols = ms->row_index[ii+1] - ms->row_index[ii];
      cs_real_t sii = 0.0;

      for (cs_lnum_t jj = 0; jj < n_cols; jj++) {
        if (col_id[jj] != ii)
          sii += (m_row[jj]*x[col_id[jj]]);
      }

      y[ii] = sii;

    }
  }

}

best,

Jackie

Hi Jackie,

What should I try to optimize?

There’s no definitive answer here, and it may take a bit of experimentation. A few factors determine whether or not it’s worth parallelizing inner loops with reductions. There’s overhead in performing a reduction (a partial reduction across each vector, with a final sequential reduction after the loop), so you need enough work in the loop to offset that overhead. Given there’s only a single multiply, this might not be a good candidate.

Also, what’s the trip count? It looks like the “n_cols” value changes from row to row, but if it’s always a larger value (say over 1000), then you might justify the cost. If it’s small, then no.

What’s the data access pattern? With “m_row” you’d be accessing across the stride-1 dimension (good!), but you have a random access pattern with “x” (bad). However, you use both “const” and “restrict” with “x”, so the compiler will most likely put “x” into texture memory and thus mitigate the poor memory access of “x”.

How big is “n_rows”? If it’s relatively small and “n_cols” is large, then you’ll more likely want to parallelize the “jj” loop in order to expose more parallelism. If “n_rows” is large, and “n_cols” is small, then it’s probably better to run the “ii” loop sequentially.

Sorry, way too many “it depends” statements. Actually, instead of going through all the different scenarios, it’s probably better to just try the two options and see what’s best for your code. Try something like the following two cases.

Case #1: Inner loop reduction

#pragma acc parallel loop present(ms,mc,x,y)
    for (cs_lnum_t ii = 0; ii < n_rows; ii++) { 

      cs_lnum_t *restrict col_id = ms->col_id + ms->row_index[ii]; 
      cs_real_t *restrict m_row = mc->val + ms->row_index[ii]; 
      cs_lnum_t n_cols = ms->row_index[ii+1] - ms->row_index[ii]; 
      cs_real_t sii = 0.0; 
#pragma acc loop vector reduction(+:sii)
      for (cs_lnum_t jj = 0; jj < n_cols; jj++) 
        sii += (m_row[jj]*x[col_id[jj]]); 

      y[ii] = sii; 

    }

Case #2: Inner loop sequential

#pragma acc parallel loop gang vector present(ms,mc,x,y)
    for (cs_lnum_t ii = 0; ii < n_rows; ii++) { 

      cs_lnum_t *restrict col_id = ms->col_id + ms->row_index[ii]; 
      cs_real_t *restrict m_row = mc->val + ms->row_index[ii]; 
      cs_lnum_t n_cols = ms->row_index[ii+1] - ms->row_index[ii]; 
      cs_real_t sii = 0.0; 
#pragma acc loop seq
      for (cs_lnum_t jj = 0; jj < n_cols; jj++) 
        sii += (m_row[jj]*x[col_id[jj]]); 

      y[ii] = sii; 

    }

Note that I used a “present” clause to tell the compiler that the data is already on the device. You’ll need to manage the data yourself at some point, but initially you can try CUDA Unified Memory (UM) via the PGI flag “-ta=tesla:cc35,managed”. UM will manage the movement of dynamic data (you’ll still need to manage static data yourself), making it easier to start programming. Granted, the performance is often slower, but as long as you profile your code (easiest is to set the environment variable PGI_ACC_TIME, though pgprof will give you more details), you can see the performance impact of the kernels above. Later you can go back and optimize the data movement.

If I want to use two GPUs, what else I should do?

Do you use MPI? If so, then you just need to run with 2 MPI ranks and assign each rank to a particular GPU via a call to “acc_set_device_num”.

If you don’t use MPI, then you can use OpenMP to divide the work across multiple GPUs. However, it’s a bit more difficult, since you’ll need to perform the domain decomposition yourself by splitting the “ii” loop into multiple blocks. UM doesn’t yet work with a single process across multiple GPUs (MPI is fine), so you need to manage the data movement yourself. You’ll also need to copy all of “x” to both GPUs, as well as “mc” and “ms”. “y” can be split across the GPUs, but you’ll want to copy it back each time unless you know that you’ll decompose the data the same way each time you call this routine.

I can help you further if you really do want to use OpenMP to parallelize across multiple GPUs, but in my opinion, MPI is the better way to go.

Finally, there’s really no benefit to using multiple GPUs unless you have a very large domain. A K40 has 12GB of memory, so unless you’re using more than that, multiple GPUs probably won’t help you here.

  • Mat

Thanks, Mat,

Also, what’s the trip count? It looks like the “n_cols” value changes from row to row, but if it’s always a larger value (say over 1000), then you might justify the cost. If it’s small, then no.

Yes, but it’s always a small value.

but you have a random access pattern with “x” (bad). However, you use both “const” and “restrict” with “x”, so the compiler will most likely put “x” into texture memory and thus mitigate the poor memory access of “x”.

Should I change this access pattern for GPU parallel?

How big is “n_rows”?

It’s over 1000.

If “n_rows” is large, and “n_cols” is small, then it’s probably better to run the “ii” loop sequentially.

That’s true. The “n_rows” value changes each time this function is invoked, and it generally tapers off.
I want to add an “if” clause to justify the cost. If I do, will it reduce the overhead? For example:

#pragma acc parallel loop present(ms,mc,x,y) if(n_rows > 1000)



Case #1: Inner loop reduction
Case #2: Inner loop sequential

According to my description above, should I choose the second case?

Granted, the performance is often slower, but as long as you profile your code (easiest is to set the environment variable PGI_ACC_TIME, though pgprof will give you more details), you can see the performance impact of the kernels above. Later you can go back and optimize the data movement.

If I want to manage data explicitly, where do I add the OpenACC directives? Outside the “for” loop, or outside the “mat_vec” functions?

Do you use MPI? If so, then you just need to run with 2 MPI ranks and assign each rank to a particular GPU via a call to “acc_set_device_num”.

Yes, but I really do want to use OpenMP to parallelize across multiple GPUs. Should I do that via a call to “acc_set_device_num”?

A K40 has 12GB of memory, so unless you’re using more than that, multiple GPUs probably won’t help you here.

I have a Tesla K40c (11439 MiB) and a Quadro K5000 (4035 MiB). Will the latter reduce performance?

best,

Jackie

Hi Jackie,

According to my description above, should I choose the second case?

Most likely. Though, it’s easy to try both cases.

I want to add an “if” clause to justify the cost. If I do, will it reduce the overhead?

You can, but you’ll want to experiment to find the right threshold. Data movement should also be a consideration. With the “if” clause, you’re implying that you’ll sometimes use the host copy of the data and sometimes the device copy. If you’re planning on updating the data each time you enter the routine, then it’s not a problem. However, it’s preferable to copy the data to the device once, perform all the computation on the device, and then copy the data back at the end. Using the “if” clause will prevent this.

If I want to manage data explicitly, where do I add the OpenACC directives? Outside the “for” loop, or outside the “mat_vec” functions?

You can add the data directives as you enter this routine, but this will add overhead since you’ll be copying the data to the device every time the routine is called. You can start by adding them here, especially if you choose to use the “if” statement, but ideally you would manage the data higher up in your program.

I recommend using unstructured data regions at the same time as you build your data structures, then using update directives to synchronize your data. For an example, you can get the source code for Chapter 5 of Parallel Programming with OpenACC on data management (I wrote the chapter).
See: https://github.com/rmfarber/ParallelProgrammingWithOpenACC

In particular, look at how I did the data management in the “array_of_structs.c” example. If you’re using C++, there are examples using classes as well.

Yes, but I really do want to use OpenMP to parallelize across multiple GPUs. Should I do that via a call to “acc_set_device_num”?

You would use “acc_set_device_num” to assign each OpenMP thread to a device. I don’t have a small example to post, but here are the basic steps you’ll need, in pseudo-code.

// Get the number of threads
// Determine the block size for each thread
// OMP parallelize over the blocks.

#pragma omp parallel for
    for (cs_lnum_t block = 0; block < num_blocks; ++block) {

// Assign each OMP thread to a device
    int thdid = omp_get_thread_num();
    devnum = thdid % number_of_devices;
    acc_set_device_num(devnum, acc_device_nvidia);

// Determine which data gets copied to which device
    start_elem = (block*sizeof_block);
    end_elem = start_elem + sizeof_block;
    
// Create your data structures on each of the devices
   #pragma acc enter data copyin(y[start_elem:sizeof_block])

// Be sure to copy over all of "x" and other shared arrays to each device
   #pragma acc enter data copyin(x[0:sizeof_x])
... code to copy ms and mc (I don't know the structure so don't know how to copy them over)

  if (!exclude_diag) { 

#pragma acc parallel loop present(ms,mc,x,y)
for (cs_lnum_t ii = start_elem; ii < end_elem; ii++) { 
... do the computation
}

 } else {

#pragma acc parallel loop present(ms,mc,x,y)
for (cs_lnum_t ii = start_elem; ii < end_elem; ii++) { 
... do the exclude-diagonal computation
}

}

// Copy back the results and delete the device data
  #pragma acc exit data copyout(y[start_elem:sizeof_block])
  #pragma acc exit data delete(x)
... delete ms and mc structures.

} // end the omp block loop



I have a Tesla K40c(11439MiB) and a Quadro K5000(4035MiB). Will the latter one reduce the performance?

Possibly. You’d want to load balance by sizing the blocks so that the K40 gets more work.

  • Mat