Hi Jackie,
What should I try to optimize?
There’s no definitive answer here, and it may take a bit of experimentation. I find there are a few factors that determine whether or not it’s worth parallelizing an inner loop with a reduction. There’s an overhead cost in performing a reduction (a partial reduction across each vector, with a final sequential reduction after the loop), so you need enough work in the loop to offset that overhead. Given there’s only a single multiply per iteration, this might not be a good candidate.
Also, what’s the trip count? It looks like the “n_cols” value changes from row to row, but if it’s consistently large (say, over 1,000), then you might justify the cost. If it’s small, then no.
What’s the data access pattern? With “m_row”, you’d be accessing across the stride-1 dimension (good!), but you have a random access pattern with “x” (bad). However, since you use both “const” and “restrict” with “x”, the compiler will most likely put “x” into texture memory and thus mitigate its poor access pattern.
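For reference, the declaration pattern that enables this looks something like the following (the routine and struct type names here are placeholders, not your actual code):

/* Placeholder names: substitute your actual routine and matrix types.
   Declaring "x" as both const and restrict tells the compiler it is
   read-only and not aliased, so it can be fetched through the GPU's
   read-only (texture) cache. */
void mat_vec_csr(const matrix_struct_t *ms,
                 const matrix_coeff_t *mc,
                 const cs_real_t *restrict x,
                 cs_real_t *restrict y);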
How big is “n_rows”? If it’s relatively small and “n_cols” is large, then you’ll more likely want to parallelize the “jj” loop in order to expose more parallelism. If “n_rows” is large and “n_cols” is small, then it’s probably better to run the “jj” loop sequentially (and parallelize across “ii”).
Sorry, way too many “it depends” statements. Instead of going through all the different scenarios, it’s probably better to just try the two options and see which works best for your code. Try something like the following two cases.
Case #1: Inner loop reduction
#pragma acc parallel loop present(ms,mc,x,y)
for (cs_lnum_t ii = 0; ii < n_rows; ii++) {
  cs_lnum_t *restrict col_id = ms->col_id + ms->row_index[ii];
  cs_real_t *restrict m_row = mc->val + ms->row_index[ii];
  cs_lnum_t n_cols = ms->row_index[ii+1] - ms->row_index[ii];
  cs_real_t sii = 0.0;
  /* vector-parallel inner loop with a sum reduction */
  #pragma acc loop vector reduction(+:sii)
  for (cs_lnum_t jj = 0; jj < n_cols; jj++)
    sii += (m_row[jj]*x[col_id[jj]]);
  y[ii] = sii;
}
Case #2: Inner loop sequential
#pragma acc parallel loop gang vector present(ms,mc,x,y)
for (cs_lnum_t ii = 0; ii < n_rows; ii++) {
  cs_lnum_t *restrict col_id = ms->col_id + ms->row_index[ii];
  cs_real_t *restrict m_row = mc->val + ms->row_index[ii];
  cs_lnum_t n_cols = ms->row_index[ii+1] - ms->row_index[ii];
  cs_real_t sii = 0.0;
  /* inner loop runs sequentially within each thread */
  #pragma acc loop seq
  for (cs_lnum_t jj = 0; jj < n_cols; jj++)
    sii += (m_row[jj]*x[col_id[jj]]);
  y[ii] = sii;
}
Note that I used a “present” clause to tell the compiler that the data is already on the device. You’ll need to manage the data yourself at some point, but initially you can try using CUDA Unified Memory (UM) via the PGI flag “-ta=tesla:cc35,managed”. UM manages the movement of dynamically allocated data (you’ll still need to manage static data yourself), making it easier to start programming. Granted, performance is often slower, but as long as you profile your code (easiest is to set the environment variable PGI_ACC_TIME, though pgprof will give you more details), you can see the performance impact of the above kernels. Later you can go back and optimize the data movement.
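When you do manage the data yourself, the usual pattern is to copy everything to the device once, before the iteration loop, with unstructured data directives so the “present” clauses above find it there. A minimal sketch, assuming “nnz” is the total number of nonzeros (i.e. ms->row_index[n_rows]) and “n_vals” is the length of “x” (both placeholder names):

/* Sketch only: "nnz" and "n_vals" are placeholders for your actual sizes.
   Copy the structs first, then their members, so the device-side
   pointers get attached correctly. */
#pragma acc enter data copyin(ms[0:1], mc[0:1])
#pragma acc enter data copyin(ms->row_index[0:n_rows+1], ms->col_id[0:nnz])
#pragma acc enter data copyin(mc->val[0:nnz], x[0:n_vals]) create(y[0:n_rows])

/* ... call the matrix-vector kernels as often as needed ... */

#pragma acc exit data copyout(y[0:n_rows])
#pragma acc exit data delete(mc->val[0:nnz], ms->col_id[0:nnz], \
                             ms->row_index[0:n_rows+1], x[0:n_vals], mc[0:1], ms[0:1])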
If I want to use two GPUs, what else should I do?
Do you use MPI? If so, then you just need to run with 2 MPI ranks and assign each rank to a particular GPU via a call to “acc_set_device_num”.
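For example, right after MPI_Init you could do something like this (a minimal sketch; “assign_gpu_to_rank” is just an illustrative name, and if you run more ranks per node than GPUs you’d want the node-local rank rather than the global rank):

#include <mpi.h>
#include <openacc.h>

/* Sketch: bind each MPI rank to one GPU, round-robin on the global rank. */
void assign_gpu_to_rank(void)
{
  int rank, ngpus;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  ngpus = acc_get_num_devices(acc_device_nvidia);
  acc_set_device_num(rank % ngpus, acc_device_nvidia);
}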
If you don’t use MPI, then you can use OpenMP to divide the work across multiple GPUs. However, it’s a bit more difficult, since you’ll need to perform the domain decomposition yourself by splitting the “ii” loop into multiple blocks. UM doesn’t yet work with a single process across multiple GPUs (MPI is fine), so you’d need to manage the data movement yourself. You’ll also need to copy all of “x” to both GPUs, as well as “mc” and “ms”. “y” can be split across the GPUs, but you’ll want to copy it back each time unless you know that you’ll decompose the data the same way each time you call this routine.
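If you do want to try the OpenMP route, the structure would look roughly like this sketch (it assumes one OpenMP thread per GPU, and that “x”, “ms”, “mc”, and each device’s block of “y” have already been copied to the corresponding GPU as described above):

#include <omp.h>
#include <openacc.h>

/* Rough sketch: one OpenMP thread drives each GPU, each taking a
   contiguous block of rows. Per-device data movement is omitted. */
int ngpus = acc_get_num_devices(acc_device_nvidia);
#pragma omp parallel num_threads(ngpus)
{
  int dev = omp_get_thread_num();
  acc_set_device_num(dev, acc_device_nvidia);

  /* block decomposition of the rows across the GPUs */
  cs_lnum_t block = (n_rows + ngpus - 1) / ngpus;
  cs_lnum_t start = dev * block;
  cs_lnum_t end = start + block > n_rows ? n_rows : start + block;

  #pragma acc parallel loop present(ms,mc,x,y)
  for (cs_lnum_t ii = start; ii < end; ii++) {
    cs_lnum_t *restrict col_id = ms->col_id + ms->row_index[ii];
    cs_real_t *restrict m_row = mc->val + ms->row_index[ii];
    cs_lnum_t n_cols = ms->row_index[ii+1] - ms->row_index[ii];
    cs_real_t sii = 0.0;
    #pragma acc loop seq
    for (cs_lnum_t jj = 0; jj < n_cols; jj++)
      sii += (m_row[jj]*x[col_id[jj]]);
    y[ii] = sii;
  }
}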
I can help you further if you really do want to use OpenMP to parallelize across multiple GPUs, but in my opinion, MPI is the better way to go.
Finally, there’s really no benefit to using multiple GPUs unless you have a very large domain. A K40 has 12GB of memory, so unless you’re using more than that, multiple GPUs probably won’t help you here.