I tried to parallelize this code segment, but it fails to produce the correct answer.
Basically, I only want to parallelize the outermost for loop.
All the code inside the outermost loop is expected to run sequentially within each thread.
The code is functionally correct; the only problem is how to use the directive correctly.
Any ideas? This should be a simple model to parallelize.
#pragma acc parallel loop \
    copyin(in_data[num_record],threads,num_record,num_feature,new_point[num_feature],k,z0,z1) \
    copyout(rst[threads*k]) private(in_data_copy)
for (int i = 0; i < threads; i++) {
    seed = 19 + i;
    for (int j = 0; j < 4177; j++) {
        in_data_copy[j] = in_data[j];
        for (int l = 0; l < 8; l++) {
            rand = generateGaussianNoise(0,0,&z0,&z1,&generate,&seed);
            in_data_copy[j].record[l] = in_data[j].record[l] + rand;
        }
    }
    knn(in_data_copy,num_record,num_feature,new_point,k,rst+i*k);
}
Also, a few things confuse me here.
- Do I really need to use “copyin” and “copyout” here? Or, when is it a requirement to use “copyin” and “copyout”? I thought “copyin” and “copyout” might only be necessary when you want to share some data across multiple “parallel” regions.
- I used private here to declare “in_data_copy” as private for each thread, yet when I declare the size using the [start:size] syntax, the compiler reports errors. The documentation says “copy of item will be created for each parallel gang”; does that mean it is shared by the gang?
Thanks.
- Do I really need to use “copyin” and “copyout” here? Or, when is it a requirement to use “copyin” and “copyout”?
I don’t quite understand the question. There’s no requirement other than that the data needs to be available on the device. If the compiler can determine how to copy over the data, it will. However, since you’re writing in C/C++, it’s more likely that you will need to add a data clause to indicate how much data to move, or a “present” clause to indicate that the data is already on the device.
“copyin” and “copyout” just indicate the direction in which to copy the data. “copy” is both directions, while “create” only allocates memory on the device but does not synchronize.
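To make the four directions concrete, here is a minimal sketch. The functions double_plus_one and increment_in_place and their parameter names are hypothetical and do not come from the original code. Note that array sections in C are written as [start:length].

```c
#include <assert.h>

/* Hypothetical helper: out[i] = 2*in[i] + 1, using a device-side scratch buffer. */
void double_plus_one(const float *in, float *out, float *scratch, int n)
{
    assert(n > 0);
    /* copyin:  host -> device on entry, nothing copied back
       copyout: allocated on entry, device -> host on exit
       create:  allocated on the device, never synchronized */
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n]) create(scratch[0:n])
    for (int i = 0; i < n; i++) {
        scratch[i] = in[i] * 2.0f;
        out[i] = scratch[i] + 1.0f;
    }
}

/* Hypothetical helper: data[i] += 1, copied in both directions. */
void increment_in_place(float *data, int n)
{
    assert(n > 0);
    /* copy: host -> device on entry and device -> host on exit */
    #pragma acc parallel loop copy(data[0:n])
    for (int i = 0; i < n; i++) {
        data[i] += 1.0f;
    }
}
```

When the code is built without OpenACC support, the pragmas are ignored and the loops run serially on the host, which is a convenient way to check the logic first. The host contents of scratch should not be relied upon after the call, since “create” never copies them back.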
I thought “copyin” and “copyout” might only be necessary when you want to share some data across multiple “parallel” regions.
These are data clauses, not to be confused with a data region. A data region can span multiple compute regions (as well as subroutines) and has more to do with the lifetime of the device variables. Data clauses just specify the direction in which to copy and the size of the data. There’s also an “update” directive, which can be used to synchronize device and host data from within a data region.
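As a rough illustration of the difference, the sketch below wraps two compute regions in one data region and uses “update” to refresh the host copy in between. The function scale_then_sum is a hypothetical example, not part of the original code.

```c
#include <assert.h>

/* Hypothetical helper: scales a[] in place, then returns the sum of the scaled values. */
float scale_then_sum(float *a, int n, float factor)
{
    assert(n > 0);
    float sum = 0.0f;

    /* The data region controls the lifetime of the device copy of a[]:
       it is allocated and copied in once, then reused by both loops. */
    #pragma acc data copyin(a[0:n])
    {
        /* First compute region: the data is already present on the device. */
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; i++) {
            a[i] *= factor;
        }

        /* copyin alone would never copy a[] back; update does it explicitly. */
        #pragma acc update self(a[0:n])

        /* Second compute region reuses the same device copy. */
        #pragma acc parallel loop present(a[0:n]) reduction(+:sum)
        for (int i = 0; i < n; i++) {
            sum += a[i];
        }
    }
    return sum;
}
```

Without the data region, each compute region would need its own data clauses and the array would be transferred once per region.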
- I used private here to declare “in_data_copy” as private for each thread, yet when I declare the size using the [start:size] syntax, the compiler reports errors.
What was the error? From what I can tell, “in_data_copy” is an array of structs. If it’s a fixed-size struct, then you should be fine. But I’m guessing you have dynamic data members, in which case you can’t privatize them, since the compiler has no way of knowing how big the struct is. Aggregate data types with dynamic data members are not supported within data clauses either. It’s the biggest limitation in OpenACC and one the standards committee is looking to address. But it’s a very difficult issue, so it will take some time.
If you’re interested, my GTC2015 talk (https://www.youtube.com/watch?v=rWLmZt_u5u4) on OpenACC C++ Class Management touches upon the issue.
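For the fixed-size case, a sketch along the following lines should be privatizable. It assumes the struct holds its features in a fixed-size array; record_t, NUM_FEATURE, NUM_RECORD, and shifted_sums are illustrative names and sizes, not the poster's actual definitions.

```c
#include <assert.h>

#define NUM_FEATURE 8    /* assumed compile-time feature count */
#define NUM_RECORD  16   /* assumed small record count, for illustration only */

typedef struct {
    float record[NUM_FEATURE];   /* fixed size, so sizeof(record_t) is known */
} record_t;

/* Hypothetical helper: for each trial t, fills a private copy of the data
   shifted by t and stores the sum of its values in result[t]. */
void shifted_sums(const record_t *in_data, float *result, int trials)
{
    assert(trials > 0);
    record_t scratch[NUM_RECORD];   /* whole array has a compile-time size */

    /* Only the outer loop is parallel; the inner loops are marked seq. */
    #pragma acc parallel loop gang copyin(in_data[0:NUM_RECORD]) \
        copyout(result[0:trials]) private(scratch)
    for (int t = 0; t < trials; t++) {
        float sum = 0.0f;
        #pragma acc loop seq
        for (int j = 0; j < NUM_RECORD; j++) {
            #pragma acc loop seq
            for (int l = 0; l < NUM_FEATURE; l++) {
                scratch[j].record[l] = in_data[j].record[l] + (float)t;
                sum += scratch[j].record[l];
            }
        }
        result[t] = sum;
    }
}
```

If the struct instead held a pointer such as float *record, the pointed-to storage would not be covered by private(scratch), which is the limitation described above.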
The documentation says “copy of item will be created for each parallel gang”; does that mean it is shared by the gang?
If you privatize a variable on a gang loop, the variable will be shared by all workers and vectors within the same gang.
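The sketch below shows what that sharing looks like in practice. The function row_square_sums and the constant WIDTH are hypothetical. Each gang gets its own tmp, and the vector lanes inside that gang all see the same copy, which is safe here only because each lane writes a distinct element.

```c
#include <assert.h>

#define WIDTH 4   /* assumed row width, for illustration only */

/* Hypothetical helper: out[i] = sum over j of in[i*WIDTH + j] squared. */
void row_square_sums(const float *in, float *out, int rows)
{
    assert(rows > 0);
    float tmp[WIDTH];   /* one copy per gang, shared by that gang's vector lanes */

    #pragma acc parallel loop gang private(tmp) \
        copyin(in[0:rows*WIDTH]) copyout(out[0:rows])
    for (int i = 0; i < rows; i++) {
        /* Lanes write different elements of the gang's shared tmp. */
        #pragma acc loop vector
        for (int j = 0; j < WIDTH; j++) {
            tmp[j] = in[i * WIDTH + j] * in[i * WIDTH + j];
        }

        float sum = 0.0f;
        /* Lanes read the shared tmp and combine their partial sums. */
        #pragma acc loop vector reduction(+:sum)
        for (int j = 0; j < WIDTH; j++) {
            sum += tmp[j];
        }
        out[i] = sum;
    }
}
```

If every lane needed its own scratch copy instead, the private clause would go on the inner vector loop rather than on the gang loop.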