What I want to achieve is for all length_Theta * M threads to be computed together, because there is no data dependency between them, but it seems that the compiler does not see it that way.
With the “kernels” construct, the compiler must prove there are no dependencies in order to parallelize the loops. However, since you’re using computed indices, the compiler can’t tell whether the accesses to “a” are independent across loop iterations.
To fix this, use either “kernels loop independent” or the “parallel” construct, where “independent” is the default for loops. The “independent” clause asserts to the compiler that there are no dependencies between loop iterations.
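Since your original code isn't shown here, the following is only a minimal sketch of the two options, not your actual loop. The array name “a” comes from your post; the function names, the parameters length_theta, m and factor, and the flattened index i * m + j are assumptions made for illustration. Adding collapse(2) is also my assumption about how you would want to expose all length_Theta * M iterations at once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: scale a flattened length_theta x m array.
 * The element index is computed from (i, j), which is the kind of
 * access the compiler cannot prove independent by itself. */

/* Option 1: keep "kernels", but assert independence explicitly. */
void scale_all_kernels(float *a, int length_theta, int m, float factor)
{
    assert(a != NULL && length_theta > 0 && m > 0);

    /* "independent" is a promise from the programmer that no two
     * iterations touch the same element of a. collapse(2) treats both
     * loops as one iteration space of length_theta * m iterations. */
    #pragma acc kernels loop independent collapse(2) copy(a[0:length_theta*m])
    for (int i = 0; i < length_theta; ++i) {
        for (int j = 0; j < m; ++j) {
            int idx = i * m + j;   /* computed index (assumed layout) */
            a[idx] *= factor;
        }
    }
}

/* Option 2: use "parallel", where loops are treated as independent
 * unless "seq" or "auto" is specified. */
void scale_all_parallel(float *a, int length_theta, int m, float factor)
{
    assert(a != NULL && length_theta > 0 && m > 0);

    #pragma acc parallel loop collapse(2) copy(a[0:length_theta*m])
    for (int i = 0; i < length_theta; ++i) {
        for (int j = 0; j < m; ++j) {
            int idx = i * m + j;
            a[idx] *= factor;
        }
    }
}
```

Keep in mind that “independent” is an assertion, not a check. If two iterations really do write the same element, the compiler will still parallelize the loop and the results will be wrong. With the NVIDIA HPC compilers, building with -acc -Minfo=accel should show in the compiler feedback whether the loops were parallelized.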