Nvc++ PSTL: scan failed to synchronize: cudaErrorIllegalAddress

The code below works when --stdpar isn’t set (not passed to GPU), but fails when it is. The error being output is to obscure to figure out what is failing / where it’s failing:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  scan failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Can someone provide some hints as to what I’m doing wrong?

#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <execution>
#include <algorithm>
#include <numeric>

using namespace std;
using namespace std::execution;

struct Element {
    int idx;
    int up;
    int upLeft;
    int res;
};

int calcWeight(char *a, char *b) {
    if (a == NULL || b == NULL) { // indel score
        return -1;
    } else if (*a == *b) {
        return 1;
    } else {
        return 0;
    }
}

int main(void) {
    vector<char> str1 {'T', 'A', 'C', 'T'};
    vector<char> str2 {'G', 'A', 'C', 'G', 'T'};

    vector<int> prevRow(str1.size() + 1);
    vector<Element> row(str1.size() + 1);

    transform(
            par_unseq,
            row.begin(),
            row.end(),
            row.begin(),
            [](auto x) {
                x.idx = 1;
                return x;
            }
    );
    exclusive_scan(
            par_unseq,
            row.begin(),
            row.end(),
            row.begin(),
            Element { 0, 0, 0, -999999},
            [](auto a, auto b) {
                Element ret;
                ret.idx = a.idx + b.idx;
                return ret;
            }
    );
    transform(
            par_unseq,
            row.begin(),
            row.end(),
            row.begin(),
            [&str2](auto x) {
				x.up = 0;
				x.upLeft = 0;
				x.res = -999999;
                return x;
            }
    );

    transform_exclusive_scan(
            par_unseq,
            row.begin(),
            row.end(),
            prevRow.begin(),
			0,
			[](auto a, auto b) {
    			return a + b;
    		},
            [&str1](auto x) {
                return calcWeight(&str1[x.idx], NULL);
            }
    );


    for (auto str2It = str2.begin(); str2It != str2.end(); str2It++) {
        auto ch2 = *str2It;
        transform_inclusive_scan(
                par_unseq,
                row.begin(),
                row.end(),
                row.begin(),
    			[](auto a, auto b) {
        			Element ret;
					ret.up = b.up;
					ret.upLeft = b.upLeft;
					ret.res = max({
						a.res + b.res,
						b.up,
						b.upLeft
					});
					return ret;
        		},
                [&prevRow, &str1, &ch2](auto x) {
					auto i = x.idx;
					x.up = prevRow[i] + calcWeight(NULL, &ch2);
					if (i == 0) {
					} else {
						x.upLeft = prevRow[i-1] + calcWeight(&str1[i-1], &ch2);
					}
					x.res = calcWeight(&str1[i], NULL);
					return x;
                }
        );
        transform(
                par_unseq,
                row.begin(),
                row.end(),
                prevRow.begin(),
                [](auto x) {
                    return x.res;
                }
        );
    }

    cout << "Final weight:" << prevRow.back();

Hi offbynul,

We think the problem is all of the capture-by-reference. There are several capture-by-reference of std::vector objects. While vectors always store their data on the heap, so it is safe to iterate through them, the vectors themselves are on the stack. Also, “ch2” is on the stack as well, so can be used in capture-by-reference.

I tried to write an example but found other issues going. When I compile with optimization targeting the host, the code seg faults. Same host code built with GNU 10.2 runs fine. Hence, I’ve passed this one to one of our C++ compiler engineers to take a look.

I’ll update you once I know more.

-Mat

Thank you.

Mat,

It’s sounding like PSTL GPU offload is still a work-in-progress. Does it make more sense to move directly to thrust? What I’m trying to do is get access to parallel primitives (reduce, map, prefix sum, sort, etc…) without having to write them myself as CUDA/SYCL kernels.

We were able to figure out that the segv I was seeing on the host was due to ret.idx not being initialized so some indices came back with garbage values.

		[](auto a, auto b) {
			Element ret;
                                    ret.idx = 0;
				ret.up = b.up;
				ret.upLeft = b.upLeft;
				ret.res = max({
					a.res + b.res,
					b.up,
					b.upLeft
				});
				return ret;

I’m still having trouble getting getting you variables off the stack and only passing in heap variables. But hopefully you’ll have better luck.

It’s sounding like PSTL GPU offload is still a work-in-progress.

Yes and no. C++ standard language parallelism is new so there could be issues. Though the problem here is that the GPU can’t access variables on the host’s stack or static memory. We implicitly allocate heap memory using CUDA Unified Memory so the memory is accessible on both the host and device. Hopefully at some point the device will be able to access the host’s stack, but this requires support in the host OS and CUDA Driver so is not expected in the near future.

Does it make more sense to move directly to thrust?

Possibly, but our C++ STDPAR implementation is built on top of Thrust so you’d likely encounter the same issues. However, you would be able to explicitly manage the data yourself, versus relying on UM.