YA unspecified launch failure

I’ve gathered that an “unspecified launch failure” is generic, but often caused by errant memory access. Below I have some partial code, and I’ve isolated the problem to one line:

extern __shared__ signed int sharedMem[];

// [... some code ..]

__global__ void kernel(signed int *timeSeriesArray, [...]) {
  signed int *globalReferenceTimeSeries = (signed int *)(timeSeriesArray + referenceTimeSeriesIndex);
  signed int *sharedReferenceTimeSeries = (signed int *)(sharedMem);

  // [... some code ..]

  for(unsigned int i = 0; i < timeSeriesLength; i++)
    // Problem line below:
    sharedReferenceTimeSeries[i] = globalReferenceTimeSeries[i];
}

But when I do

for(unsigned int i = 0; i < timeSeriesLength; i++)
  signed int foo = globalReferenceTimeSeries[i];

OR

for(unsigned int i = 0; i < timeSeriesLength; i++)
  sharedReferenceTimeSeries[i] = i;

OR even

for(unsigned int i = 0; i < timeSeriesLength; i++) {
  signed int foo = globalReferenceTimeSeries[i];
  sharedReferenceTimeSeries[i] = i;
}

OR

for(unsigned int i = 0; i < timeSeriesLength; i++) {
  signed int foo = 12;
  sharedReferenceTimeSeries[i] = foo;
}

No problem. Then I do:

for(unsigned int i = 0; i < timeSeriesLength; i++) {
  signed int foo = globalReferenceTimeSeries[i];
  sharedReferenceTimeSeries[i] = foo;
}

And I’m back to the “unspecified launch failure” error with kernel termination.

Any ideas on:

  1. How this is even possible (my C/C++ is 15 years old, so consider me a n00b if it’s an obvious pointer issue or some such thing)

  2. How might I address this?

Thanks,

-CP

In all but the first and last case the compiler will optimize away the access to globalReferenceTimeSeries[i] because its value is not actually used. So this strongly hints at a problem with timeSeriesArray or referenceTimeSeriesIndex.
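
To make the dead-load effect concrete, here is a minimal sketch (not the original kernel). In deadLoad the value read from src is never used, so an optimizing build may drop the read entirely, and a bad src pointer can go unnoticed. In liveLoad the value is written out, so the read has to happen. The kernel names, the launchAndSync helper, and the sizes are made up for illustration. The sketch also checks both the launch and the following synchronization, since kernel faults are reported asynchronously.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The loaded value is never used, so an optimizing compiler may remove
// the read from src; a bad src pointer can then go unnoticed.
__global__ void deadLoad(const int *src, int *out, int n) {
    for (int i = 0; i < n; i++) {
        int foo = src[i];  // dead: the value never reaches memory
        (void)foo;         // only silences the unused-variable warning
        out[i] = i;
    }
}

// The loaded value is stored, so the read from src must be performed.
__global__ void liveLoad(const int *src, int *out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = src[i];
    }
}

using KernelFn = void (*)(const int *, int *, int);

// Hypothetical helper: returns the first error seen, either from the
// launch itself or from the kernel's execution.
cudaError_t launchAndSync(KernelFn kernel, const int *src, int *out, int n) {
    kernel<<<1, 1>>>(src, out, n);
    cudaError_t err = cudaGetLastError();  // launch configuration errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // errors raised while running
}

int main() {
    const int n = 16;
    int host[n];
    for (int i = 0; i < n; i++) host[i] = 2 * i;

    int *src = nullptr;
    int *out = nullptr;
    if (cudaMalloc(&src, n * sizeof(int)) != cudaSuccess ||
        cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(src, host, n * sizeof(int), cudaMemcpyHostToDevice);

    printf("deadLoad: %s\n",
           cudaGetErrorString(launchAndSync(deadLoad, src, out, n)));
    printf("liveLoad: %s\n",
           cudaGetErrorString(launchAndSync(liveLoad, src, out, n)));

    cudaFree(src);
    cudaFree(out);
    return 0;
}
```

Passing the host array instead of src to liveLoad would typically surface as a launch failure at the synchronize call, while deadLoad may run without complaint in an optimized build. Both outcomes depend on the compiler flags and on whether the system lets the GPU access ordinary host memory, so treat this as an experiment rather than a guarantee.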

I also tested

signed int foo;
for(unsigned int i = 0; i < timeSeriesLength; i++) {
  foo = globalReferenceTimeSeries[i];
  sharedReferenceTimeSeries[i] = i;
}

which worked. Since “foo” is used later, I would expect the compiler not to be able to optimize away globalReferenceTimeSeries[i]. But

for(unsigned int i = 0; i < timeSeriesLength; i++) {
  sharedReferenceTimeSeries[0] = globalReferenceTimeSeries[0];
}

fails. So, there’s definitely something I’m missing here. I’m trying to reproduce the error in a “clean” kernel, but I am really at a loss here.

Thanks tera, you were right. Very right.

  1. globalReferenceTimeSeries was a pointer to host memory in the kernel in question.
  2. “foo” was used later, but not with the value assigned from globalReferenceTimeSeries[i]. That’s some sophisticated optimization.

Is there a way to get more verbose warnings (e.g., using a host pointer on a device)?

If foo is used later, then probably all but the last read of globalReferenceTimeSeries[i] are optimized away. This would point to a (usually less common) problem at the beginning of the array.

You can verify that theory by checking that

signed int foo = 0;
for(unsigned int i = 0; i < timeSeriesLength; i++) {
  foo += globalReferenceTimeSeries[i];
  sharedReferenceTimeSeries[i] = i;
}

fails but

signed int foo = globalReferenceTimeSeries[timeSeriesLength-1];
for(unsigned int i = 0; i < timeSeriesLength; i++) {
  sharedReferenceTimeSeries[i] = i;
}

works (assuming timeSeriesLength>0).

EDIT: Missed your last post. Glad you found the problem already. Unfortunately I don’t know of a tool that would catch such a problem at compile time. Sounds like it would be a fun project to adapt sparse to this.

Ocelot should have been able to catch this at runtime, though.
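
Short of a compile-time warning, one option is to ask the runtime about a pointer before handing it to a kernel. Below is a minimal sketch of such a check using cudaPointerGetAttributes. The helper name isDevicePointer is made up for illustration, and the code assumes CUDA 10 or newer, where cudaPointerAttributes has a type field. It only treats device and managed allocations as valid; pinned or mapped host memory would need a different policy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: true only if the runtime reports ptr as device or
// managed memory. Assumes CUDA 10+ (cudaPointerAttributes::type).
bool isDevicePointer(const void *ptr) {
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        // Older runtimes report plain host pointers as an error;
        // clear it so later calls do not see a stale error code.
        cudaGetLastError();
        return false;
    }
    return attr.type == cudaMemoryTypeDevice ||
           attr.type == cudaMemoryTypeManaged;
}

int main() {
    const size_t bytes = 16 * sizeof(int);

    // Ordinary host allocation: should not be passed to a kernel.
    int *hostPtr = static_cast<int *>(malloc(bytes));
    printf("host pointer reported as device memory: %d\n",
           isDevicePointer(hostPtr) ? 1 : 0);

    // Device allocation: only checked if the allocation succeeded.
    int *devPtr = nullptr;
    if (cudaMalloc(&devPtr, bytes) == cudaSuccess) {
        printf("device pointer reported as device memory: %d\n",
               isDevicePointer(devPtr) ? 1 : 0);
        cudaFree(devPtr);
    }

    free(hostPtr);
    return 0;
}
```

A check like this could sit in a debug-only assertion in front of each kernel launch. Running the program under NVIDIA's cuda-memcheck (or its successor, compute-sanitizer) is another runtime option that reports invalid accesses from inside a kernel.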
