cudaEventRecord() segmentation fault

little_jimmy · May 10, 2015, 2:52pm

hello,

i am lost for words

a cudaEventRecord() placed immediately after a function call, where that function contains the stream memory copy to be event-tagged, causes a segmentation fault, whereas the same cudaEventRecord() placed within that function, just after the memory copy, does not

the same cudaEvent_t event and cudaStream_t stream are at play in both cases; hence it cannot be a case of the event or stream not being properly initialized or whatever

why is this…?

the function call, with the trailing cudaEventRecord():

nlp_pnt_jac_post_init_arr_setup(&nlp_pnt_jac, nlp_data,
s[0], e[nlp_pnt_jac_stream_cnt]);

cudaEventRecord(e[nlp_pnt_jac_stream_cnt], s[0]);

the function declaration, with the within cudaEventRecord():

void nlp_pnt_jac_post_init_arr_setup(NLP_pnt_jac* nlp_pnt_jac, NLP_data* nlp_data,
cudaStream_t s0, cudaEvent_t trigger1);

cudaEventRecord(trigger1, s0);

Robert_Crovella · May 10, 2015, 3:51pm

If you’re suggesting there’s a problem in something you’ve shown, I don’t think so. I think the problem is in something you haven’t shown.

$ cat t765.cu


typedef int NLP_pnt_jac;
typedef int NLP_data;

void nlp_pnt_jac_post_init_arr_setup(NLP_pnt_jac* nlp_pnt_jac, NLP_data* nlp_data,
cudaStream_t s0, cudaEvent_t trigger1){

  cudaEventRecord(trigger1, s0);}

int main(){

  NLP_pnt_jac nlp_pnt_jac;
  NLP_data nlp_data[10];
  cudaStream_t s[10];
  cudaEvent_t e[10];
  cudaStreamCreate(&(s[0]));
  cudaEventCreate(&(e[0]));
  int nlp_pnt_jac_stream_cnt = 0;

  nlp_pnt_jac_post_init_arr_setup(&nlp_pnt_jac, nlp_data,
                                  s[0], e[nlp_pnt_jac_stream_cnt]);

  cudaEventRecord(e[nlp_pnt_jac_stream_cnt], s[0]);

  return 0;
}
$ nvcc -o t765 t765.cu
$ cuda-memcheck ./t765
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

General stack corruption in your program seems like a possibility.

little_jimmy · May 11, 2015, 6:13am

"General stack corruption in your program seems like a possibility. "

MALLOC_CHECK_ is set to 2; so, i am not sure how to interpret this

also, this is a functional enhancement - a new program branch with a number of additional functions - added to a base program generally considered to be ‘clean’

i think your test is overly simplistic, compared to my program
i did not want to write a test case, as i am still debugging, but actually did
i managed to reproduce the instance of the segmentation fault, if not the error

add a breakpoint to line 43 of funct.cu, and (attempt to) step it; see if you too register a segmentation fault

you can dump all the files in the root of a project
the test code only contains enough to reproduce the instance, and not all code
as the instance occurs on the program (branch in my case) commencing, and not when it is starting to terminate, i have not included the additional clean-up functions

perhaps i am doing something ridiculously stupid; perhaps not

i normally put the event record within the function; this time i inserted it after the function
hence the reason why i picked this up only now

the zip file should contain 7 files - 5 headers; 2 source files
test_event.zip (8.95 KB)

Robert_Crovella · May 11, 2015, 9:02am

what’s this:

void nlp_pnt_jac_post_init_arr_setup(NLP_pnt_jac* nlp_pnt_jac, NLP_data* nlp_data,
        cudaStream_t s0, cudaEvent_t trigger1)
{
        int lint[0];
                 ^
                 |
                 a zero-length array

If I change that to:

int lint[1];

the error goes away for me. I think it might be a form of stack corruption that MALLOC_CHECK won’t necessarily point out.

little_jimmy · May 11, 2015, 9:19am

…

Robert_Crovella · May 11, 2015, 9:30am

Yes, I had written something, it was dumb, so I removed it. But I changed it above to what I think the issue is.

little_jimmy · May 11, 2015, 9:44am

yes, i picked it up a second ago, after having commented away half the program…

maybe i picked it up, because you picked it up…

a typical copy-paste error; but how can the debugger continue with an int array initialized to size 0???

this is again another case where the debugger - or whatever - lets one down, because it allows too much
i have had this feeling many times, after finding ridiculous errors countless hours later: the debugger allows too much
you debug, you become blind because the code appears ‘familiar’, and the debugger does not break or halt, so you assume the code must be correct, only to be pulled way back far down the line

thanks, txbob
remind me to thank you again tomorrow

Robert_Crovella · May 11, 2015, 9:59am

I think you might have to blame the C language (perhaps C arrays specifically) rather than the debugger.

I think a C implementation could easily set the following variable to zero:

int *lint;

which a debugger might easily spot (although the language makes no specific requirements in this respect that I am aware of.)

I don’t think a typical C implementation sees any difference between:

int lint[0];

and

int lint[1];

In both cases (pretty much regardless of what the array length is) the C implementation will have lint point to some region on the stack, and then reserve space on the stack (at that location) equal to the length of the array. When the length of the array is zero, then the “next” allocated stack variable might be sitting right there. In either of the above cases, the lint “pointer” might have the exact same numerical value (pointing to some region on the stack, and there may be no pointers at all, these are probably just stack references, i.e. addresses of locations on the stack, computed by the compiler).

The debugger has no way of knowing, based on the lint value, what the length of the array is. In other words, by the time you get to compiled code, I think the knowledge that the array length is zero is “gone”. There would be no “reasonable” way for a debugger to discover that.
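
A small host-only sketch of the same idea at the source level, in case it helps: once an array has decayed to a plain pointer, the value carries only an address, whereas the length is still available to the compiler as long as the array type is kept. The helper names `first_element`, `array_length` and `demo` are made up for this illustration and are not from the attached test code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: after decay to a pointer, only the address is left;
// nothing in the pointer value says how many elements sit behind it.
inline const int* first_element(const int* p) { return p; }

// Hypothetical helper: taking the array by reference keeps its full type,
// so the compiler can still deduce the length N at compile time.
template <std::size_t N>
constexpr std::size_t array_length(const int (&)[N]) { return N; }

// Usage sketch: two arrays of different lengths.
inline void demo() {
    int one[1] = {0};
    int twelve[12] = {0};

    // The length is known while the array type is still visible.
    assert(array_length(one) == 1);
    assert(array_length(twelve) == 12);

    // The decayed pointer is just the address of the first element.
    assert(first_element(twelve) == &twelve[0]);
}
```

This only shows what the compiler knows while the type is in view; it does not change what a debugger sees in the generated code.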

I think the best you could hope for is a warning at the compiler level. But since it is valid C code, there might be some use-cases for it (perhaps). Maybe there is some verbose warning you can enable for this, or maybe there is some lint-like tool that could spot this for you, which you would drag out before pulling your hair out.
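
If you would rather not rely on a warning flag, one possible approach (a sketch only, not something from the attached code) is a tiny wrapper that refuses to compile with a zero length. `ScratchArray` and `sum_of_counters` are hypothetical names invented for this example; it is plain C++11 host code, so it should also be usable in the host part of a .cu file.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper around a fixed-size local array of counters.
// Instantiating it with N == 0 is a compile-time error instead of a
// silently accepted zero-length array.
template <typename T, std::size_t N>
struct ScratchArray {
    static_assert(N > 0, "ScratchArray needs at least one element");
    T data[N];

    static constexpr std::size_t size() { return N; }

    // Bounds-checked access; the check is active unless NDEBUG is defined.
    T& at(std::size_t i) {
        assert(i < N && "ScratchArray index out of range");
        return data[i];
    }
};

// Usage sketch: three counters instead of lint1, lint2, lint3.
inline int sum_of_counters() {
    ScratchArray<int, 3> lint = {};   // ScratchArray<int, 0> would not compile
    lint.at(0) = 1;
    lint.at(1) = 2;
    lint.at(2) = 3;
    return lint.at(0) + lint.at(1) + lint.at(2);
}
```

The static_assert catches the copy-paste case at build time, and the assert in `at()` catches an out-of-range index in debug builds.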

little_jimmy · May 11, 2015, 10:08am

some of my host functions are complex rather than simplistic, requiring a number of int counters and pointers to complete
hence, i quickly end up with:

int lint1, lint2, … lint12;

therefore the switch to int lint across the board

and the debugger actually very happily stepped lint[0], with lint initialized to 0
but your explanation does give a reason as to how this could have happened
the debugger correctly stepped/executed the code, so i was blissfully unaware

i do not think i need a verbose warning (anymore); some errors you simply do not repeat, for obvious reasons…

Robert_Crovella · May 11, 2015, 3:06pm

Apparently there is a use for zero-length C arrays:

https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html

This is apparently considered an “extension” to ISO C

however apparently ISO C++ forbids them:

$ nvcc -Xcompiler -Wpedantic -o test funct.cu main.cu 2>&1 |grep "zero-size"
funct.cu:308:11: warning: ISO C++ forbids zero-size array ‘lint’ [-Wpedantic]
$