(little_jimmy's project is now) ashes and dust: pinned memory segmentation fault

hello,

the following line is happy to return to me a gift-wrapped segmentation fault

cudaMemcpyAsync(nlp_pnt_jac->h_temp_out_mul_sum,
                nlp_pnt_jac->d_out_mul_sum, sizeof(double) * lint[0],
                cudaMemcpyDeviceToHost, s[0]);

simply allocating nlp_pnt_jac->h_temp_out_mul_sum as ordinary pageable memory, instead of pinned memory, of course removes the problem:

cudaMallocHost(&nlp_pnt_jac->h_temp_out_mul_sum, sizeof(double) * 3);

vs

nlp_pnt_jac->h_temp_out_mul_sum = new double[3];

a cudaGetLastError() after the memory allocation and immediately prior to the memory copy did not return an error
MALLOC_CHECK_ is set to 2, for what it is worth
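
for completeness, a minimal sketch of checking the cudaError_t returned by each call directly, rather than polling cudaGetLastError() afterwards (the CUDA_CHECK macro is a hypothetical helper, not from my actual code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n", \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

CUDA_CHECK(cudaMallocHost(&nlp_pnt_jac->h_temp_out_mul_sum, sizeof(double) * 3));
CUDA_CHECK(cudaMemcpyAsync(nlp_pnt_jac->h_temp_out_mul_sum,
                           nlp_pnt_jac->d_out_mul_sum, sizeof(double) * lint[0],
                           cudaMemcpyDeviceToHost, s[0]));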

this particular pinned memory allocation is one of several; the allocations must be pinned, as they generally receive key results returned asynchronously from the device

cudaMallocHost(&nlp_pnt_jac->h_temp_out_mul_sum, sizeof(double) * 3);
cudaMallocHost(&nlp_pnt_jac->h_mul_sum_store, sizeof(double));
cudaMallocHost(&nlp_pnt_jac->h_le_out_status, sizeof(double));
cudaMallocHost(&nlp_pnt_jac->h_sug_coeff_delta,
               sizeof(double) * nlp_pnt_jac->coeff_cnt);

interestingly, the debugger would show the address as unique, but the array's starting value, as generally reported by the debugger, was the same as that of one of the other pinned memory arrays

at this point, i am hypothesizing that the driver is somehow (incorrectly) grouping these small pinned memory allocations to reduce waste; it is the only way i can make sense of this

the next test, then, would be to allocate one pinned memory region equal to the total size needed, and to use pointers into this region instead
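
a minimal sketch of what i have in mind, assuming the allocation sizes listed above (the pool pointer name and offsets are illustrative, not from the actual code):

double *h_pinned_pool;
size_t total_cnt = 3 + 1 + 1 + nlp_pnt_jac->coeff_cnt;
cudaMallocHost(&h_pinned_pool, sizeof(double) * total_cnt);

nlp_pnt_jac->h_temp_out_mul_sum = h_pinned_pool;        // 3 doubles
nlp_pnt_jac->h_mul_sum_store    = h_pinned_pool + 3;    // 1 double
nlp_pnt_jac->h_le_out_status    = h_pinned_pool + 4;    // 1 double
nlp_pnt_jac->h_sug_coeff_delta  = h_pinned_pool + 5;    // coeff_cnt doubles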

your views?

Print out lint[0] immediately prior to making the call to cudaMemcpyAsync, in the case where the seg fault occurs.
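
Something along these lines (a minimal sketch; the bound of 3 assumes the sizeof(double) * 3 allocation shown above, with <cstdio> and <cassert> included):

printf("lint[0] = %ld\n", (long)lint[0]);  // immediately before the copy
assert(lint[0] <= 3);                      // h_temp_out_mul_sum holds only 3 doubles
cudaMemcpyAsync(nlp_pnt_jac->h_temp_out_mul_sum,
                nlp_pnt_jac->d_out_mul_sum, sizeof(double) * lint[0],
                cudaMemcpyDeviceToHost, s[0]);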

Other than that, I would suggest providing a (as short as possible) complete code that someone else could run, and see the issue.

seems no bug can escape the supreme wrath and fury of texas ranger txbob

in contemplating how to reproduce the issue, i shifted the particular memory copy to a point just after the application's memory footprint has been initialized, and just before the core functions are called - to confirm whether the memory copy is location-independent, and to estimate the minimum code necessary to reproduce

it turned out that the memory copy with the segmentation fault was indeed location-dependent; this way, i could isolate and identify another array that had been allocated with the wrong size
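
for reference, the relocation test looked roughly like this (init_memory_footprint and run_core_functions are placeholder names, not the actual function names):

init_memory_footprint();              // all host/device allocations completed
cudaMemcpyAsync(nlp_pnt_jac->h_temp_out_mul_sum,
                nlp_pnt_jac->d_out_mul_sum, sizeof(double) * lint[0],
                cudaMemcpyDeviceToHost, s[0]);
cudaStreamSynchronize(s[0]);          // a fault here implicates the copy itself
run_core_functions();                 // the rest of the application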

txbob, if my memory serves me, you once mentioned that memcheck can be combined with the debugger - something along those lines (sketch below)
seemingly, MALLOC_CHECK_ is host-side inclined, and lacking
i need a stronger tool, as 5 minutes into my application/code i start to i) forget, ii) assume, and iii) expect; hence my significant reliance on tools to keep me on the path
i do not necessarily see handing a program still in a pre-debugging state to a tool like valgrind, etc. as an option
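
if i recall correctly, the combination would be enabling memcheck from within cuda-gdb, something like this (my_app is a placeholder name):

$ cuda-gdb ./my_app
(cuda-gdb) set cuda memcheck on
(cuda-gdb) run

the standalone tool can also be run directly, as in: cuda-memcheck ./my_app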

I think this is all host-side code. I’m not sure what the best tool is to capture stack corruption (if that is what is happening here). I agree that MALLOC_CHECK_ is limited and will not catch some forms of stack corruption.

it is mixed code; but the host-side arrays and structures are piling up, as much of it is experimental code - i am constantly adding bits of functionality to note their effect

i have come to appreciate the debugger immediately halting on a device-side segmentation fault - any thread accessing an address outside of what has been allocated in the context of that array
i more or less expect(ed) the same with host-side arrays, but on a number of occasions i have now accessed arrays outside of what has been allocated, and the error typically manifests down the line as something completely different, which is much harder to debug
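
to illustrate, a host-side overrun like the following typically does not fault at the point of the write; it silently corrupts adjacent heap memory and only surfaces later (illustrative snippet, not from the actual code):

double *a = new double[3];  // room for exactly 3 doubles
a[5] = 1.0;                 // out-of-bounds write: usually no immediate fault;
                            // the corruption manifests much later, elsewhere
delete[] a;                 // may abort here (heap check), or not at all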