hello,
the following line happily returns me a gift-wrapped segmentation fault:
cudaMemcpyAsync(nlp_pnt_jac->h_temp_out_mul_sum,
                nlp_pnt_jac->d_out_mul_sum, sizeof(double) * lint[0],
                cudaMemcpyDeviceToHost, s[0]);
simply allocating nlp_pnt_jac->h_temp_out_mul_sum as ordinary pageable memory, instead of pinned memory, of course removes the problem:
cudaMallocHost(&nlp_pnt_jac->h_temp_out_mul_sum, sizeof(double) * 3);
vs
nlp_pnt_jac->h_temp_out_mul_sum = new double[3];
a cudaGetLastError() after the memory allocation, and again just prior to the memory copy, did not return an error
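for reference, checking the return values directly would look something like the sketch below, reusing the names from the snippets above; the printouts are only illustrative, and it assumes lint[0] never exceeds the 3 elements allocated

cudaError_t err;

// allocation, checking the return value directly
err = cudaMallocHost(&nlp_pnt_jac->h_temp_out_mul_sum, sizeof(double) * 3);
if (err != cudaSuccess)
    printf("cudaMallocHost: %s\n", cudaGetErrorString(err));

// async copy back into the pinned buffer (assumes lint[0] <= 3)
err = cudaMemcpyAsync(nlp_pnt_jac->h_temp_out_mul_sum,
                      nlp_pnt_jac->d_out_mul_sum, sizeof(double) * lint[0],
                      cudaMemcpyDeviceToHost, s[0]);
if (err != cudaSuccess)
    printf("cudaMemcpyAsync: %s\n", cudaGetErrorString(err));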
MALLOC_CHECK_ is set to 2, for what it is worth
this particular pinned memory allocation is one of several; the allocations must be pinned, as they generally receive key results returned asynchronously from the device:
cudaMallocHost(&nlp_pnt_jac->h_temp_out_mul_sum, sizeof(double) * 3);
cudaMallocHost(&nlp_pnt_jac->h_mul_sum_store, sizeof(double));
cudaMallocHost(&nlp_pnt_jac->h_le_out_status, sizeof(double));
cudaMallocHost(&nlp_pnt_jac->h_sug_coeff_delta, sizeof(double) * nlp_pnt_jac->coeff_cnt);
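the intended usage pattern is roughly the sketch below; the kernel name, launch configuration and the d_mul_sum_store device buffer are placeholders for whatever actually produces the results on the device

// hypothetical kernel and device buffer standing in for the real ones
some_kernel<<<grid, block, 0, s[0]>>>(nlp_pnt_jac->d_mul_sum_store);
cudaMemcpyAsync(nlp_pnt_jac->h_mul_sum_store, nlp_pnt_jac->d_mul_sum_store,
                sizeof(double), cudaMemcpyDeviceToHost, s[0]);

// the host must not read the pinned buffer before the stream has drained
cudaStreamSynchronize(s[0]);
double mul_sum = nlp_pnt_jac->h_mul_sum_store[0];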
what is interesting is that the debugger shows the address as unique, but the starting value of the array, as the debugger generally reports it, is the same as that of one of the other pinned memory arrays
at this point, i am hypothesizing that the driver is somehow (incorrectly) grouping these small pinned memory allocations to reduce waste; it is the only way i can make sense of this
the next test, then, would be to allocate one pinned memory region equal to the total size needed, and to rather use offset pointers into this region; a sketch of that follows below
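something along the lines of the sketch below, with the sizes taken from the allocations above; a single cudaFreeHost(h_pool) would then release everything

// total size of all the small pinned buffers above, in one block
size_t total = sizeof(double) * (3 + 1 + 1 + nlp_pnt_jac->coeff_cnt);

double *h_pool;
cudaMallocHost(&h_pool, total);

// carve the individual host pointers out of the single pinned region
nlp_pnt_jac->h_temp_out_mul_sum = h_pool;          // 3 doubles
nlp_pnt_jac->h_mul_sum_store    = h_pool + 3;      // 1 double
nlp_pnt_jac->h_le_out_status    = h_pool + 4;      // 1 double
nlp_pnt_jac->h_sug_coeff_delta  = h_pool + 5;      // coeff_cnt doubles

// the async copies then stay exactly as they are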
your views?