I have a kernel that moves data from two arrays into one new array so that I can sort it. I need to allocate this final array beforehand, so I first calculate the number of elements to move over (not all elements in the two arrays are copied, only the "good" ones). Since these two arrays describe the upper half (including the diagonal) of a Hermitian matrix, I think the total number of elements should be the sum over rows of 2*(number of good elements in that row) - 1 (so the diagonal isn't counted twice). I have a kernel to figure out the number:

```
__device__ long d_num_Elem; // set to 0 in host code before the launch
__global__ void GetNumElem(long2* H_pos, int lattice_Size){
    long row = blockDim.x*blockIdx.x + threadIdx.x;
    // H_pos[row][0].y stores the number of "good" elements in that row
    atomicAdd(&d_num_Elem, 2*(H_pos[ idx(row, 0, 2*lattice_Size + 2) ]).y - 1);
}
```

The idx(row, 0, 2*lattice_Size + 2) function maps H_pos[row][0] to its actual location in the 1D array. Once this kernel finishes, I copy the value of d_num_Elem back to the host and cudaMalloc H_sort with that many elements.

However, a later function that fills the sorting array needs to know where each row's good values should start, so that rows don't overwrite each other. To compute the starting position, I use:

```
long row = blockIdx.x;
long start = 0;
// sum the contributions of all earlier rows to find this row's offset
for (long ii = 0; ii < row; ii++){
    start += 2*(H_pos[ idx(ii, 0, size1) ]).y - 1;
}
```

But I keep getting segfaults writing to H_sort because start ends up greater than d_num_Elem (232102 vs. 232046). How is that possible? Am I doing something wrong?