 I have a kernel that moves data from two arrays into one new array so that I can sort it. I need to malloc this last array beforehand, so I calculate the number of elements to move over (not all the elements in the two arrays are copied, only the “good” ones). Since these two arrays describe the upper half (including the diagonal) of a Hermitian matrix, I think that the total number of elements should be the sum of 2*(number of good elements per row) - 1 (don’t count the diagonal twice). I have a kernel to figure out the number:

``````__device__ long d_num_Elem; //this is set to 0 in host code

__global__ void GetNumElem(long2* H_pos, int lattice_Size){

long row = blockDim.x*blockIdx.x + threadIdx.x;

atomicAdd(&d_num_Elem, (2*(H_pos[ idx(row, 0, 2*lattice_Size + 2) ]).y) - 1); //H_pos[row].y stores the number of good elements in that row

}
``````

The idx(row, 0, 2*lattice_Size + 2) function maps H_pos[row] to its actual location in the 1D array. I copy over the value of d_num_Elem when this kernel is done running, and cudaMalloc H_sort to that size.

However, in a later function which fills up the sorting array. I need to know where to start putting in the good values for each row so I don’t overwrite things. To figure out the starting position, I use:

``````long row = blockIdx.x;

long start = 0;

for (long ii = 0; ii < row; ii++){

start += 2*(H_pos[ idx(ii, 0 , size1) ]).y - 1 ;

}
``````

But I keep getting segfaults in H_sort from start being greater (232102 vs 232046) than d_num_Elem. How is that possible? Am I doing something wrong?

That is true on average or summed over all rows, but it’s not true for the individual rows. Assuming you want to construct the full square matrix with the redundant elements, the number of elements per row of course is a constant.

I don’t understand why. If the matrix is sparse, isn’t there no guarantee on the number of nonzero elements per row other than it is less than the total possible length of the row? If I have:

a 0 0 0 b

c 0 d e

``````f 0 g

h 0

i
``````

Then, when I “fill out” cols 0 and 1 into the sorting array, and start to process row 2, before any elements from row 2 show up I would have:

• From row 0, which had 2 elements: (a, b, b*) - n = 2*2 - 1 = 3

• From row 1, which had 3 elements: (c, d, d*, e, e*) - n = 2*3 -1 = 5

So there would be eight elements before I started to put any from row 2 (or its conjugates) in. And when I try to figure out the total number of nonzero elements I’ll have:

• From row 0, which had 2 elements: (a, b, b*) - n = 2*2 - 1 = 3

• From row 1, which had 3 elements: (c, d, d*, e, e*) - n = 2*3 -1 = 5

• From row 2, which had 2 elements: (f, g, g*) - n = 2*2 - 1 = 3

• From row 3, which had 1 element: (h) - n = 2*1 - 1 = 1

• From row 4, which had 1 element: (i) - n = 2*1 - 1 = 1

So the total number of elements would be 13