Dynamically extending reserved global memory from within a kernel thread.

Good morning, I have a question about the following task.


Given is a two-dimensional array "a[N][M]", i.e. N rows of length M.
Each element of the array contains a random integer value between 0 and 16.

Write a kernel "compact(int *a, int *listM, int *listN)" that consists of only one block of N threads, where each thread counts, for one row of the array, how many elements have the value 16.
The threads write these counts into an array "num" of length N in shared memory, and then (after a barrier) one of the threads executes the prefix-sum code "PrefixSum(int *num, int N)" (a comment in the kernel below explains what it does).
Finally (again after a barrier), each thread "Idx" writes the row and column positions (the "x- and y-coordinates") of the elements of its row that have the value 16 into two arrays "listM" and "listN" in global memory, starting at position "num[Idx]" in these arrays.
The prefix sum mentioned above exists to make this last step easier.
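
I do not reproduce the PrefixSum code itself here, so as a rough sketch of the behaviour I mean (my own assumption, run sequentially by a single thread and matching the example in the kernel comment below; the lowercase parameter name "n" is only there to avoid clashing with my N macro):

__device__ void PrefixSum(int *num, int n)
{
	// exclusive prefix sum over "num", e.g. num{2,4,3,1} becomes num{0,2,6,9}
	int sum = 0;
	for (int k = 0; k < n; k++)
	{
		int tmp = num[k];
		num[k] = sum;
		sum += tmp;
	}
}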

I’ve written a kernel and a suitable main to test it. However, I still have a problem that I cannot solve.

The two arrays "listM" and "listN" are supposed to store the individual positions of every 16 that occurs in the array "a[N][M]".
Their size must therefore equal the total number of occurrences of 16, which can vary.
Since the exact number of elements with the value 16 is not known in advance, the amount of memory needed for "listM" and "listN" is only known while the kernel is running.
Of course, one could simply allocate enough memory at program start for the maximum possible number, namely N times M elements, but that would be very inefficient.
Is it possible to write the kernel so that every single thread dynamically enlarges the two arrays "listM" and "listN" after counting the number of elements with the value 16 in its row (by exactly that number)?
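
To make clearer what I have in mind, here is a rough sketch (purely my own assumption, not code I actually use) based on the device-side malloc()/free() that CUDA provides inside kernels; the kernel name and the pointers "myM"/"myN" are hypothetical:

// Hypothetical sketch only: each thread allocates exactly "elements" ints
// from the device heap for its own row. This is NOT my real kernel.
__global__ void per_thread_alloc_sketch(int* a)
{
	int Idx = threadIdx.x;
	int elements = 0;

	for (int i = 0; i < M; i++)
		if (a[Idx * M + i] == 16)
			elements++;

	// CUDA allows malloc()/free() in device code, so in principle each thread
	// could allocate a buffer of exactly the right size for its row ...
	int* myM = (int*)malloc(elements * sizeof(int));
	int* myN = (int*)malloc(elements * sizeof(int));

	if (myM != NULL && myN != NULL)
	{
		int k = 0;
		for (int i = 0; i < M; i++)
		{
			if (a[Idx * M + i] == 16)
			{
				myM[k] = Idx;
				myN[k] = i;
				k++;
			}
		}
		// ... but these are separate per-thread buffers, not an enlargement of
		// the original listM/listN, and as far as I understand the host cannot
		// read device-heap memory back with cudaMemcpy. That is exactly my problem.
		free(myM);
		free(myN);
	}
}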

Here is my kernel:

__global__ void compact(int* a, int* listM, int* listN)
{
	int Idx = threadIdx.x;
	int elements, i;

	i = elements = 0;

	__shared__ int num[N];

	for (i = 0; i < M; i++)
	{
		// "a" is passed as a flat pointer, so row Idx, column i is a[Idx * M + i]
		if (a[Idx * M + i] == 16)
		{
			elements++;
		}
	}
	num[Idx] = elements;

        //Here, at this point, the thread knows the number of elements with the value 16 in its row and would
        //need to allocate exactly that much extra memory in "listM" and "listN". Is that possible?

	__syncthreads();

	if (Idx == 0)
	{
                //This function replaces each element of the array "num" with the total number of 16s
                //counted in all previous rows (an exclusive prefix sum).
                //Example: Input: num{2,4,3,1} Output: num{0,2,6,9}
		PrefixSum(num, N);
	}

	__syncthreads();

        // The output of PrefixSum(num, N) can now be used for the last task: each thread writes the
        // "coordinates" of every 16 in its row into "listM" and "listN", starting at the offset num[Idx]
        // (the number of 16s counted in all previous rows).
	elements = 0;
	for (i = 0; i < M; i++)
	{
		if (a[Idx * M + i] == 16)
		{
			// advance one slot per 16 found, not by the column index i
			listM[num[Idx] + elements] = Idx;
			listN[num[Idx] + elements] = i;
			elements++;
		}
	}
}
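
For context, here is a minimal sketch of the kind of test main I mean (a simplified assumption on my part, not my exact code). Note the worst-case allocation of N * M elements for "listM" and "listN", which is exactly what I would like to avoid:

#include <cstdlib>

#define N 4
#define M 8

// compact kernel as shown above
__global__ void compact(int* a, int* listM, int* listN);

int main()
{
	int h_a[N * M];
	for (int k = 0; k < N * M; k++)
		h_a[k] = rand() % 17;                     // random values between 0 and 16

	int *d_a, *d_listM, *d_listN;
	cudaMalloc(&d_a, N * M * sizeof(int));
	cudaMalloc(&d_listM, N * M * sizeof(int));    // worst case: every element is a 16
	cudaMalloc(&d_listN, N * M * sizeof(int));
	cudaMemcpy(d_a, h_a, N * M * sizeof(int), cudaMemcpyHostToDevice);

	compact<<<1, N>>>(d_a, d_listM, d_listN);     // one block of N threads
	cudaDeviceSynchronize();

	int h_listM[N * M], h_listN[N * M];
	cudaMemcpy(h_listM, d_listM, N * M * sizeof(int), cudaMemcpyDeviceToHost);
	cudaMemcpy(h_listN, d_listN, N * M * sizeof(int), cudaMemcpyDeviceToHost);

	cudaFree(d_a);
	cudaFree(d_listM);
	cudaFree(d_listN);
	return 0;
}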

Thx in advance, and have a nice Sunday.

https://stackoverflow.com/questions/59127547/cuda-dynamically-reallocate-more-global-memory-in-kernel

Hi Robert,
Thx for the detailed answer.

Have a good day.