CUDA array filtering kernel without a for loop

I have a large array A with size_A rows and 6 columns. I want to check the 4th element of each row and, if it is nonzero, copy that row into another array B. Can I compute the write index into B without using a for loop? Please see the code below.

I would probably need to make b_ptr persist across threads somehow (similar to a static variable in C), but I think that is not allowed in CUDA.

__global__ void filtering_kernel(float* A, int size_A, float* B, float* size_B)
{
    /*B and size_B are the outputs*/
    int b_ptr = 0;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= size_A) return;
    for (int i = 0; i < size_A; i++)
    {
        if (A[x + 3] != 0)
        {
            B[b_ptr] = A[x + 0];
            B[b_ptr + 1] = A[x + 1];
            B[b_ptr + 2] = A[x + 2];
            B[b_ptr + 3] = A[x + 3];
            B[b_ptr + 4] = A[x + 4];
            B[b_ptr + 5] = A[x + 5];
            b_ptr += 6;
            *size_B = *size_B + 1;
        }
    }
}
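
For reference, here is a minimal sketch of one common way to avoid the loop. It assumes A is stored row-major (row r starts at A[6 * r]), that B has room for size_A rows, and that the order of rows in B does not matter. Each thread handles one row and reserves its output slot with atomicAdd on a device-side integer counter, which plays the role of the "static" b_ptr. The names filter_rows_kernel, filter_rows and kCols are hypothetical, the counter is an int rather than the float* in the kernel above, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kCols = 6;  // columns per row, as in the question

// One thread per row of A (assumed row-major). A kept row reserves a unique
// output row in B via atomicAdd, so no loop over rows is needed.
__global__ void filter_rows_kernel(const float* A, int size_A, float* B, int* size_B)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= size_A) return;

    const float* src = A + row * kCols;
    if (src[3] != 0.0f) {
        // atomicAdd returns the old counter value: this thread's output row.
        int slot = atomicAdd(size_B, 1);
        float* dst = B + slot * kCols;
        for (int c = 0; c < kCols; ++c) dst[c] = src[c];  // copy the 6 columns
    }
}

// Hypothetical host helper: filters A into B and returns the number of kept rows.
int filter_rows(const std::vector<float>& A, std::vector<float>& B)
{
    int size_A = static_cast<int>(A.size()) / kCols;
    B.clear();
    if (size_A == 0) return 0;

    size_t bytes = static_cast<size_t>(size_A) * kCols * sizeof(float);
    float* d_A = nullptr;
    float* d_B = nullptr;
    int* d_count = nullptr;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);             // worst case: every row is kept
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_A, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int)); // the counter must start at zero

    int threads = 256;
    int blocks = (size_A + threads - 1) / threads;
    filter_rows_kernel<<<blocks, threads>>>(d_A, size_A, d_B, d_count);

    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    B.resize(static_cast<size_t>(count) * kCols);
    cudaMemcpy(B.data(), d_B, B.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_count);
    return count;
}
```

Calling filter_rows(A, B) on a host vector A returns the number of kept rows and leaves them in B. Because threads claim slots in whatever order they reach atomicAdd, the rows in B are generally not in their original order. If the order must be preserved, a stream-compaction primitive such as thrust::copy_if (or a prefix sum over the keep flags) is the usual alternative.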
