For loop runtime optimization with streams

I have code that computes this formula: the summation from k=0 to n-1 of (A^k ⊗ B^k), where ⊗ denotes the Kronecker product of matrices.
I used a for loop to compute each term (A^k ⊗ B^k) and add it to the result matrix, but the runtime is unreasonably high.
From what I've read, the runtime blows up because the launches pile up in queues, and launching kernels inside a for loop is generally problematic, but I couldn't find a good solution.
I tried running the iterations on a few different streams, but the runtime didn't drop.
I want to know why my stream solution doesn't work, and whether there is a better or different solution to this loop problem.
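To make the data dependency explicit, here is a minimal CPU sketch of the same computation (illustrative names, not my actual kernels): each iteration builds A^k and B^k from the previous iteration's powers, so iteration k cannot begin until iteration k-1 is done, which I suspect matters for the streams question.

```cpp
#include <cstddef>
#include <vector>

// Row-major n x n matrices; names here are illustrative only.
using Mat = std::vector<double>;

Mat matmul(const Mat& X, const Mat& Y, int n) {
    Mat Z(n * n, 0.0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int l = 0; l < n; l++)
                Z[i * n + j] += X[i * n + l] * Y[l * n + j];
    return Z;
}

// Kronecker product of two n x n matrices: an n^2 x n^2 matrix.
Mat kron(const Mat& X, const Mat& Y, int n) {
    int N = n * n;
    Mat Z(N * N, 0.0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            Z[i * N + j] = X[(i / n) * n + (j / n)] * Y[(i % n) * n + (j % n)];
    return Z;
}

// Sum_{k=0}^{terms-1} (A^k kron B^k).
// A^k and B^k are built from the previous iteration's powers,
// so iteration k cannot start before iteration k-1 has finished.
Mat kron_power_sum(const Mat& A, const Mat& B, int n, int terms) {
    Mat Ak(n * n, 0.0), Bk(n * n, 0.0);
    for (int i = 0; i < n; i++) { Ak[i * n + i] = 1.0; Bk[i * n + i] = 1.0; } // A^0 = B^0 = I
    Mat S = kron(Ak, Bk, n);                  // k = 0 term
    for (int k = 1; k < terms; k++) {
        Ak = matmul(Ak, A, n);                // loop-carried dependency
        Bk = matmul(Bk, B, n);                // independent of the Ak update
        Mat T = kron(Ak, Bk, n);
        for (std::size_t i = 0; i < S.size(); i++) S[i] += T[i];
    }
    return S;
}
```

Within one iteration the two power updates are independent of each other, but across iterations everything is serial.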

This is my code:

	const int num_streams = 16;
	cudaStream_t streams[num_streams];

	// Timer start
	cudaEventRecord(start, 0);

	// Declaration
	Matrix F; F.init();
	Matrix G; G.init();

	Matrix t1; t1.init();
	Matrix t2; t2.init();

	Matrix d_A; d_A.init();
	Matrix d_Z; d_Z.init();

	Matrix Γ_; Γ_.init(gSize, gSize, gSize);

	// Allocating and copying the matrices
	cudaMalloc(&d_Z.elements, d_size);
	cudaMalloc(&d_A.elements, d_size);
	cudaMemcpy(d_A.elements, A_.elements, d_size, cudaMemcpyHostToDevice);

	cudaMalloc(&F.elements, d_size);
	cudaMemcpy(F.elements, F_.elements, d_size, cudaMemcpyHostToDevice);
	cudaMalloc(&G.elements, d_size);

	cudaMalloc(&t1.elements, d_size);
	cudaMalloc(&t2.elements, d_size);

	cudaMalloc(&Γ_.elements, d_gsize);
	//cudaHostAlloc((void**)&Γ_.elements, d_gsize * sizeof(int), cudaHostAllocDefault);

	// Compute Σ_{k=0}^{n-1} (A^k ⊗ B^k)
	IdentityMatrixKernel<<<dimGrid, dimBlock>>>(G);
	IdentityMatrixKernel<<<dimGrid, dimBlock>>>(t2);
	matrixkroneckerProdKernel<<<dimGridG, dimBlock>>>(G, t2, Γ_);

	for (int k = 1; k < MAT_SIZE; k++)
	{
		if (k <= num_streams)
			cudaStreamCreate(&streams[k % num_streams]);

		if (k % 2 != 0)
		{
			MatMulLoopKernel<<<dimGrid, dimBlock>>>(t1, F, G);
			MatMulLoopKernel<<<dimGrid, dimBlock>>>(t2, d_A, d_Z);
			matrixkroneckerProdAddKernel<<<dimGridG, dimBlock, 0, streams[k % num_streams]>>>(G, d_Z, Γ_);
		}
		else
		{
			MatMulLoopKernel<<<dimGrid, dimBlock>>>(G, F, t1);
			MatMulLoopKernel<<<dimGrid, dimBlock>>>(d_Z, d_A, t2);
			matrixkroneckerProdAddKernel<<<dimGridG, dimBlock, 0, streams[k % num_streams]>>>(t1, t2, Γ_);
		}
	}

	cudaDeviceSynchronize();

	// Record time and destroy timer
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&time, start, stop);

	// Print execution time
	cout << "Decryption Running time: " << time << "ms" << endl;

IdentityMatrixKernel(M) - writes the identity matrix into M
MatMulLoopKernel(A, B, C) - computes A*B into C
matrixkroneckerProdAddKernel(A, B, C) - computes (A⊗B)+C into C
(each kernel works correctly on its own)

MAT_SIZE = 128
dimBlock = (16, 16)
dimGrid = (MAT_SIZE/dimBlock.x, MAT_SIZE/dimBlock.y)
dimGridG = (MAT_SIZE^2/dimBlock.x, MAT_SIZE^2/dimBlock.y)

All the matrices are MAT_SIZE×MAT_SIZE except Γ_, which is MAT_SIZE^2×MAT_SIZE^2.
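To clarify what I was trying to achieve with the streams: the two matrix products inside one iteration are independent of each other, so something like the following sketch (untested, with hypothetical stream/event names) is what I had in mind. Even then, the Kronecker-add has to wait for both products, and the next iteration has to wait for the current one, so only the two MatMul kernels can ever overlap:

```cpp
// Sketch only (untested): one stream per independent product chain.
// sA carries the F/G chain, sB carries the d_A/d_Z chain.
cudaStream_t sA, sB;
cudaStreamCreate(&sA);
cudaStreamCreate(&sB);
cudaEvent_t evB, evK;
cudaEventCreateWithFlags(&evB, cudaEventDisableTiming);
cudaEventCreateWithFlags(&evK, cudaEventDisableTiming);

for (int k = 1; k < MAT_SIZE; k++)
{
	if (k % 2 != 0)
	{
		MatMulLoopKernel<<<dimGrid, dimBlock, 0, sA>>>(t1, F, G);
		MatMulLoopKernel<<<dimGrid, dimBlock, 0, sB>>>(t2, d_A, d_Z);
		cudaEventRecord(evB, sB);
		cudaStreamWaitEvent(sA, evB, 0);   // Kronecker-add needs both G and d_Z
		matrixkroneckerProdAddKernel<<<dimGridG, dimBlock, 0, sA>>>(G, d_Z, Γ_);
	}
	else
	{
		MatMulLoopKernel<<<dimGrid, dimBlock, 0, sA>>>(G, F, t1);
		MatMulLoopKernel<<<dimGrid, dimBlock, 0, sB>>>(d_Z, d_A, t2);
		cudaEventRecord(evB, sB);
		cudaStreamWaitEvent(sA, evB, 0);
		matrixkroneckerProdAddKernel<<<dimGridG, dimBlock, 0, sA>>>(t1, t2, Γ_);
	}
	cudaEventRecord(evK, sA);
	cudaStreamWaitEvent(sB, evK, 0);       // next iteration reuses t2/d_Z
}
cudaStreamSynchronize(sA);
```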