I have a code to calculate this formula: summation from k=0 to n-1 of (A^k ⊗ B^k), (⊗ symbolizes Kronecker product between matrices).
I used a for loop to calculate each matrix (A^k ⊗ B^k) and add it to the result matrix but the run-time is really high and unreasonable.
I’ve seen around the internet the rise in run-time happens as a result of queues, and that executing kernels inside for loops is generically problematic but i couldn’t find a good solution.
I tried to run each iteration on a few different streams but couldn’t get the run-time to drop.
I want to know why my stream solution doesn’t work and if there is a better or other solutions to this loop problem.
This is my code:
const int num_streams = 16;
cudaStream_t streams[num_streams];
//Timer start
cudaEventRecord(start, 0);
// Decleration
Matrix F; F.init();
Matrix G; G.init();
Matrix t1; t1.init();
Matrix t2; t2.init();
Matrix d_A; d_A.init();
Matrix d_Z; d_Z.init();
Matrix Γ_; Γ_.init(gSize, gSize, gSize);
// Allocating and copying the matrices
cudaMalloc(&d_Z.elements, d_size);
cudaMalloc(&d_A.elements, d_size);
cudaMemcpy(d_A.elements, A_.elements, d_size, cudaMemcpyHostToDevice);
cudaMalloc(&F.elements, d_size);
cudaMemcpy(F.elements, F_.elements, d_size, cudaMemcpyHostToDevice);
cudaMalloc(&G.elements, d_size);
cudaMalloc(&t1.elements, d_size);
cudaMalloc(&t2.elements, d_size);
cudaMalloc(&Γ_.elements, d_gsize);
//cudaHostAlloc((void**)&Γ_.elements, d_gsize * sizeof(int), cudaHostAllocDefault
// Compute Sigma k = 0 -> n - 1(A^k ⊗ B^k)
IdentityMatrixKernel << <dimGrid, dimBlock >> > (G);
IdentityMatrixKernel << <dimGrid, dimBlock >> > (t2);
matrixkroneckerProdKernel << <dimGridG, dimBlock >> > (G, t2, Γ_);
for (int k = 1; k < MAT_SIZE; k++)
{
if(k <= num_streams)
cudaStreamCreate(&streams[k % num_streams]);
if (k % 2 != 0)
{
MatMulLoopKernel << <dimGrid, dimBlock >> > (t1, F, G);
MatMulLoopKernel << <dimGrid, dimBlock >> > (t2, d_A, d_Z);
matrixkroneckerProdAddKernel << <dimGridG, dimBlock, 0, streams[k % num_streams] >> > (G, d_Z, Γ_);
}
else
{
MatMulLoopKernel << <dimGrid, dimBlock >> > (G, F, t1);
MatMulLoopKernel << <dimGrid, dimBlock >> > (d_Z, d_A, t2);
matrixkroneckerProdAddKernel << <dimGridG, dimBlock, 0, streams[k % num_streams] >> > (t1, t2, Γ_);
}
}
cudaDeviceSynchronize();
//Record time and destroy timer
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
// Print execution time
cout << "Decryption Running time: " << time << "ms" << endl;
IdentityMatrixKernel(M) - insert identity matrix values into M
MatMulLoopKernel(A, B, C) - calculates A*B into C
matrixkroneckerProdAddKernel (A, B, C) - calculates (A⊗B)+C into C
(each of the kernels works on its own)
MAT_SIZE = 128
dimBlock = (16, 16)
dimGrid = (MAT_SIZE/dimBlock.x, MAT_SIZE/dimBlock.y)
dimGrid = (MAT_SIZE^2/dimBlock.x, MAT_SIZE^2/dimBlock.y)
All the matrices are MAT_SIZEMAT_SIZE except Γ_ which is MAT_SIZE^2MAT_SIZE^2