Main description:
I have a parallelized application that processes one dataset; the whole application consists of about 6 kernels, and it produces correct results. My question is the following:
Now assume I have 10 datasets. Normally I would process them with a for loop, but to get a higher speedup I tried to use CUDA streams. The effect I want is this:
Assume processing one dataset takes 10 s; then processing the ten datasets serially takes 10 s × 10 (maybe a bit less). With CUDA streams I would like to approach a 10× speedup, i.e. process all ten datasets in about 10 s (although I know that is probably impossible). But compared with the for-loop version, the CUDA streams version shows no improvement at all, and I think there should be at least some.
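For context, the streams and host buffers are created before the loop roughly like this (a simplified sketch, not my exact code; the key point is that the host arrays are allocated with cudaMallocHost, since cudaMemcpyAsync can only overlap with other work when the host memory is pinned):

const int nstreams = 10;
cudaStream_t streams[nstreams];
for (int i = 0; i < nstreams; i++)
    checkCudaErrors(cudaStreamCreate(&streams[i]));

// Pinned host memory, required for truly asynchronous cudaMemcpyAsync.
float *h_X, *h_Xfinal, *host_dist;
checkCudaErrors(cudaMallocHost((void**)&h_X,       nstreams * m1 * n * sizeof(float)));
checkCudaErrors(cudaMallocHost((void**)&h_Xfinal,  nstreams * m1 * n * sizeof(float)));
checkCudaErrors(cudaMallocHost((void**)&host_dist, nstreams * n * sizeof(float)));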
The following is the main code with streams:
for (int i = 0; i < nstreams; i++)
{
    // Stage the i-th dataset on the device, asynchronously in its own stream.
    checkCudaErrors(cudaMemcpyAsync(dev_X + i * m1 * n, h_X + i * m1 * n,
                                    m1 * n * sizeof(float),
                                    cudaMemcpyHostToDevice, streams[i]));

    dim3 sumGrid(4, m1);
    dim3 sumBlock(1024, 1);
    int sharedSize = sumBlock.x * sizeof(float);

    // Row means via shared-memory reduction, launched in streams[i].
    sumReduction_kernel<<<sumGrid, sumBlock, sharedSize, streams[i]>>>(
        dev_Xmean + i * m1, dev_X + i * m1 * n, m1, n);
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    // Subtract the row means, launched in streams[i].
    sub1_kernel<<<sumGrid, sumBlock, 0, streams[i]>>>(
        dev_XFinal + i * m1 * n, dev_X + i * m1 * n, dev_Xmean + i * m1, m1, n);
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    // Copy the centered data back to the host, asynchronously in streams[i].
    checkCudaErrors(cudaMemcpyAsync(h_Xfinal + i * m1 * n, dev_XFinal + i * m1 * n,
                                    sizeof(float) * m1 * n,
                                    cudaMemcpyDeviceToHost, streams[i]));

    // sigma = CNSTn * XFinal^T * XFinal (CULA; as noted below, this runs in the default stream).
    status = culaDeviceSgemm('T', 'N', m1, m1, n,
                             CNSTn,
                             dev_XFinal + i * m1 * n, n,
                             dev_XFinal + i * m1 * n, n,
                             CNST0,
                             dev_sigma + i * m1 * m1, m1);
    checkStatus(status);

    // Invert sigma in place: LU factorization, then the inverse (CULA, default stream).
    status = culaDeviceSgetrf(m1, m1, dev_sigma + i * m1 * m1, m1,
                              (culaDeviceInt*)dev_ipiv + i * m1);
    checkStatus(status);
    status = culaDeviceSgetri(m1, dev_sigma + i * m1 * m1, m1,
                              (culaDeviceInt*)dev_ipiv + i * m1);
    checkStatus(status);
    printf("%s\n", "CULA inversion done!");

    // buffer = XFinal * sigma^T (CULA, default stream).
    status = culaDeviceSgemm('N', 'T', n, m1, m1,
                             CNST1,
                             dev_XFinal + i * m1 * n, n,
                             dev_sigma + i * m1 * m1, m1,
                             CNST0,
                             dev_buffer + i * m1 * n, n);
    checkStatus(status);

    // Distance kernel, launched in streams[i].
    dist_kernel<<<4, 1024, 0, streams[i]>>>(
        dev_buffer + i * m1 * n, dev_XFinal + i * m1 * n, dev_dist + i * n, n, m1);
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    // Copy the distances back to the host, asynchronously in streams[i].
    checkCudaErrors(cudaMemcpyAsync(host_dist + i * n, dev_dist + i * n,
                                    sizeof(float) * n,
                                    cudaMemcpyDeviceToHost, streams[i]));
}
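After the loop I wait for all streams before reading the results on the host (again a simplified sketch of what I do, since cudaMemcpyAsync returns before the copy has finished):

for (int i = 0; i < nstreams; i++)
    checkCudaErrors(cudaStreamSynchronize(streams[i]));
// host_dist and h_Xfinal are now safe to read on the CPU.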
I used the CULA library for the matrix multiplications and the inversion, and in Nsight I can see that all the CULA operations run in the default stream.
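As far as I understand, work issued to the (legacy) default stream synchronizes with all other streams, so these CULA calls would serialize the whole loop. If I switched the GEMMs to cuBLAS, I believe the call could be bound to a stream with cublasSetStream, roughly like this for the first culaDeviceSgemm above (an untested sketch, assuming a cublasHandle_t handle already created with cublasCreate; requires cublas_v2.h):

// cuBLAS takes alpha/beta by pointer; same scalars as the CULA call.
const float alpha = CNSTn;
const float beta  = CNST0;
cublasSetStream(handle, streams[i]);   // subsequent cuBLAS calls run in streams[i]
cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
            m1, m1, n,
            &alpha,
            dev_XFinal + i * m1 * n, n,
            dev_XFinal + i * m1 * n, n,
            &beta,
            dev_sigma + i * m1 * m1, m1);

(The LU factorization and inversion would need something similar, e.g. cuSOLVER's cusolverDnSgetrf with cusolverDnSetStream, but I have not tried that either.)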
I hope you can give some analysis of this code and suggestions about why I can't get an obvious improvement.
(And is there any other way to do this kind of multi-task parallelization?)
Device: GTX 1060 6 GB
Thank you very much.