Cannot achieve kernel concurrency

toni574 · May 20, 2015, 8:32am

Hi,

I cannot achieve kernel concurrency with the code below (don’t mind if the code actually does not do anything smart - this is just a test). I have compiled with the --default-stream per-thread option, which basically sets a default stream for each host thread - feature available in CUDA 7. I have confirmed in Visual Profiler that indeed a CUDA stream is created for each host thread, but still cannot get kernels running concurrently - they seem to be serialized. I am using 16 host threads. Can anyone see the problem?

global void matoperation(unsigned char *s1, unsigned char *s2, unsigned char *d, int rows, int cols, int pitch)
{
int r = blockIdx.x * blockDim.x + threadIdx.x;
int c = blockIdx.y * blockDim.y + threadIdx.y;
for (int i = 1; i < 100000; i++)
if ((r < rows) && (c < cols))
d[r * pitch + c] = (((s1[r * pitch + c] / 3 + s2[r * pitch + c] / 3) / 7 + s2[r * pitch + c]
- s1[r * pitch + c]) * s1[r * pitch + c]) / s2[r * pitch + c];
}

void DummyOperation(unsigned char *Src1, unsigned char *Src2, unsigned char *Dst, int rows, int cols, int pitch)
{
matoperation<<<1, 1>>>(Src1, Src2, Dst, rows, cols, pitch);
cudaStreamSynchronize(0);
}

// Thread function (implemented using boost library)
void Process(int ThreadID)
{
…
// Here set rows, cols and pitch
// Here allocate memory for Src1GPU, Src2GPU, DstGPU at the device using cudaMallocPitch
// to make sure that memory at the device is allocated only once when this thread starts
…
try
{
while(1)
{
boost::this_thread::interruption_point();
…
// Here fetch new matrix from somewhere in the RAM
// copy matrix to the device (NOT using async copy)
…
DummyOperation(Src1GPU, Src2GPU, DstGPU, rows, cols, pitch);
…
// here copy result from device into host matrix (NOT using async copy)
//do something with this copy…
}
} catch (boost::thread_interrupted&)
{
printf(“Process thread interrupted\n”);
}
}

little_jimmy · May 20, 2015, 9:42am

“I am using 16 host threads”

host threads used :: cpu cores available :: properties[cudaStreamSynchronize(0)]

proposition: depending on the attributes/ properties of your cudaStreamSynchronize(0) implemented, if said cudaStreamSynchronize(0) is set to busy-wait, you may experience mild to severe host thread contention, manifesting as a serialization of stream activity

solution: what goes up must come down, but let it go up first
separate the launching of kernels, and the synchronization of kernels, such that all kernels are launched first, before threads start to wait on kernels, and/ or set the appropriate synchronization flags, to relieve host contention

Topic		Replies	Views
Kernels launched by multiple host threads get serialized by cudaStreamSynchronize(0) when --default- CUDA Programming and Performance	7	3061	October 12, 2021
Cannot force kernels to concurrent execution CUDA Programming and Performance	8	5648	April 28, 2012
My streams are not running concurrently CUDA Programming and Performance	7	1900	March 6, 2018
Kernel launch concurrency CUDA Programming and Performance	10	1950	December 11, 2014
Unable to achieve concurrency in kernel launches CUDA Programming and Performance	2	935	February 12, 2016
Is this strange behaviour with kernel concurrency? CUDA Programming and Performance	5	1040	September 3, 2015
Launching several kernels on one stream while another kernel running persistently in the background CUDA Programming and Performance	1	757	October 8, 2016
Concurrent executions of streams CUDA Programming and Performance	6	518	December 19, 2022
Cannot see concurrent kenrel execution by stream CUDA Programming and Performance	2	590	November 16, 2017
Concurrency about default stream CUDA Programming and Performance	3	2843	March 23, 2015

Cannot achieve kernel concurrency

Related topics