Exit from a for loop

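I have a simple loop in host code. On each iteration my CUDA kernel performs some task and sets the dev_found variable, and the host checks that flag to decide whether to exit the loop. But the device-to-host cudaMemcpy takes some time on every iteration. How can I avoid copying the flag to a host variable each time?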

for (int i = 0; i < 1000000; i++)
{
    find_pivot_row<<<numBlocks, blockSize>>>(dev_m, dev_b, dev_data, dev_index, dev_found);
    bool found;
    cudaMemcpy(&found, dev_found, sizeof(bool), cudaMemcpyDeviceToHost);

    if (!found)
    {
        break;
    }
}

You could try using pinned (page-locked) host memory for the dev_found variable, mapped into the device address space so that the kernel writes to it directly. That should eliminate the need for the cudaMemcpy operation. However, you will still need a blocking call in host code before reading the flag, so you would need to replace the cudaMemcpy operation with something like cudaDeviceSynchronize().
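A minimal sketch of the pinned-memory idea, not a drop-in replacement for your code. The names step_kernel and run_loop and the parameters limit and max_iters are hypothetical; step_kernel only stands in for find_pivot_row and clears the flag once the iteration index reaches limit. It assumes a system with unified virtual addressing, where mapped pinned memory works without extra device flags.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for find_pivot_row: one thread writes the flag
// directly into mapped host memory. The flag is true while iter < limit.
__global__ void step_kernel(int iter, int limit, volatile bool *found)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *found = (iter < limit);
}

// Runs the host loop and returns the number of kernel launches performed,
// or -1 if the pinned allocation fails.
int run_loop(int limit, int max_iters)
{
    bool *h_found = nullptr;  // host view of the flag (pinned, mapped)
    bool *d_found = nullptr;  // device view of the same memory

    if (cudaHostAlloc((void **)&h_found, sizeof(bool), cudaHostAllocMapped) != cudaSuccess)
        return -1;
    if (cudaHostGetDevicePointer((void **)&d_found, h_found, 0) != cudaSuccess) {
        cudaFreeHost(h_found);
        return -1;
    }

    int launches = 0;
    for (int i = 0; i < max_iters; i++) {
        step_kernel<<<1, 32>>>(i, limit, d_found);
        ++launches;

        // Still required: wait for the kernel before reading the flag.
        cudaDeviceSynchronize();

        // No cudaMemcpy: the host reads the kernel's write in place.
        if (!*(volatile bool *)h_found)
            break;
    }

    cudaFreeHost(h_found);
    return launches;
}
```

The per-iteration cost is now the kernel launch plus cudaDeviceSynchronize(), so the saving over the original cudaMemcpy is only the copy itself.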

I think in practice there is not likely to be much performance difference between these two approaches (cudaMemcpy versus pinned memory), unless your find_pivot_row kernel is doing almost no work.

Another approach would be to move the for-loop (or perhaps convert to a while loop, etc.) inside the kernel, and use a cooperative groups grid sync. Again, I’m skeptical of any major performance difference using this approach.
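A rough sketch of the in-kernel loop with a grid-wide sync, under several assumptions: the device supports cooperative launch (compute capability 6.0 or later), the whole grid fits on the GPU at once, and your toolkit may need relocatable device code (-rdc=true) for grid sync. The names loop_kernel and run_in_kernel_loop and the parameters limit and max_iters are hypothetical, and the "work" is just a placeholder that clears the flag once the iteration index reaches limit.

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: the loop lives on the device, and every thread
// agrees on when to exit by reading a flag between two grid-wide barriers.
__global__ void loop_kernel(int limit, int max_iters,
                            volatile bool *found, int *iters_done)
{
    cg::grid_group grid = cg::this_grid();

    for (int i = 0; i < max_iters; i++) {
        // Placeholder for the real per-iteration work.
        if (grid.thread_rank() == 0) {
            *found = (i < limit);
            *iters_done = i + 1;
        }

        grid.sync();               // flag for this iteration is now visible
        bool keep_going = *found;  // every thread reads the same value
        grid.sync();               // nobody rewrites the flag before all have read it

        if (!keep_going)
            break;                 // uniform exit: all threads leave together
    }
}

// Launches the kernel once and returns the number of iterations it ran,
// or -1 if allocation or the cooperative launch fails.
int run_in_kernel_loop(int limit, int max_iters)
{
    bool *d_found = nullptr;
    int *d_iters = nullptr;
    if (cudaMalloc((void **)&d_found, sizeof(bool)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&d_iters, sizeof(int)) != cudaSuccess) {
        cudaFree(d_found);
        return -1;
    }
    cudaMemset(d_iters, 0, sizeof(int));

    void *args[] = { &limit, &max_iters, &d_found, &d_iters };
    // grid.sync() is only valid when the kernel is launched cooperatively.
    cudaError_t err = cudaLaunchCooperativeKernel((void *)loop_kernel,
                                                  dim3(2), dim3(64), args, 0, 0);
    int iters = -1;
    if (err == cudaSuccess)
        cudaMemcpy(&iters, d_iters, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_found);
    cudaFree(d_iters);
    return iters;
}
```

The host now launches once and copies back once. In exchange, the grid size is capped at what can be resident on the device simultaneously, and each iteration pays for two grid-wide barriers.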

It’s fairly evident you are on Windows. The Windows WDDM driver model may require extra care with these ideas.