Edit: I deleted the benchmark-related question because it was not a very sensible one.
Another question: what is the preferred way to handle errors inside a CUDA kernel? I have a loop that launches the kernel n times, and I want to exit the loop as soon as an error occurs inside the kernel. Is there a faster way to notify the host thread than reading an error flag from global device memory?
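To make the question concrete, here is a minimal sketch of the flag-based approach I mean. The kernel name `step`, the buffer names, the sizes, and the error condition are all placeholders for illustration, not my real code; the point is only the pattern of launching, copying the flag back, and breaking out of the loop.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: does some work and records a failure in *errorFlag.
__global__ void step(float *data, int n, int launch, int *errorFlag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    data[i] += 1.0f;

    // Placeholder error condition, chosen only so the example triggers.
    if (data[i] > 5.0f) {
        // Store a non-zero code; any failing thread may win the write.
        atomicExch(errorFlag, launch + 1);
    }
}

int main()
{
    const int n = 1024;        // assumed problem size
    const int launches = 10;   // the "n times" from the question

    float *d_data = nullptr;
    int *d_flag = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(float));
    cudaMemset(d_flag, 0, sizeof(int));

    for (int k = 0; k < launches; ++k) {
        step<<<(n + 255) / 256, 256>>>(d_data, n, k, d_flag);

        // Catch launch-configuration errors reported by the runtime.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
            break;
        }

        // Blocking copy of the flag; it waits for the kernel to finish.
        int h_flag = 0;
        cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
        if (h_flag != 0) {
            printf("kernel reported error code %d in launch %d\n", h_flag, k);
            break;
        }
    }

    cudaFree(d_data);
    cudaFree(d_flag);
    return 0;
}
```

The cost I am worried about is the `cudaMemcpy` after every launch: it forces a synchronization and a device-to-host transfer each iteration, even when nothing went wrong.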
Thanks for any answers,