How to send data to the host during a long task on the device?

Hi there, I am a newbie to CUDA, though I have already written a few simple parallel programs :)

Here is what I am interested in and have not figured out so far. My code has to analyze a large set of data, e.g. a vector of chars (actually, I work with genomic sequences). After loading the data from a file, I send it to the device and launch the kernel, creating threads that each analyze their own part of the data. But during the analysis I need to send certain information back to the host, either to write it to a file or simply to indicate progress. My guess is that streams would work here, but is there a simpler workaround?
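To make the setup concrete, here is a minimal sketch of the kind of code described above. It is an illustration, not the actual program: the kernel name `countGC`, the helper `countGCOnDevice`, the G/C-counting task, and the chunk and block sizes are all hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstddef>
#include <string>
#include <cuda_runtime.h>

// Hypothetical analysis: each thread scans its own slice of the sequence,
// counts G/C bases, and adds its partial count to one global total.
__global__ void countGC(const char *seq, size_t n, size_t chunk,
                        unsigned long long *total)
{
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t begin = tid * chunk;
    if (begin >= n) return;                      // thread has no slice
    size_t end = (begin + chunk < n) ? begin + chunk : n;

    unsigned long long local = 0;
    for (size_t i = begin; i < end; ++i)
        if (seq[i] == 'G' || seq[i] == 'C') ++local;

    atomicAdd(total, local);
}

// Host side: copy the sequence to the device, launch once, copy the result back.
unsigned long long countGCOnDevice(const std::string &seq, size_t chunk = 4)
{
    if (seq.empty()) return 0;

    char *dSeq = nullptr;
    unsigned long long *dTotal = nullptr;
    unsigned long long hTotal = 0;
    cudaMalloc(&dSeq, seq.size());
    cudaMalloc(&dTotal, sizeof(hTotal));
    cudaMemcpy(dSeq, seq.data(), seq.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(dTotal, &hTotal, sizeof(hTotal), cudaMemcpyHostToDevice);

    size_t threads = (seq.size() + chunk - 1) / chunk;  // one thread per slice
    int block = 128;                                    // assumed block size
    int grid = (int)((threads + block - 1) / block);
    countGC<<<grid, block>>>(dSeq, seq.size(), chunk, dTotal);

    // This blocking copy waits for the kernel, so the host only sees
    // the result once the whole analysis has finished.
    cudaMemcpy(&hTotal, dTotal, sizeof(hTotal), cudaMemcpyDeviceToHost);

    cudaFree(dSeq);
    cudaFree(dTotal);
    return hTotal;
}
```

With a single launch like this, a call such as `countGCOnDevice("GGCCAT")` gives the host nothing until the kernel is done, which is exactly the limitation I am asking about.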

While I'm at it, one more question. As I understand it, it is impossible to synchronize ALL threads, only those within the same block. But, as in the genomic sequence example above, I need to synchronize ALL threads before, say, sending information to the host. Does that mean the only way for me to do so is to launch the kernel with a single block?
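To show what I mean by block-level synchronization, here is a small sketch under my own assumptions: the names `blockSums`, `blockSumsOnDevice`, and `kBlock` are made up, the block size is assumed to be a power of two, and error checking is left out. Each block sums its own slice of the input, and `__syncthreads()` only makes the threads of that one block wait for each other.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 128;  // hypothetical block size, must be a power of two

// Each block reduces its own kBlock-element slice into one partial sum.
__global__ void blockSums(const int *in, int n, int *perBlock)
{
    __shared__ int partial[kBlock];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();  // barrier for the threads of THIS block only

    // Tree reduction in shared memory; every step needs a block-wide barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) perBlock[blockIdx.x] = partial[0];
}

// Host helper: returns one partial sum per block.
std::vector<int> blockSumsOnDevice(const std::vector<int> &data)
{
    int n = (int)data.size();
    int grid = (n + kBlock - 1) / kBlock;
    if (grid == 0) return {};

    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, grid * sizeof(int));
    cudaMemcpy(dIn, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSums<<<grid, kBlock>>>(dIn, n, dOut);

    std::vector<int> out(grid);
    // Blocking copy: returns only after every block of the kernel has finished.
    cudaMemcpy(out.data(), dOut, grid * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Here the blocks never wait for one another inside the kernel, so the partial sums are only combined on the host after the launch completes. What I am missing is a way to get the same "everyone has finished this step" guarantee across all blocks while the kernel is still running.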