printf in a CUDA kernel

I want to print the ID of every thread in a kernel, but not all of the messages are printed. For example, with 12 million threads, only 4 million thread IDs were printed; the output of the other printf calls was simply lost. I set the buffer size to 2 GB with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size) and called cudaDeviceSynchronize() after the kernel ran, but that did not help. Is there any way to fix this?

Don’t do that :-) Device-side printf wasn’t designed for a scenario like this.
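Some background on why output goes missing: according to the CUDA programming guide, the device-side printf buffer is a fixed-size circular FIFO whose size is set before the kernel launch, and when a kernel produces more output than fits, older output is overwritten. It may also be worth checking whether the requested size was actually applied, since the driver is allowed to adjust the value. Below is a minimal sketch of such a check; the helper name `set_printf_fifo` and the 8 MiB request are made up for illustration, and I have not tested what happens with a 2 GB request.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: request a printf FIFO size and return the size the
// runtime reports afterwards, or 0 if either call fails.
// Must be called before launching any kernel that uses printf.
size_t set_printf_fifo(size_t bytes) {
    cudaError_t err = cudaDeviceSetLimit(cudaLimitPrintfFifoSize, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 0;
    }
    size_t actual = 0;
    err = cudaDeviceGetLimit(&actual, cudaLimitPrintfFifoSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit: %s\n", cudaGetErrorString(err));
        return 0;
    }
    return actual;
}

__global__ void hello() {
    printf("hello from thread %u\n", threadIdx.x);
}

int main() {
    const size_t requested = (size_t)8 << 20;  // 8 MiB, arbitrary example value
    const size_t actual = set_printf_fifo(requested);
    printf("requested %zu bytes, runtime reports %zu bytes\n", requested, actual);

    hello<<<1, 4>>>();
    // Synchronizing flushes the printf buffer to the host.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

If the reported size is smaller than what you asked for, or the call returns an error, that alone could explain part of the lost output.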

The most important question is probably what you hope to achieve by printing from all these threads, generating massive amounts of text data in the process. Is this for debugging?
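If it is for debugging, one alternative is to have each thread write its value into a global-memory array, copy the array back, and inspect it on the host, keeping printf for a single thread of interest. Here is a rough sketch under that assumption; the names `record_ids` and `count_mismatches` and the thread index 12345 are invented for the example and not taken from your code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread stores its global ID instead of printing it.
__global__ void record_ids(unsigned int *out, size_t n) {
    size_t gid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (gid < n) {
        out[gid] = (unsigned int)gid;
    }
    // Keep printf for one thread of interest only (arbitrary index).
    if (gid == 12345) {
        printf("thread %llu reached this point\n", (unsigned long long)gid);
    }
}

// Launches one thread per element, copies the results back, and returns how
// many entries do not hold their own index. Returns n on any CUDA error.
size_t count_mismatches(size_t n) {
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(unsigned int)) != cudaSuccess) return n;

    const int block = 256;
    const unsigned int grid = (unsigned int)((n + block - 1) / block);
    record_ids<<<grid, block>>>(d_out, n);

    // The blocking copy waits for the kernel to finish before reading.
    std::vector<unsigned int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return n;

    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (h_out[i] != i) ++bad;
    }
    return bad;
}

int main() {
    const size_t n = 12000000;  // 12 million threads, as in the question
    const size_t bad = count_mismatches(n);
    printf("%zu of %zu thread IDs missing or wrong\n", bad, n);
    return bad == 0 ? 0 : 1;
}
```

Once the data is in host memory you can filter, sort, or write it to a file, none of which depends on the size of the printf buffer.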

Can you reduce the number of threads? Using on the order of 100,000 threads should be sufficient to make efficient use of even the largest GPU available now and in the foreseeable future.
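One common way to cut the thread count without changing the amount of work is a grid-stride loop, where each thread handles several elements. The sketch below is only an illustration: the kernel `scale`, the helper `run_scale`, and the launch configuration of 400 blocks of 256 threads (102,400 threads) are assumptions of mine, not something from your code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread starts at its global index and advances by
// the total number of threads in the grid until it runs past n.
__global__ void scale(float *data, size_t n, float factor) {
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Fills n elements with 1.0f, doubles them on the GPU with a fixed-size grid,
// and returns true if every element came back as 2.0f.
bool run_scale(size_t n) {
    std::vector<float> h(n, 1.0f);
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d);
        return false;
    }

    const int block = 256;
    const int grid = 400;  // 400 * 256 = 102,400 threads, independent of n
    scale<<<grid, block>>>(d, n, 2.0f);

    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (float v : h) {
        if (v != 2.0f) return false;
    }
    return true;
}

int main() {
    const size_t n = 12000000;  // 12 million elements, as in the question
    printf("%s\n", run_scale(n) ? "all elements processed" : "mismatch or CUDA error");
    return 0;
}
```

The grid size here is fixed rather than derived from n, so the same launch covers 12 million elements or 1,000; the best values for a given GPU are something to tune.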