Debugging tips and tricks for race conditions

I am writing a data analysis program that uses dynamic parallelism (though this alone might not be the culprit). The program works fine for 1024 data points / objects. However, at 2048 or more I see symptoms of race conditions, and the final result is incorrect. I cannot reproduce the problem at a smaller scale, which makes it hard to debug.

I am posting in this forum to ask how I can find and fix bugs that only occur at larger data sizes / loads. Please suggest tools and techniques I could use to make my algorithm scale.

