unspecified launch failure

Sorry for all these posts. Hoping to get from CUDA novice to CUDA mediocre one of these days!

I mentioned this briefly elsewhere, but I am having a problem. I have hundreds of .txt files stored in a folder for a day’s worth of data. So, for example, I have a folder 20100406 (i.e. April 6, 2010). My program reads each date folder, but I only keep one date in there at a time because there’s about 400 MB of data in each day’s folder. So, the layout is: /data/20100406/more_folders/*.txt_files.

To know which of the “more_folders” to open to read in their .txt files, I have a variables.lst file telling the CPU which ones to open. The problem is this: whenever I change the .lst file to inspect a different set of folders in “more_folders”, my program does not work. More specifically, the program launches 2 CPU threads corresponding to my 2 GPU devices, and either 1 or both kernel launches fail with the error “unspecified launch failure”. If only 1 kernel launch fails, the other one starts but hangs forever (as far as I can tell it hangs permanently, because I know how long it takes to return when it is working correctly). With smaller data sets, both kernels fail and the program terminates normally. Upon running the program again, it works perfectly! It has been that way for a while now: the first run after a change to the .lst file fails, and every run after that works fine.

Any ideas what would cause something like this to happen? It surely has to do with the multi-GPU approach that I am taking. The code itself works fine; I have tested that. There seems to be some sort of initialization bug, or something that CUDA doesn’t handle correctly. And no, I’m not accessing host memory from the GPU, which would cause a kind of segmentation fault and is what most people say this error is usually caused by.

Please help if you can.

Thanks!
Daniel

Whenever I toggle my program to run just 1 GPU device, this error doesn’t seem to ever occur. So, it has to do with multithreading the CPU to handle multiple GPU devices. Any thoughts on preventing the problem I am talking about?
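
For anyone trying to reproduce this, here is a minimal sketch of the pattern described above (one host thread per GPU device), with an error check after every CUDA call so that the first failing call is reported together with its device index. It is not the actual program: the `scale` kernel, the `worker` function, the `CHECK` macro, and the buffer size are hypothetical placeholders, and it uses `std::thread` only for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call and the device it ran on,
// then leave the worker. (For brevity, this sketch does not free the
// device buffer on the error path.)
#define CHECK(call, dev)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "device %d: %s failed: %s\n", (dev),     \
                         #call, cudaGetErrorString(err_));                \
            return;                                                       \
        }                                                                 \
    } while (0)

// Placeholder kernel; the real kernel is not shown in this post.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // bounds check avoids out-of-range writes
}

// Work done by one host thread, bound to one device.
void worker(int dev, int n)
{
    CHECK(cudaSetDevice(dev), dev);  // bind this host thread to its GPU first

    float *d = nullptr;
    CHECK(cudaMalloc(&d, n * sizeof(float)), dev);
    CHECK(cudaMemset(d, 0, n * sizeof(float)), dev);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d, n);
    CHECK(cudaGetLastError(), dev);       // reports launch/configuration errors
    CHECK(cudaDeviceSynchronize(), dev);  // reports faults raised while the kernel runs

    CHECK(cudaFree(d), dev);
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA devices found\n");
        return EXIT_FAILURE;
    }

    // One host thread per device, as in the setup described above.
    std::vector<std::thread> pool;
    for (int dev = 0; dev < count; ++dev)
        pool.emplace_back(worker, dev, 1 << 20);
    for (auto &t : pool)
        t.join();

    return EXIT_SUCCESS;
}
```

Checking both `cudaGetLastError()` right after the launch and the result of the following synchronize should at least show which device fails, and whether the failure happens at launch time or while the kernel is running.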