Program freezes the machine after several runs, or cudaThreadSynchronize() and its effect

I have a program that runs fine several times. Then on some random call, say the 10th, it freezes the whole machine. Sometimes several of the preceding calls return with a timeout error. The kernel call is followed immediately by

cudaThreadSynchronize();
ftime(&endTime); // stop timer
cout << "Kernel finished with " << cudaGetErrorString(cudaGetLastError()) << endl;

I have no doubt that the program passes these three lines and continues execution, because it writes to a file afterwards. However, the machine stops responding and only a reset helps. Sometimes a kind of noise appears on the screen, as if the memory were corrupted.
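To see which step actually reports the timeout, the launch and the synchronize call can be checked separately instead of only reading cudaGetLastError() at the end. This is a minimal sketch, not my real code: `myKernel`, `stepOk` and the buffer size are hypothetical placeholders, and it uses float so that it compiles without any architecture flag.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel; it only doubles each element.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Print the status of one step and report whether it succeeded.
static bool stepOk(cudaError_t err, const char *step)
{
    printf("%s: %s\n", step, cudaGetErrorString(err));
    return err == cudaSuccess;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = 0;

    if (!stepOk(cudaMalloc((void **)&d_data, n * sizeof(float)), "cudaMalloc"))
        return 1;
    if (!stepOk(cudaMemset(d_data, 0, n * sizeof(float)), "cudaMemset")) {
        cudaFree(d_data);
        return 1;
    }

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // Problems with the launch itself (bad configuration, missing resources)
    // are reported by cudaGetLastError() right after the launch.
    bool ok = stepOk(cudaGetLastError(), "kernel launch");

    // Problems during execution, including a timeout, are returned by the
    // synchronize call. Newer toolkits name this call cudaDeviceSynchronize().
    ok = stepOk(cudaThreadSynchronize(), "kernel execution") && ok;

    ok = stepOk(cudaFree(d_data), "cudaFree") && ok;
    return ok ? 0 : 1;
}
```

With this layout, a timeout should show up on the "kernel execution" line, while a bad launch configuration shows up one line earlier.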

I cudaFree() everything that was allocated with cudaMalloc(). However, it seems to me that the program consumes more and more memory without freeing it on exit (over the previous 9 launches) and finally corrupts some memory that is used by the OS (the card is connected to the screen).
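One way to test the suspicion that memory stays allocated between runs would be to print the free device memory at the start of every run and compare the numbers. The sketch below assumes a toolkit whose runtime API provides cudaMemGetInfo(); I am not sure that 2.3 does, and older toolkits may only expose the driver API counterpart cuMemGetInfo(). The helper `printDeviceMemory` and the 64 MB test buffer are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print free and total device memory with a label.
// Returns false if the query itself fails.
static bool printDeviceMemory(const char *label)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        printf("%s: cudaMemGetInfo failed: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    printf("%s: %.1f MB free of %.1f MB\n", label,
           freeBytes / (1024.0 * 1024.0), totalBytes / (1024.0 * 1024.0));
    return true;
}

int main()
{
    if (!printDeviceMemory("at start"))
        return 1;

    // Hypothetical 64 MB test allocation, standing in for the real buffers.
    const size_t bytes = 64 * 1024 * 1024;
    float *d_buf = 0;
    cudaError_t err = cudaMalloc((void **)&d_buf, bytes);
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printDeviceMemory("after cudaMalloc");

    cudaFree(d_buf);
    printDeviceMemory("after cudaFree");
    return 0;
}
```

If the "at start" value keeps shrinking from one run to the next, something is indeed not being released when the process exits. If it stays the same, the freeze probably has another cause.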

The question has been asked on this forum already, but again: is there some kind of memory protection in the architecture/driver?

What can consume memory on the card? To my knowledge, everything declared in the kernel as “double a[3]” or “double z = 0.0” is deallocated at the latest when the host process exits. Everything else is freed explicitly.

Update:

If I reboot the machine and run my program, it fails most of the time (timeout error).
If I reboot the machine, run Unigraphics NX5 (a CAD/CAM/CAE PLM package; e.g. open a part file), close Unigraphics NX5 and only then run my program, it runs well several times.
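Since the card drives the display, the timeout errors may be related to the run time limit that the driver can impose on kernels for such devices. The sketch below only checks whether that limit is reported as active, using the kernelExecTimeoutEnabled field of cudaDeviceProp. It does not change the limit and is not a fix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            printf("Device %d: cudaGetDeviceProperties failed: %s\n",
                   dev, cudaGetErrorString(err));
            continue;
        }
        // kernelExecTimeoutEnabled is nonzero when kernels on this device
        // are subject to a run time limit.
        printf("Device %d: %s, run time limit on kernels: %s\n",
               dev, prop.name, prop.kernelExecTimeoutEnabled ? "yes" : "no");
    }
    return 0;
}
```

If the limit is active, a kernel that occasionally runs too long would explain the timeout errors, though not necessarily the freeze itself.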

Configuration: Windows Server 2003 x64 Enterprise, CUDA Toolkit 2.3, driver version 190.62, Visual Studio 2005, N260GTX card.