Code works with sm_20 but fails with sm_13

I started using CUDA three months ago, and this is my first post here. Please bear with me if my question is very naive.

But I am really puzzled. I have a Monte Carlo simulation code. It is not complicated, but it uses everything a numerical calculation needs: complex-number operations, matrix multiplications, and random number generation. At the end of every iteration, a measurement is taken across all threads, the results are transferred back to the CPU, and the summation is done there.
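The measurement step described above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual simulation code: the kernel `measure`, the helper `measure_and_sum`, and the `CUDA_CHECK` macro are hypothetical names, and the constant written by each thread is a placeholder for the real observable. Every runtime call is checked, so a failure that is otherwise silent shows up as an error message instead of as garbage in the results.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the measurement: each thread writes one value.
__global__ void measure(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard threads past the end of the array
        out[i] = 1.0f;    // placeholder; the real code computes an observable
}

// Run one measurement over n threads and return the sum taken on the CPU.
double measure_and_sum(int n)
{
    float *d_out = NULL;
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(float)));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    measure<<<blocks, threads>>>(d_out, n);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    // Copy the per-thread results back and sum them on the host.
    std::vector<float> h_out(n);
    CUDA_CHECK(cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_out));

    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += h_out[i];
    return sum;
}

int main()
{
    std::printf("sum = %f\n", measure_and_sum(1000));
    return 0;
}
```

On toolkits older than CUDA 4.0, `cudaThreadSynchronize()` takes the place of `cudaDeviceSynchronize()`.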

I am using a C2050. My code works perfectly if I compile with sm_20: I get the results that theory predicts, so I am confident the algorithm and arithmetic are correct. However, when I compile with -arch=sm_13, I get nonsense answers. After only a few iterations, I get NaNs! I then tried cuda-gdb, compiling with the -g -G flags. With those flags it works and does give the right answers.

Why? I suspect it is related to memory allocation and access. I use textures to access global memory, and my data are arrays of matrices of float2. But I really don't know what the differences are between sm_13 and sm_20 regarding memory. Please point me in the right direction for what I should look into.