Confused by the "-arch=sm_20" option of nvcc

[font=“Courier New”]I am using a simple program to test the "-arch=sm_20" option.
Thread IDs are stored in global memory on the GPU and then copied from device to host to check whether they were set properly.
My platform is an NVIDIA S2050 running SLED 11.1, and the compute capability of the S2050 is 2.0.
The code is as follows.

#include <stdio.h>
#include <stdlib.h>
#define NUM_SIZE 1024

/* Each thread writes its global index divided by 3.0 into the array. */
__global__ void kernel(double *dev_arr)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    dev_arr[tid] = tid / 3.0;
}

int main(int argc, char **argv)
{
    size_t bytes = sizeof(double) * NUM_SIZE;
    double *dev_arr;
    cudaMalloc((void **)&dev_arr, bytes);

    double *host_arr = (double *)malloc(bytes);

    /* One block of 1024 threads. */
    kernel<<<1, 1024>>>(dev_arr);

    cudaMemcpy(host_arr, dev_arr, bytes, cudaMemcpyDeviceToHost);

    printf("host_arr[0] = %lf.\n", host_arr[0]);
    return 0;
}

If the program is compiled with the “-arch=sm_20” option, the output is:
host_arr[0] = -1456815990147462891125136942359339382185244158826619267593931664968442323048246672764155341958241671875972237215762610409185128240974392406835200.000000.

If it is compiled without this option, the result is correct.

If I change the line “dev_arr[tid] = tid / 3.0” in the kernel to “dev_arr[tid] = tid;”, the result is correct with or without this option. Of course, I also get the warning: Double is not supported. Demoting to float

If the number of threads per block is not greater than 512, everything works. But the S2050 supports 1024 threads per block, so why can't I get the right result with 1024 threads per block?

Thanks for your help.

[/font]

The kernel isn’t running at all in the 1024-thread case because there are not enough registers for a block of that size. If you add appropriate error checking after the kernel launch, you should see a launch failure error.
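
For reference, here is a rough sketch of what that error checking could look like, applied to the program from the question. The CUDA_CHECK macro is a hypothetical helper, not something from the original post or the CUDA toolkit, and the sketch assumes CUDA 4.0 or later (older toolkits would use cudaThreadSynchronize() in place of cudaDeviceSynchronize()). It also calls cudaFuncGetAttributes() to print how many registers per thread the compiled kernel uses. A compute capability 2.0 multiprocessor has 32768 registers, so a 1024-thread block can only launch if the kernel needs at most 32 registers per thread.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define NUM_SIZE 1024

/* Hypothetical helper (not part of CUDA): print the error string and
   abort if a CUDA runtime call does not return cudaSuccess. */
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

__global__ void kernel(double *dev_arr)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    dev_arr[tid] = tid / 3.0;
}

int main(void)
{
    size_t bytes = sizeof(double) * NUM_SIZE;
    double *dev_arr = NULL;
    double *host_arr = (double *)malloc(bytes);
    if (host_arr == NULL) {
        return EXIT_FAILURE;
    }

    /* Query the compiled kernel: registers used per thread and the
       largest block size it can actually be launched with. */
    cudaFuncAttributes attr;
    CUDA_CHECK(cudaFuncGetAttributes(&attr, kernel));
    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);

    CUDA_CHECK(cudaMalloc((void **)&dev_arr, bytes));

    kernel<<<1, NUM_SIZE>>>(dev_arr);
    CUDA_CHECK(cudaGetLastError());       /* errors detected at launch */
    CUDA_CHECK(cudaDeviceSynchronize());  /* errors during execution */

    CUDA_CHECK(cudaMemcpy(host_arr, dev_arr, bytes, cudaMemcpyDeviceToHost));
    printf("host_arr[0] = %f\n", host_arr[0]);

    CUDA_CHECK(cudaFree(dev_arr));
    free(host_arr);
    return 0;
}
```

If the launch is rejected, the cudaGetLastError() check should stop the program with a message instead of letting it print uninitialized host memory. Compiling with "nvcc -arch=sm_20 --ptxas-options=-v" reports the register count at build time, and if it is above 32, capping it with "-maxrregcount=32" or declaring the kernel with __launch_bounds__(1024) may let the 1024-thread launch succeed, possibly at the cost of register spilling.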