I am using a simple program to test the "-arch=sm_20" option.
Each thread writes a value derived from its thread ID to global memory on the GPU, and the values are then copied from device to host to check whether they were set properly.
My platform is an NVIDIA S2050 running SLED 11.1; the compute capability of the S2050 is 2.0.
The code is as follows.
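#include <stdio.h>
#include <stdlib.h>

#define NUM_SIZE 1024

__global__ void kernel(double *dev_arr)
{
    // Global thread index across the whole grid.
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    dev_arr[tid] = tid / 3.0;
}

int main(int argc, char *argv[])
{
    size_t bytes = sizeof(double) * NUM_SIZE;
    double *host_arr = (double *)malloc(bytes);
    double *dev_arr = NULL;
    cudaMalloc((void **)&dev_arr, bytes);
    // One block of 1024 threads.
    kernel<<<1, 1024>>>(dev_arr);
    cudaMemcpy(host_arr, dev_arr, bytes, cudaMemcpyDeviceToHost);
    // Print every element so each thread's value can be checked.
    for (int i = 0; i < NUM_SIZE; ++i)
        printf("host_arr = %lf.\n", host_arr[i]);
    cudaFree(dev_arr);
    free(host_arr);
    return 0;
}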
If the program is compiled with the "-arch=sm_20" option, the output is:
host_arr = -1456815990147462891125136942359339382185244158826619267593931664968442323048246672764155341958241671875972237215762610409185128240974392406835200.000000.
If it is compiled without this option, the result is correct.
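A garbage value like the one above can also appear when the kernel never ran and the host simply copied back uninitialized device memory. A minimal sketch of how the launch could be checked for errors follows. It is only an illustration: `checkCuda` is a hypothetical helper that is not part of my program, and printing only the last element is an arbitrary choice.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NUM_SIZE 1024

// Hypothetical helper (not part of the original program): reports a CUDA
// error with a label and returns true only when the call succeeded.
static bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

__global__ void kernel(double *dev_arr)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    dev_arr[tid] = tid / 3.0;
}

int main()
{
    size_t bytes = sizeof(double) * NUM_SIZE;
    double *host_arr = (double *)malloc(bytes);
    double *dev_arr = NULL;
    if (host_arr == NULL) return 1;
    if (!checkCuda(cudaMalloc((void **)&dev_arr, bytes), "cudaMalloc")) return 1;

    kernel<<<1, NUM_SIZE>>>(dev_arr);
    // A rejected launch configuration is reported by cudaGetLastError();
    // errors raised while the kernel runs surface at the next synchronizing call.
    bool ok = checkCuda(cudaGetLastError(), "kernel launch");
    ok = checkCuda(cudaDeviceSynchronize(), "kernel execution") && ok;
    ok = checkCuda(cudaMemcpy(host_arr, dev_arr, bytes, cudaMemcpyDeviceToHost),
                   "cudaMemcpy") && ok;

    // Only trust the host buffer when every step succeeded.
    if (ok) printf("host_arr[%d] = %f\n", NUM_SIZE - 1, host_arr[NUM_SIZE - 1]);

    cudaFree(dev_arr);
    free(host_arr);
    return ok ? 0 : 1;
}
```

If the launch is being rejected, the "kernel launch" check should print the reason instead of the program silently printing whatever happens to be in the buffer.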
If I change the line "dev_arr[tid] = tid / 3.0" in the kernel to "dev_arr[tid] = tid;", the result is correct with or without this option. Of course, when compiling without the option I get the warning: Double is not supported. Demoting to float
If the number of threads per block is not greater than 512, everything works correctly. But the S2050 supports 1024 threads per block, so why can't I get the right result with 1024 threads per block?
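One thing that might be worth comparing is the device limit against what the compiled kernel actually needs, since the per-block thread limit is not the only resource constraint on a launch. The sketch below is an illustration, not a confirmed diagnosis. It assumes device 0 is the S2050, and `fitsRegisterBudget` is a hypothetical helper that does a rough check and ignores register allocation granularity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(double *dev_arr)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    dev_arr[tid] = tid / 3.0;
}

// Hypothetical helper: rough test of whether a block's registers fit in the
// per-block register file. Ignores allocation granularity (an approximation).
static bool fitsRegisterBudget(int regsPerThread, int threadsPerBlock, int regsPerBlock)
{
    return (long long)regsPerThread * threadsPerBlock <= regsPerBlock;
}

int main()
{
    const int device = 0;           // assumption: the S2050 is device 0
    const int threadsPerBlock = 1024;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }

    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, kernel) != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed\n");
        return 1;
    }

    printf("device: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("device max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("registers per block:          %d\n", prop.regsPerBlock);
    printf("kernel registers per thread:  %d\n", attr.numRegs);
    printf("kernel max threads per block: %d\n", attr.maxThreadsPerBlock);
    printf("rough register fit at %d threads: %s\n", threadsPerBlock,
           fitsRegisterBudget(attr.numRegs, threadsPerBlock, prop.regsPerBlock) ? "yes" : "no");
    return 0;
}
```

Building this once with "-arch=sm_20" and once without should show whether the two builds of the kernel report different per-thread register counts or a different `attr.maxThreadsPerBlock`.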
Thanks for your help.