Kernel works fine for "int" and "float" but NOT "double"

Hi,

I have written a simple program with the kernel function below, and I have made sure the initialization steps are correct. The strange thing is that the program produces correct results for “int” and “float”, but not for “double”. Please help, thanks.


__global__ void mytest(int *test1, double *test2){
    int idx = threadIdx.x;   // one thread per array element

    test1[idx] = idx;
    test2[idx] = idx;        // int value converted to double
}


output:
test1 [1]=1; test1[2]=2;…
test2[1]=0; test2[2]=0;…

note: test1 and test2 are initialized to 0.

From the programming guide, page 24:

“Some ptx instructions are only supported on devices of higher compute capabilities. For example, atomic instructions on global memory are only supported on devices of compute capability 1.1 and above; double-precision instructions are only supported on devices of compute capability 1.3 and above.”
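The compute capability mentioned in that passage can be checked at run time before relying on double precision. The following is a minimal sketch, not part of the original program: it assumes a CUDA runtime that provides `cudaGetDeviceCount` and `cudaGetDeviceProperties`, and the helper name `supportsDouble` is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if the device reports compute capability 1.3 or
// higher, the minimum the programming guide gives for double precision.
static bool supportsDouble(const cudaDeviceProp &prop) {
    return prop.major > 1 || (prop.major == 1 && prop.minor >= 3);
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Report the compute capability of every visible GPU.
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("Device %d: %s, compute capability %d.%d, double precision: %s\n",
               dev, prop.name, prop.major, prop.minor,
               supportsDouble(prop) ? "yes" : "no");
    }
    return 0;
}
```

If every device reports a capability below 1.3, compiling with a higher architecture flag will not make double precision available on that hardware.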

To expand on the previous post, you need to either set “-arch sm_13” in your call to nvcc or use the CUFILES_sm_13 variable before including common.mk in your makefile.

Thank you Jamie for your quick reply!!

I used “nvcc -arch sm_13” as you suggested, based on page 24, but “double *test2” still shows no sign of being written by the kernel function. More specifically, “double *test2” always shows its initial value “0”. Any more suggestions? Thanks.

Can you post the complete source file?

code attached. Thx…

Please rename “mytest.txt” to “mytest.cu” before compiling. I wasn’t allowed to upload files with a “.cu” extension.
mytest.txt (1.24 KB)

I get the following output:

[codebox][ydd@localhost tmp]$ make

mkdir -p /usr/local/NVIDIA_CUDA_SDK/common//…/lib

mkdir -p obj/release

mkdir -p ./bin//release

/usr/local/cuda/bin/nvcc -o obj/release/test.cu_sm_13_o -c test.cu --compiler-options -fno-strict-aliasing -I. -I/usr/local/cuda/include -I/usr/local/NVIDIA_CUDA_SDK/common//…/common/inc -DUNIX -O3 -arch sm_13

g++ -fPIC -o ./bin//release/HelloCUDA obj/release/test.cu_sm_13_o -L/usr/local/cuda/lib -L/usr/local/NVIDIA_CUDA_SDK/common//…/lib -L/usr/local/NVIDIA_CUDA_SDK/common//…/common/lib/linux -lcudart -L/usr/local/cuda/lib -L/usr/local/NVIDIA_CUDA_SDK/common//…/lib -L/usr/local/NVIDIA_CUDA_SDK/common//…/common/lib/linux -lcutil

[ydd@localhost tmp]$ ./bin/release/HelloCUDA

vnew[0] is 0; vnew2[0] is 0;vnew[0] is 0

vnew[1] is 1; vnew2[1] is 1;vnew[1] is 1

vnew[2] is 2; vnew2[2] is 2;vnew[2] is 2

vnew[3] is 3; vnew2[3] is 3;vnew[3] is 3

vnew[4] is 4; vnew2[4] is 4;vnew[4] is 4

vnew[5] is 5; vnew2[5] is 5;vnew[5] is 5

vnew[6] is 6; vnew2[6] is 6;vnew[6] is 6

vnew[7] is 7; vnew2[7] is 7;vnew[7] is 7

vnew[8] is 8; vnew2[8] is 8;vnew[8] is 8

vnew[9] is 9; vnew2[9] is 9;vnew[9] is 9

vnew[10] is 10; vnew2[10] is 10;vnew[10] is 10

vnew[11] is 11; vnew2[11] is 11;vnew[11] is 11

vnew[12] is 12; vnew2[12] is 12;vnew[12] is 12

vnew[13] is 13; vnew2[13] is 13;vnew[13] is 13

vnew[14] is 14; vnew2[14] is 14;vnew[14] is 14

vnew[15] is 15; vnew2[15] is 15;vnew[15] is 15

vnew[16] is 16; vnew2[16] is 16;vnew[16] is 16

vnew[17] is 17; vnew2[17] is 17;vnew[17] is 17

vnew[18] is 18; vnew2[18] is 18;vnew[18] is 18

vnew[19] is 19; vnew2[19] is 19;vnew[19] is 19

[ydd@localhost tmp]$

[/codebox]

I think that’s what you wanted, isn’t it?

What card do you have in your machine, and what’s your compile command?

Thanks YDD! Yes, that’s what I wanted. It’s really strange. I’m getting different results than yours! Could it be the system configuration?

Machine: Tesla S870.

Tesla S870 GPU Computing System

* Four GPUs (128 thread processors per GPU)

* 6 GB of system memory (1.5 GB dedicated memory per GPU)

* Standard 19”, 1U rack-mount chassis

* Connects to host via cabling to a low power PCI Express x8 or x16 adapter card

* Configuration: 2 PCI Express connectors driving 2 GPUs each (4 GPUs total)

Commands:

nvcc mytest.cu -o mytest

./mytest

Here is my output (I corrected the third column label from “vnew” to “vnew3”).

[codebox]vnew[0] is 0; vnew2[0] is 0;vnew3[0] is 3

vnew[1] is 1; vnew2[1] is 1;vnew3[1] is 3

vnew[2] is 2; vnew2[2] is 2;vnew3[2] is 3

vnew[3] is 3; vnew2[3] is 3;vnew3[3] is 3

vnew[4] is 4; vnew2[4] is 4;vnew3[4] is 3

vnew[5] is 5; vnew2[5] is 5;vnew3[5] is 3

vnew[6] is 6; vnew2[6] is 6;vnew3[6] is 3

vnew[7] is 7; vnew2[7] is 7;vnew3[7] is 3

vnew[8] is 8; vnew2[8] is 8;vnew3[8] is 3

vnew[9] is 9; vnew2[9] is 9;vnew3[9] is 3

vnew[10] is 10; vnew2[10] is 10;vnew3[10] is 3

vnew[11] is 11; vnew2[11] is 11;vnew3[11] is 3

vnew[12] is 12; vnew2[12] is 12;vnew3[12] is 3

vnew[13] is 13; vnew2[13] is 13;vnew3[13] is 3

vnew[14] is 14; vnew2[14] is 14;vnew3[14] is 3

vnew[15] is 15; vnew2[15] is 15;vnew3[15] is 3

vnew[16] is 16; vnew2[16] is 16;vnew3[16] is 3

vnew[17] is 17; vnew2[17] is 17;vnew3[17] is 3

vnew[18] is 18; vnew2[18] is 18;vnew3[18] is 3

vnew[19] is 19; vnew2[19] is 19;vnew3[19] is 3

[/codebox]

The S870 is compute capability 1.0, so you can’t do double precision on it.

The S870 does not support double precision (it has G80 GPUs).
You need an S1070 for double precision.

One more thing:

When I used the commands “nvcc mytest.cu -arch sm_13 -o mytest” and “./mytest”, I got the following output, where all three outputs seem to be the initial values.

[codebox]vnew[0] is 1; vnew2[0] is 2;vnew3[0] is 3

vnew[1] is 1; vnew2[1] is 2;vnew3[1] is 3

vnew[2] is 1; vnew2[2] is 2;vnew3[2] is 3

vnew[3] is 1; vnew2[3] is 2;vnew3[3] is 3

vnew[4] is 1; vnew2[4] is 2;vnew3[4] is 3

vnew[5] is 1; vnew2[5] is 2;vnew3[5] is 3

vnew[6] is 1; vnew2[6] is 2;vnew3[6] is 3

vnew[7] is 1; vnew2[7] is 2;vnew3[7] is 3

vnew[8] is 1; vnew2[8] is 2;vnew3[8] is 3

vnew[9] is 1; vnew2[9] is 2;vnew3[9] is 3

vnew[10] is 1; vnew2[10] is 2;vnew3[10] is 3

vnew[11] is 1; vnew2[11] is 2;vnew3[11] is 3

vnew[12] is 1; vnew2[12] is 2;vnew3[12] is 3

vnew[13] is 1; vnew2[13] is 2;vnew3[13] is 3

vnew[14] is 1; vnew2[14] is 2;vnew3[14] is 3

vnew[15] is 1; vnew2[15] is 2;vnew3[15] is 3

vnew[16] is 1; vnew2[16] is 2;vnew3[16] is 3

vnew[17] is 1; vnew2[17] is 2;vnew3[17] is 3

vnew[18] is 1; vnew2[18] is 2;vnew3[18] is 3

vnew[19] is 1; vnew2[19] is 2;vnew3[19] is 3

[/codebox]

I see. To be more specific, I am using an XE320/Tesla cluster with 16 XE320 compute nodes, and I ran the code on one node.

By the way, I couldn’t find any detailed documentation about using the cluster. For example, I updated the /etc/vimhrc settings on the head node, but the change does not replicate to all nodes.

Could any of you please suggest some documents? You can send them to my inbox if possible. Thanks a lot.

You need to upgrade to the S1070 if you want double precision (as tmurray and mfatica said). I’m a bit surprised that the CUDA runtime didn’t whinge when your Compute 1.3 kernel tried to run on the Compute 1.0 GPU, but it’s not something you should expect to work. As for using the cluster itself… that’s something you need to discuss with the people who run it - such things have a tendency to be uniquely temperamental :)
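On the runtime staying silent: the runtime generally does report launch failures, but only if the program asks for them. Below is a minimal sketch of the test with explicit error checks. The host code is an assumption, since the attached source is not shown in the thread; `check` is a hypothetical helper, and `cudaDeviceSynchronize` assumes a newer toolkit than the one in use here (older toolkits offered `cudaThreadSynchronize` instead).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 20

// Hypothetical helper: abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

__global__ void mytest(int *test1, double *test2) {
    int idx = threadIdx.x;
    test1[idx] = idx;
    test2[idx] = idx;
}

int main() {
    int h1[N];
    double h2[N];
    int *d1 = NULL;
    double *d2 = NULL;

    check(cudaMalloc((void **)&d1, N * sizeof(int)), "cudaMalloc(d1)");
    check(cudaMalloc((void **)&d2, N * sizeof(double)), "cudaMalloc(d2)");
    check(cudaMemset(d1, 0, N * sizeof(int)), "cudaMemset(d1)");
    check(cudaMemset(d2, 0, N * sizeof(double)), "cudaMemset(d2)");

    mytest<<<1, N>>>(d1, d2);
    // A kernel built for an architecture the GPU lacks should surface here.
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    check(cudaMemcpy(h1, d1, N * sizeof(int), cudaMemcpyDeviceToHost), "cudaMemcpy(h1)");
    check(cudaMemcpy(h2, d2, N * sizeof(double), cudaMemcpyDeviceToHost), "cudaMemcpy(h2)");

    for (int i = 0; i < N; ++i) {
        printf("test1[%d] is %d; test2[%d] is %g\n", i, h1[i], i, h2[i]);
    }

    check(cudaFree(d1), "cudaFree(d1)");
    check(cudaFree(d2), "cudaFree(d2)");
    return 0;
}
```

Without the `cudaGetLastError` check, a kernel that never ran looks exactly like the output above: every array simply keeps its initial values.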

I guess I will use float instead then.

Thanks YDD, and thank you all for the help!