sinf() - cosf() doesn't work on GPU?

Hi,

I wanted to try the second example on this page: http://www.pgroup.com/lit/articles/insider/v4n1a1b.htm. The source code is also located at $PGI/linux86-64/13.10/etc/samples/openacc/acc_c2.c.

I compiled it with the following:

pgcc -o acc_c2i.exe acc_c2i.c -acc -Minfo=accel -Mcuda

Compilation is OK, but there is a runtime error:

acc_c2i.exe: acc_c2i.c:74: main: Assertion `__builtin_fabsf(r[i] - e[i]) < 0.000001f' failed.

Accelerated code is quite simple:

#pragma acc kernels loop
for( i = 0; i < n; ++i ){
    s = sinf(a[i]);
    c = cosf(a[i]);
    r[i] = s*s + c*c;
}

I inserted some lines to print the data:

...
chost = usec(t2,t3);
for( i = 0; i < n; ++i )
    printf("%f : %f - %f\n", a[i], r[i], e[i]);
/* check the results */
for( i = 0; i < n; ++i )
    assert( fabsf(r[i] - e[i]) < 0.000001f );
...

This is the result:

$ ./acc_c2i.exe 5
2.000000 : -nan - 1.000000
4.000000 : -nan - 1.000000
6.000000 : -nan - 1.000000
8.000000 : -nan - 1.000000
10.000000 : -nan - 1.000000
acc_c2i.exe: acc_c2i.c:73: main: Assertion `__builtin_fabsf(r[i] - e[i]) < 0.000001f' failed.

Any idea why the GPU doesn’t calculate sinf() / cosf()?

Thanks for any help!

Best regards,
Simi

Hi Simi,

I tested that example here and it seems fine for my runs. My guess is something else is going on which is causing the kernel to fail and your results are returning garbage.

Does the -Minfo compiler feedback message report any issues?
Are there any error messages from the device?
Could Bumblebee be capturing device errors and not reporting them?
Are you able to successfully run a CUDA C program?

  • Mat

Hi Mat,

Here is the compiler log:

$ pgcc -o acc_c2i.exe acc_c2i.c -acc -Minfo=all
main:
54, Generating present_or_copyout(r[0:n])
Generating present_or_copyin(a[0:n])
Generating NVIDIA code
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
Generating compute capability 3.0 binary
55, Loop is parallelizable
Accelerator kernel generated
55, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

I ran it with the debug option:

$ PGI_ACC_DEBUG=1 optirun ./acc_c2.exe
ACC: detected 1 CUDA devices
ACC: device[1] is NVIDIA CUDA device 0
argument memory for queue 8 device:0x500100000 host:0x200000000
ACC: initialized 1 CUDA devices
ACC: device[2] is PGI native
ACC: device[0] is PGI native
pinitialize for thread 1
curr_devid for thread 1 is 1
pgi_uacc_begin( compute region, file=/opt/pgi/linux86-64/13.10/etc/samples/openacc/acc_c2i.c, function=main, lines=32:76, startline=54, endline=60, devid=0, threadid=1 )
pgi_uacc_begin( file=/opt/pgi/linux86-64/13.10/etc/samples/openacc/acc_c2i.c, function=main, lines=32:76, startline=54, endline=60, devid=1, threadid=1 ) dindex=1
pgi_uacc_enter( devid=1 )
pgi_uacc_dataon( devid=1, threadid=1 )
pgi_uacc_dataon(devptr=0x0,hostptr=0x7f959f894010,poffset=0,offset=0,stride=1,size=1000000,extent=-1,eltsize=4,lineno=54,name=r,flags=0xb00=create+present+copyout,threadid=1)
pgi_uacc_dataon( devid=1, threadid=1 ) dindex=1
NO map for host:0x7f959f894010
pgi_uacc_alloc(size=4000000,devid=1,threadid=1)
allocate device memory 0x500200000(4000256B)
pgi_uacc_alloc(size=4000000,devid=1,threadid=1) returns 0x500200000
map dev:0x500200000 host:0x7f959f894010 size:4000000 offset:0 data[dev:0x500200000 host:0x7f959f894010 size:4000000] (line:54 name:r)
alloc done with devptr at 0x500200000
pgi_uacc_dataon( devid=1, threadid=1 )
pgi_uacc_dataon(devptr=0x0,hostptr=0x7f959fc65010,poffset=0,offset=0,stride=1,size=1000000,extent=-1,eltsize=4,lineno=54,name=a,flags=0x700=create+present+copyin,threadid=1)
pgi_uacc_dataon( devid=1, threadid=1 ) dindex=1
NO map for host:0x7f959fc65010
pgi_uacc_alloc(size=4000000,devid=1,threadid=1)
allocate device memory 0x5005e0000(4000256B)
pgi_uacc_alloc(size=4000000,devid=1,threadid=1) returns 0x5005e0000
map dev:0x5005e0000 host:0x7f959fc65010 size:4000000 offset:0 data[dev:0x5005e0000 host:0x7f959fc65010 size:4000000] (line:54 name:a)
alloc done with devptr at 0x5005e0000
pgi_uacc_dataupx(devptr=0x5005e0000,hostptr=0x7f959fc65010,poffset=0,offset=0,stride=1,size=1000000,extent=-1,eltsize=4,lineno=54,name=a,flags=0x0,threadid=1)
pgi_uacc_cuda_dataup1(devdst=0x5005e0000,hostsrc=0x7f959fc65010,offset=0,stride=1,size=1000000,eltsize=4,lineno=54,name=a,thread=0)
pgi_uacc_datadone( async=-1, devid=1 )
pgi_uacc_cuda_wait(lineno=-1,async=-1,dindex=1)
pgi_uacc_cuda_wait(sync on stream=(nil))
pgi_uacc_cuda_wait done
pgi_uacc_launch funcnum=4202208 argptr=(nil) sizeargs=0x7fffeb0f31c0 async=140737137029560 devid=-1
pgi_uacc_dataoff(devptr=0x5005e0000,hostptr=0x7f959fc65010,poffset=0,offset=0,stride=1,size=1000000,extent=-1,eltsize=4,lineno=60,name=a,flags=0x700=create+present+copyin,threadid=1)
mapped host:0x7f959fc65010 dev:0x5005e0000 offset:0 (host:0x7f959fc65010 dev:0x5005e0000 size:4000000 offset:0 data[host:0x7f959fc65010 dev:0x5005e0000 size:4000000] line:54 name:a)
unmap dev:0x5005e0000 host:0x7f959fc65010 size:4000000 offset:0 data[dev:0x5005e0000 host:0x7f959fc65010 size:4000000] (line:54 name:a)
pgi_uacc_free(ptr=0x5005e0000,devid=1,threadid=1)
save device memory 0x5005e0000(4000256B)
device data 0x5005e0000(4000256B) now available for reuse
pgi_uacc_dataoff(devptr=0x500200000,hostptr=0x7f959f894010,poffset=0,offset=0,stride=1,size=1000000,extent=-1,eltsize=4,lineno=60,name=r,flags=0xb00=create+present+copyout,threadid=1)
mapped host:0x7f959f894010 dev:0x500200000 offset:0 (host:0x7f959f894010 dev:0x500200000 size:4000000 offset:0 data[host:0x7f959f894010 dev:0x500200000 size:4000000] line:54 name:r)
pgi_uacc_datadownx(devptr=0x500200000,hostptr=0x7f959f894010,poffset=0,offset=0,stride=1,size=1000000,extent=-1,eltsize=4,lineno=60,name=r,flags=0x0,threadid=1)
pgi_uacc_cuda_datadown1(devdst=0x500200000,hostsrc=0x7f959f894010,offset=0,stride=1,size=1000000,eltsize=4,lineno=60,name=r,async=-1,dindex=1)
unmap dev:0x500200000 host:0x7f959f894010 size:4000000 offset:0 data[dev:0x500200000 host:0x7f959f894010 size:4000000] (line:54 name:r)
pgi_uacc_free(ptr=0x500200000,devid=1,threadid=1)
save device memory 0x500200000(4000256B)
device data 0x500200000(4000256B) now available for reuse
pgi_uacc_datadone( async=-1, devid=1 )
pgi_uacc_cuda_wait(lineno=-1,async=-1,dindex=1)
pgi_uacc_cuda_wait(sync on stream=(nil))
move (0x7f959f894010 <= 0x202100000, size=4000000)
pgi_uacc_cuda_wait done
acc_c2.exe: acc_c2i.c:71: main: Assertion `__builtin_fabsf(r[i] - e[i]) < 0.000001f' failed.

And with Bumblebee debug:

$ optirun -vv ./acc_c2.exe
[25250.596654] [DEBUG]Reading file: /etc/bumblebee/bumblebee.conf
[25250.597434] [DEBUG]optirun version 3.2.1 starting…
[25250.597482] [DEBUG]Active configuration:
[25250.597505] [DEBUG] bumblebeed config file: /etc/bumblebee/bumblebee.conf
[25250.597545] [DEBUG] X display: :8
[25250.597577] [DEBUG] LD_LIBRARY_PATH: /usr/lib/nvidia-current:/usr/lib32/nvidia-current
[25250.597596] [DEBUG] Socket path: /var/run/bumblebee.socket
[25250.597626] [DEBUG] Accel/display bridge: auto
[25250.597656] [DEBUG] VGL Compression: proxy
[25250.597675] [DEBUG] VGLrun extra options:
[25250.597707] [DEBUG] Primus LD Path: /usr/lib/x86_64-linux-gnu/primus:/usr/lib/i386-linux-gnu/primus
[25250.597780] [DEBUG]Using auto-detected bridge virtualgl
[25252.068063] [INFO]Response: Yes. X is active.
[25252.068154] [INFO]Running application using virtualgl.
[25252.068325] [DEBUG]Process vglrun started, PID 11835.
acc_c2.exe: acc_c2i.c:71: main: Assertion `__builtin_fabsf(r[i] - e[i]) < 0.000001f' failed.
[25252.262336] [DEBUG]SIGCHILD received, but wait failed with No child processes
[25252.262412] [DEBUG]Socket closed.
[25252.262480] [DEBUG]Killing all remaining processes.

I can’t see any suspicious lines…
Yes, I can run CUDA C successfully (and OpenCL also).

Thank You for your help!

Best regards,
Simi

Hi Simi,

I see two major problems. One, it doesn’t appear that the kernel module is getting loaded (there should be a message towards the top of the debug info starting with “Thread 1 loading module onto device 0” followed by a bunch of config information).

Second, when the runtime tries to launch the kernel, the launch configuration contains garbage because the module was never loaded:

pgi_uacc_launch funcnum=4202208 argptr=(nil) sizeargs=0x7fffeb0f31c0 async=140737137029560 devid=-1

It should be more like:

pgi_uacc_launch funcnum=0 argptr=0x7fff5b20c2a0 sizeargs=0x7fff5b20c298 async=-1 devid=1
Arguments to function 0 main_56_gpu dindex=1 threadid=1 device=1:
           1000000      32767    8257536          5    4194304          5          0          0
                 0          0          0          0          0          0          0          0
        0x000f4240 0x00007fff 0x007e0000 0x00000005 0x00400000 0x00000005 0x00000000 0x00000000
        0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000

At this point I have no idea what’s causing this. Is it a Bumblebee issue, a PGI issue, or pilot error? It sounds like I need to get a Bumblebee installation somehow and see if I can recreate your error. Let me see what I can do.

  • Mat

Hi Mat,

Please don’t bother!
I thought it would be easier to solve.
It is not a big problem for me; I can use the university’s machine instead of my laptop.

Thank You for your help!

Best regards,
Simi