OpenCL on Linux woes

gpuxplorer · March 22, 2017, 12:41pm

Hi *,

Developing OpenCL on Linux (on NVIDIA) certainly feels like the bad part of Cinderella’s life.
No IDE, no Debugger, no nothing.(*) Dancing blindfolded in a mine field.

NVIDIA certainly makes it abundantly clear that OpenCL is an unwanted child.
Well - I’m stuck with an NVIDIA GPU in my notebook and I certainly cannot tear it out and replace it with something else. I also most certainly will NOT program in CUDA, as I need a cross-vendor, GPU-accelerated executable.

So what are my options? The forum. ;-)

Let’s start with my normal mode of R&D operation. I type the program in emacs, then build it with make and gcc. Because I make no programming errors :-D, all works fine and happily ever after - I do not even need gdb.

Enter Nvidia-OpenCL:

*** Error in `oclprog': double free or corruption (out): 0x000000000333a440 ***
======= Backtrace: =========
/usr/lib/libc.so.6(+0x722ab)[0x7f164bc492ab]
/usr/lib/libc.so.6(+0x7890e)[0x7f164bc4f90e]
/usr/lib/libc.so.6(+0x7911e)[0x7f164bc5011e]
/usr/lib/libnvidia-opencl.so.1(+0xd11b0)[0x7f162b0b81b0]
/usr/lib/libnvidia-opencl.so.1(+0xb59f5)[0x7f162b09c9f5]
/usr/lib/libnvidia-opencl.so.1(+0xb5dd7)[0x7f162b09cdd7]
/usr/lib/libnvidia-opencl.so.1(+0xd78ba)[0x7f162b0be8ba]
/usr/lib/libnvidia-opencl.so.1(+0xce250)[0x7f162b0b5250]
/usr/lib/libnvidia-opencl.so.1(+0xce6d0)[0x7f162b0b56d0]
/usr/lib/libOpenCL.so.1(clEnqueueNDRangeKernel+0x62)[0x7f164bf836d2]
oclprog[0x401b8e]
/usr/lib/libc.so.6(__libc_start_main+0xf1)[0x7f164bbf7511]
oclprog[0x4021aa]
======= Memory map: ========
00400000-0043b000 r-xp 00000000 103:05 5386509

Sometimes it’s libnvidia-opencl, sometimes it’s the compiler; sometimes it’s “double free or corruption”, sometimes it’s “corrupted double-linked list”. As I said: minefield.

Yes, the hardware (Quadro M2000M in a Lenovo P50) is OK. Yes, the program is OK. Actually, everything was OK until the kernel grew a little more complex a week ago. Then both the compiler and the executables started to exhibit very weird behavior. I guess I’m hitting some hidden limit, but because there is no sane error message I cannot be sure.

My software may be a little bit recent:

CUDA 8.0.61
gcc 6.3.1
glibc 2.25
nvidia 378.13
@Arch Linux

Now I wonder: what can I do to escape this nightmare called OpenCL@NVIDIA development that I’ve been experiencing for the past week? And so you know what I mean by nightmare:

if (memcmp((uchar*)sha256_in32,(uchar*)sha256_in64)) {
    //printf("Not same!\n");
  }

  sha256_u(sha256_in64, sha256_out);
  ripemd160_transform(sha256_out, ripemd160_out);

  if (bloom_chk_hash160(bloom, ripemd160_out)) {
     printfind(ripemd160_out, 'u', privkey, idx);
  }

works. If I uncomment the printf → segmentation fault. (Other printfs work.)

If I do

if (memcmp((uchar*)sha256_in32,(uchar*)sha256_in64)) {
    //printf("Not same!\n");

    sha256_u(sha256_in64, sha256_out);
    ripemd160_transform(sha256_out, ripemd160_out);

    if (bloom_chk_hash160(bloom, ripemd160_out)) {
       printfind(ripemd160_out, 'u', privkey, idx);
    }
  }

segmentation fault. If I do

if (memcmp((uchar*)sha256_in32,(uchar*)sha256_in64)) {
    //printf("Not same!\n");
  }
  else {
    sha256_u(sha256_in64, sha256_out);
    ripemd160_transform(sha256_out, ripemd160_out);

    if (bloom_chk_hash160(bloom, ripemd160_out)) {
       printfind(ripemd160_out, 'u', privkey, idx);
    }
  }

segmentation fault. It’s a friggin’ nightmare!

gpuxplorer · March 22, 2017, 1:09pm

Uncommenting the “Not same!” printf:

*** Error in `oclprog': corrupted double-linked list: 0x00000000031ef690 ***
======= Backtrace: =========
/usr/lib/libc.so.6(+0x722ab)[0x7f5ae48c22ab]
/usr/lib/libc.so.6(+0x7890e)[0x7f5ae48c890e]
/usr/lib/libc.so.6(+0x7aec0)[0x7f5ae48caec0]
/usr/lib/libc.so.6(__libc_malloc+0x54)[0x7f5ae48cc674]
/usr/lib/libc.so.6(_IO_file_doallocate+0x8c)[0x7f5ae48b7b6c]
/usr/lib/libc.so.6(_IO_doallocbuf+0x46)[0x7f5ae48c6286]
/usr/lib/libc.so.6(_IO_file_overflow+0x1d8)[0x7f5ae48c5578]
/usr/lib/libc.so.6(_IO_file_xsputn+0xb6)[0x7f5ae48c4636]
/usr/lib/libc.so.6(_IO_vfprintf+0x10f)[0x7f5ae489891f]
/usr/lib/libc.so.6(_IO_printf+0xa6)[0x7f5ae48a0ea6]
oclprog[0x401cf4]
/usr/lib/libc.so.6(__libc_start_main+0xf1)[0x7f5ae4870511]
oclprog[0x4021aa]

Where can I get OpenCL 2.0?

gpuxplorer · March 22, 2017, 1:42pm

The libnvidia-compiler also wants its own share:

*** Error in `oclprog': corrupted double-linked list: 0x0000000002b79990 ***
======= Backtrace: =========
/usr/lib/libc.so.6(+0x722ab)[0x7f921d0622ab]
/usr/lib/libc.so.6(+0x7890e)[0x7f921d06890e]
/usr/lib/libc.so.6(+0x78c9c)[0x7f921d068c9c]
/usr/lib/libc.so.6(+0x797a0)[0x7f921d0697a0]
/usr/lib/libnvidia-compiler.so.378.13(+0xabe2c7)[0x7f91e5a6d2c7]
/usr/lib/libnvidia-compiler.so.378.13(+0x202095)[0x7f91e51b1095]
/usr/lib/libnvidia-compiler.so.378.13(+0x2021a9)[0x7f91e51b11a9]
/usr/lib/libnvidia-compiler.so.378.13(+0x23c9bd)[0x7f91e51eb9bd]
/usr/lib/libnvidia-compiler.so.378.13(+0x269c60)[0x7f91e5218c60]
/usr/lib/libnvidia-compiler.so.378.13(+0x269f68)[0x7f91e5218f68]
/usr/lib/libnvidia-compiler.so.378.13(+0x25ffd1)[0x7f91e520efd1]
/usr/lib/libnvidia-compiler.so.378.13(+0x25ed91)[0x7f91e520dd91]
/usr/lib/libnvidia-compiler.so.378.13(+0x25f48e)[0x7f91e520e48e]
/usr/lib/libnvidia-compiler.so.378.13(+0x2c4eed)[0x7f91e5273eed]
/usr/lib/libnvidia-compiler.so.378.13(+0x2c4f15)[0x7f91e5273f15]
/usr/lib/libc.so.6(+0x366c0)[0x7f921d0266c0]
/usr/lib/libc.so.6(+0x3671a)[0x7f921d02671a]
/usr/lib/libc.so.6(__libc_start_main+0xf8)[0x7f921d010518]

pyopyopyo · March 22, 2017, 2:16pm

I think memcmp() requires three arguments, but your code passes only two.
That might cause undefined behavior.

Did you check your GCC’s warnings or errors?
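
For comparison, here is a minimal sketch of a three-argument byte comparison in plain C. The helper name bytes_differ is made up for illustration and does not appear in the code above. I am also assuming that the kernel's memcmp is a user-supplied helper, since as far as I know OpenCL C does not ship the C standard library.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post).
 * Returns 0 if the first n bytes of a and b are equal, 1 otherwise.
 * The explicit length n is the third argument that the two-argument
 * call in the original code leaves out. */
int bytes_differ(const unsigned char *a, const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return 1; /* first mismatch found */
        }
    }
    return 0; /* all n bytes matched (also the result for n == 0) */
}
```

With the length spelled out, a call such as bytes_differ(a, b, 32) compares exactly 32 bytes and nothing beyond them; the 32 is only an example length, not a value taken from the original code.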

birdie · March 22, 2017, 7:10pm

Doesn’t memcmp() require three arguments, which renders your whole rant slightly invalid?

You might want to build your future applications with: -Wall -Wshadow -Wpointer-arith -Wcast-qual -Wstrict-prototypes -Wmissing-prototypes

and also -Wpedantic with -std=XXX

and also -Werror -pedantic-errors

gpuxplorer · March 27, 2017, 11:50am

The problem - guys - is the printf.

If I throw out the memcmp completely and leave more than one printf in the OpenCL code → boom

This is nowhere near the rant it should be.

@birdie:
Thanks for the GCC option suggestions, but you are certainly aware I was posting OpenCL code (which isn’t compiled by gcc)?

On Tesla K80, even a single printf in the OpenCL code crashes the app.

Upon 1st invocation:

corrupted double-linked list: 0x00000000026d8680 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f222f3097e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x81f88)[0x7f222f313f88]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f222f3155d4]
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1(+0x269483)[0x7f220e8ab483]
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1(+0x2fff33)[0x7f220e941f33]
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1(+0xcf385)[0x7f220e711385]
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1(+0xc5e0d)[0x7f220e707e0d]
./oclprog[0x402583]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f222f2b2830]
./oclprog[0x402b59]
...

(On Maxwell and Pascal chips this works - actually.)

Any subsequent invocation:

oclprog: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted (core dumped)

Now the magic: Remove the single “printf” in the OpenCL code:

tadaaa. The OpenCL prog runs - albeit with no output.

Sidenote: these gcc compiler options were there from the dawn of time

-Wall -Wextra -Wno-pointer-sign -Wno-sign-compare -pedantic -std=gnu99

but that’s irrelevant. The crash happens when the OpenCL app tries to printf something.

Funny thing is, the app is not even threaded (it does not use nptl), so that pthread_mutex_lock message must come from one of the linked libs:

        linux-vdso.so.1 =>  (0x00007ffe355df000)
        libOpenCL.so.1 => /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 (0x00007f0201073000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0200caa000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f0200aa5000)
        /lib64/ld-linux-x86-64.so.2 (0x000055e61a026000)

chemal · March 27, 2017, 4:56pm

The latest supported compiler is gcc 5.3. See also here: https://devtalk.nvidia.com/default/topic/949770/cuda-8-0rc-supporting-gcc6-/
