call to '_ZSt20__throw_length_errorPKc' with no acc routine

Hi! I have been trying to accelerate my program, which uses the OpenCV and PCL libraries. The code is as follows:

#include <iostream>
#include <string>
using namespace std;

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>

#include "/usr/include/pcl-1.7/pcl/io/pcd_io.h"
#include "/usr/include/pcl-1.7/pcl/point_types.h"

#include <openacc.h>

typedef pcl::PointXYZRGBA PointT;
typedef pcl::PointCloud<PointT> PointCloud; 

const double camera_factor = 1000;
const double camera_cx = 325.5;
const double camera_cy = 253.5;
const double camera_fx = 518.0;
const double camera_fy = 519.0;

int main( int argc, char** argv )
{
    cv::Mat rgb, depth;
    rgb = cv::imread( "./data/rgb.png" );
    depth = cv::imread( "./data/depth.png", -1 );

    PointCloud::Ptr cloud ( new PointCloud );
  #pragma acc enter data copyin(rgb,depth)
  #pragma acc loop
    for (int m = 0; m < depth.rows; m++)
        for (int n=0; n < depth.cols; n++)
        {   
            ushort d = depth.ptr<ushort>(m)[n];
            if (d == 0)
                continue;
            PointT p;
            p.z = double(d) / camera_factor;
            p.x = (n - camera_cx) * p.z / camera_fx;
            p.y = (m - camera_cy) * p.z / camera_fy;      
            p.b = rgb.ptr<uchar>(m)[n*3];
            p.g = rgb.ptr<uchar>(m)[n*3+1];
            p.r = rgb.ptr<uchar>(m)[n*3+2];
            cloud->points.push_back( p );
        }

    cloud->height = 1;
    cloud->width = cloud->points.size();
    cloud->is_dense = false;
    pcl::io::savePCDFile( "./data/pointcloud.pcd", *cloud );
    cloud->points.clear();
    return 0;
}

When I add “#pragma acc loop” and compile it, it shows:


PGCC-S-1000-Call in OpenACC region to procedure '_ZSt20__throw_length_errorPKc' which has no acc routine information (generatePointCloud.cpp: 1340)
std::vector<pcl::PointXYZRGBA, Eigen::aligned_allocator_indirection<pcl::PointXYZRGBA>>::_M_check_len(unsigned long, const char *) const:
      5, include "core.hpp"
          49, include "types_c.h"
               58, include "float.h"
                  1896, include "core.hpp"
                         58, include "cstddef"
                              23, include "core.hpp"
                                   63, include "vector"
                                        64, include "stl_vector.h"
                                           1340, Generating implicit acc routine seq
                                           1342, Accelerator restriction: call to '_ZSt20__throw_length_errorPKc' with no acc routine information
                                        69, include "vector.tcc"
PGCC/x86 Linux 15.10-0: compilation completed with severe errors

When I replace the directive with “#pragma acc parallel loop”, it shows:


      1, include "iostream"
          39, include "ostream"
               38, include "ios"
                    42, include "ios_base.h"
                         41, include "locale_classes.h"
                              40, include "string"
                                   41, include "allocator.h"
                                        46, include "c++allocator.h"
                                             33, include "new_allocator.h"
                                                  33, include "new"
                                                      118, Generating implicit acc routine seq
PGCC-S-0155-Procedures called in a compute region must have acc routine information: __assert_fail (generatePointCloud.cpp: 653)
PGCC-S-0155-Accelerator region ignored; see -Minfo messages  (generatePointCloud.cpp: 28)
main:
     28, Accelerator region ignored
     31, Accelerator restriction: size of the GPU copy of rgbdata,depthdata is unknown
         653, Accelerator restriction: call to '__assert_fail' with no acc routine information
PGCC/x86 Linux 15.10-0: compilation completed with severe errors

Then I added the following directive to copy rgbdata and depthdata to the GPU:


#pragma acc enter data copyin(rgb,depth,camera_factor,camera_fx,camera_fy)
 #pragma acc loop

but it still shows:

PGCC-S-1000-Call in OpenACC region to procedure '_ZSt20__throw_length_errorPKc' which has no acc routine information (generatePointCloud.cpp: 1340)
std::vector<pcl::PointXYZRGBA, Eigen::aligned_allocator_indirection<pcl::PointXYZRGBA>>::_M_check_len(unsigned long, const char *) const:
      5, include "core.hpp"
          49, include "types_c.h"
               58, include "float.h"
                  1896, include "core.hpp"
                         58, include "cstddef"
                              23, include "core.hpp"
                                   63, include "vector"
                                        64, include "stl_vector.h"
                                           1340, Generating implicit acc routine seq
                                           1342, Accelerator restriction: call to '_ZSt20__throw_length_errorPKc' with no acc routine information
                                        69, include "vector.tcc"
PGCC/x86 Linux 15.10-0: compilation completed with severe errors

Because my program has to use the OpenCV and PCL libraries, I don’t know whether the problem comes from the libraries. I have been struggling with this problem for a long time and hope you can help me. Thanks!

Hi plinzhuo11253,

Unfortunately, you’ll have a lot of problems trying to accelerate this code.

First, every routine called from device code needs to have a device version available. The PGI compiler will attempt to generate one automatically, provided that the routine’s definition is visible.

Here, the compiler automatically creates a device version of a routine found in the STL vector header file, but it can’t create one for the routine that it calls:

   64, include "stl_vector.h" 
                                           1340, Generating implicit acc routine seq 
                                           1342, Accelerator restriction: call to '_ZSt20__throw_length_errorPKc' with no acc routine information

Given the name, I’m assuming this is an exception handler routine. Exception handling isn’t supported in device code.
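
The guess about the name can be checked by demangling the symbol. The following is a minimal sketch, assuming a GCC or Clang toolchain that provides <cxxabi.h>; the helper name demangle is hypothetical and not part of any library.

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical helper: demangle an Itanium C++ ABI symbol name.
// Returns the input unchanged if the name cannot be demangled.
std::string demangle(const char* mangled)
{
    int status = 0;
    char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr)
        return mangled;
    std::string result(text);
    std::free(text);  // __cxa_demangle allocates the buffer with malloc
    return result;
}
```

Calling demangle("_ZSt20__throw_length_errorPKc") gives std::__throw_length_error(char const*), the libstdc++ helper that std::vector calls to throw std::length_error when a requested size is too large.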

Another limitation in OpenACC is that aggregate data structures with dynamic data members are not supported. The compiler has no way of knowing the size or shape of a dynamic data structure, so it can’t automatically create or update such a structure on the device. The OpenACC standards committee has been investigating solutions, but it’s a very difficult problem and one not easily solved.

Currently, you will need to manage the data structures yourself by adding methods to create the data structure on the device as well as synchronize the data between the host and device.
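
As an illustration of managing the data by hand, here is a minimal sketch of the depth loop from the question rewritten over raw arrays with explicit data clauses. It rests on several assumptions that are not in the original code: Point3 is a hypothetical plain struct standing in for pcl::PointXYZRGBA, the depth image is one contiguous buffer (for cv::Mat, check isContinuous() before passing the data pointer), float is used instead of double, and the colour channels are left out. Instead of skipping zero-depth pixels and calling push_back, each iteration writes its own slot and sets a validity flag.

```cpp
// Hypothetical plain-data point; stands in for pcl::PointXYZRGBA.
struct Point3 { float x, y, z; };

// Back-project every pixel of a contiguous depth image into out[i].
// valid[i] is 1 where the depth sample is non-zero, otherwise 0.
void backproject(const unsigned short* depth, int rows, int cols,
                 float factor, float cx, float cy, float fx, float fy,
                 Point3* out, unsigned char* valid)
{
    const int n = rows * cols;
    // The array sections tell the compiler exactly how much to copy,
    // which it cannot work out from a cv::Mat or std::vector object.
    #pragma acc parallel loop copyin(depth[0:n]) copyout(out[0:n], valid[0:n])
    for (int i = 0; i < n; ++i) {
        const int row = i / cols;
        const int col = i % cols;
        const float z = depth[i] / factor;
        out[i].x = (col - cx) * z / fx;
        out[i].y = (row - cy) * z / fy;
        out[i].z = z;
        valid[i] = depth[i] != 0;
    }
}
```

Without an OpenACC compiler the pragma is ignored and the loop runs sequentially on the host, which makes it easy to check the results before moving to the GPU. The valid points can then be gathered into the point cloud on the host after the loop.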

Alternatively, you can try using CUDA Unified Memory by compiling with “-ta=tesla:managed”. It has limitations: only dynamic memory is managed, so you still need to manage static memory yourself, and your whole program can only use as much memory as is available on your device.

It might be helpful to watch a presentation I did at GTC2015, “C++ Class Management with OpenACC 2.0” (Mat Colgrove, on YouTube). It covers several of these issues.

Hope this helps,
Mat

Thanks for your early reply! :-)
I have been learning from the video that you suggested:
C++ Class Management with OpenACC 2.0; Mat Colgrove - YouTube
and have downloaded the samples it refers to, GTC2015_S5233.tar. I compiled the code without changes (except that I changed “-ta=tesla:cuda6.5” to “-ta=tesla:cuda7.0” in the Makefile), but it showed an error:

call to cuModuleLoadData returned error 209: No binary for GPU

I have seen a similar question on the forum where you mentioned that the PGI Accelerator compilers target all NVIDIA Tesla GPU accelerators. My GPU information is:

plz@plz:~$ pgaccelinfo

CUDA Driver Version:           7050
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  352.63  Sat Nov  7 21:25:42 PST 2015

Device Number:                 0
Device Name:                   GeForce GTX 650
Device Revision Number:        3.0
Global Memory Size:            2146762752
Number of Multiprocessors:     2
Number of SP Cores:            384
Number of DP Cores:            128
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1071 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             2500 MHz
Memory Bus Width:              128 bits
L2 Cache Size:                 262144 bytes
Max Threads Per SMP:           2048
Async Engines:                 1
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc30

I don’t know whether it is supported by PGI FORTRAN/C/C++ WORKSTATION. My PGI version is 15.10.
Thanks!

You probably left in the “cc35” option as well. Since your device is Compute Capability 3.0, a “cc35” binary won’t run. Either remove “cc35” or change it to “cc30”.

  • Mat

Thanks for your reply! :-)
Indeed, after I changed cc35 to cc30, it compiled successfully.
I hope you can suggest some websites or documents where I can learn how to manage aggregate data structures with dynamic data members on the device, because my program will also use std::vector. I have found a useful report about it from your video on YouTube, thank you very much. I’m really eager for other similar documents!

because my program will also use the std::vector

The current OpenACC 2.5 standard lists this under the topics deferred for a future revision. Hopefully in the 3.0 standard, but it’s a very difficult problem so we’ll need to wait and see.

For the time being, you’ll need to use CUDA Unified Memory.

Note that std::vector is not thread-safe, so be careful when using it in a parallel context. In particular, make sure the vector doesn’t get resized, which means you can’t call the routines push_back, pop_back or insert.
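
One way to follow that rule is to size the vectors once on the host, write by index inside the parallel loop, and leave any push_back to a sequential host pass afterwards. The following is only a sketch under assumptions: the function name valid_depths is hypothetical, and the loop has no data clauses because it assumes the program is compiled with “-ta=tesla:managed” so that the vectors' heap storage is reachable from the device. Without managed memory you would need explicit data clauses on the raw pointers.

```cpp
#include <vector>

// Hypothetical example: convert raw depth samples to metres and keep only
// the non-zero ones, without resizing any vector inside the parallel loop.
std::vector<float> valid_depths(const std::vector<unsigned short>& depth,
                                float factor)
{
    const long n = static_cast<long>(depth.size());
    std::vector<float> z(depth.size());            // sized once, on the host
    std::vector<unsigned char> keep(depth.size());

    // Take raw pointers first so the loop body calls no std::vector members.
    const unsigned short* in = depth.data();
    float* zp = z.data();
    unsigned char* kp = keep.data();

    // Each iteration touches only its own slot, so nothing is shared or resized.
    #pragma acc parallel loop
    for (long i = 0; i < n; ++i) {
        kp[i] = in[i] != 0;
        zp[i] = in[i] / factor;
    }

    // Sequential host pass: push_back is safe here.
    std::vector<float> result;
    result.reserve(depth.size());
    for (long i = 0; i < n; ++i)
        if (keep[i])
            result.push_back(z[i]);
    return result;
}
```

The cost of this layout is one output slot per input element, even for samples that are later discarded, in exchange for a loop with no cross-iteration dependence.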

  • Mat