Hi everyone!
EngineUKFTs.mk (29.1 KB)
I’m testing my algorithm with OpenMP on the Jetson TX2 CPU. When I run the executable (.elf), it creates multiple threads, but they all run on a single CPU core. The test code is simple (sigmaMeasurement is an Eigen matrix):

#pragma omp parallel for
for(int i = 0;i<9;i++)
    sigmaMeasurement.col(i) = func(sigmaX.col(i));

I added -fopenmp via -Xcompiler and -lgomp via -Xlinker. The project compiles and links successfully, but there is no speedup. It seems that some setting in my .mk file for this algorithm is incorrect, because I wrote a simple demo with the same construct, compiled it with g++ -fopenmp…, and it runs correctly on 4 cores. Does anyone know why this happens?
Here are my .mk settings; the full file is attached at the top of this post.

# C Compiler: NVCC for NVIDIA Embedded Processors1.0 NVIDIA CUDA C Compiler Driver
CC = nvcc
# Linker: NVCC for NVIDIA Embedded Processors1.0 NVIDIA CUDA C Linker
LD = nvcc
# C++ Compiler: NVCC for NVIDIA Embedded Processors1.0 NVIDIA CUDA C++ Compiler Driver
CPP = nvcc
# C++ Linker: NVCC for NVIDIA Embedded Processors1.0 NVIDIA CUDA C++ Linker
CPP_LD = nvcc
# Archiver: NVCC for NVIDIA Embedded Processors1.0 Archiver
AR = ar
# MEX Tool: MEX Tool
MEX = $(MEX_PATH)/mex
# Download: Download
# Execute: Execute
# Builder: Make Tool
MAKE = make
ARFLAGS              = -ruvs
CFLAGS               = -rdc=true -Xcudafe "--diag_suppress=unsigned_compare_with_zero" \
                       -c \
                       -Xcompiler -MMD,-MP,-fopenmp \
CPPFLAGS             = -rdc=true -Xcudafe "--diag_suppress=unsigned_compare_with_zero" \
                       -c \
                       -Xcompiler -fopenmp,-MMD,-MP \
CPP_LDFLAGS          = -lm -lrt -ldl \
                       -Xlinker -lgomp,-rpath,/usr/lib32 -Xnvlink -w -lcudart -lcuda -Wno-deprecated-gpu-targets
                         -lm -lrt -ldl \
                         -Xlinker -lgomp,-rpath,/usr/lib32 -Xnvlink -w -lcudart -lcuda -Wno-deprecated-gpu-targets
LDFLAGS              = -lm -lrt -ldl \
                       -Xlinker -lgomp,-rpath,/usr/lib32 -Xnvlink -w -lcudart -lcuda -Wno-deprecated-gpu-targets
MEX_CPPFLAGS         =
MEX_CFLAGS           =
MEX_LDFLAGS          =
MAKE_FLAGS           = -f $(MAKEFILE)
SHAREDLIB_LDFLAGS    = -shared  \
                       -lm -lrt -ldl \
                       -Xlinker -lgomp,-rpath,/usr/lib32 -Xnvlink -w -lcudart -lcuda -Wno-deprecated-gpu-targets
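One thing worth checking alongside the build flags is whether the process is allowed to use more than one core at all. The following is only a diagnostic sketch, assuming a Linux/glibc target such as the TX2; the function names are made up for illustration and are not part of the project. If the "allowed" count is 1 while several CPUs are online, the threads are being confined by the process affinity mask (for example by whatever launches the .elf) rather than by a missing OpenMP flag. This is a possibility to rule out, not a confirmed cause.

```cpp
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_COUNT (Linux/glibc)
#include <unistd.h>  // sysconf
#include <cstdio>

// Hypothetical helper: number of CPUs the calling process may run on,
// taken from its affinity mask. Returns -1 if the query fails.
int allowed_cpu_count() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return CPU_COUNT(&mask);
}

// Hypothetical helper: number of CPUs the system reports as online.
int online_cpu_count() {
    return static_cast<int>(sysconf(_SC_NPROCESSORS_ONLN));
}

// Print both counts so they can be compared at program start.
void report_cpu_availability() {
    std::printf("online CPUs: %d, CPUs allowed for this process: %d\n",
                online_cpu_count(), allowed_cpu_count());
}
```

Calling report_cpu_availability() at the start of main in both the working g++ demo and the nvcc-built .elf would show whether the two processes start with the same set of usable cores.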

Could you share the complete source and the steps to reproduce, so we can check it directly?

Thanks! I’d like to share my complete source code, but some of the .cpp files contain confidential engine parameters, so I can’t.
Here is the relevant part of the code.

class UKF {
        template<class Measurement, template<class> class CovarianceBase>
        void computeSigmaPointMeasurementsIter(MeasurementModelType<Measurement, CovarianceBase>& m, SigmaPoints<Measurement>& sigmaMeasurementPoints)
        {
            #pragma omp parallel
            {
                // Split the sigma-point columns across the threads of the team.
                #pragma omp for
                for (int i = 0; i < SigmaPointCount; ++i)
                {
                    sigmaMeasurementPoints.col(i) = m.h(sigmaStatePointsIter.col(i));
                    printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
                }
            }
            std::cout << "MultiProcessor sigmaMeasurementPoints = " << sigmaMeasurementPoints << std::endl;
        }
};

We want to reproduce this issue internally first.
Would you mind wrapping the above function into a compilable source file?


