ArrayFire lib performance across L4T/CUDA new versions

Hi,

Could you help to provide complete failure logs?

Hi Honey_Patouceul,

I set AF_CUDA_MAX_JIT_LEN to 19, 20 and 30, and all runs passed.
Tested on JetPack 3.2 DP / TX2.

nvidia@tegra-ubuntu:~/arrayfire/build$ ./test/jit_cuda --gtest_filter=JIT.ISSUE_1646
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN      ] JIT.ISSUE_1646
[       OK ] JIT.ISSUE_1646 (3834 ms)
[----------] 1 test from JIT (3834 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3835 ms total)
[  PASSED  ] 1 test.

Hi AastaLLL and carolyu,

Thanks for your help and happy new year.

That’s interesting… in my case I get:

head -1 /etc/nv_tegra_release 
# R28 (release), REVISION: 2.0, GCID: 10136452, BOARD: t186ref, EABI: aarch64, DATE: Fri Dec  1 14:20:33 UTC 2017

/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery/deviceQuery 
/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 7851 MBytes (8232407040 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            1301 MHz (1.30 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS


./test/jit_cuda --gtest_filter=JIT.ISSUE_1646
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN      ] JIT.ISSUE_1646
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::evalNodes(std::vector<cuda::Param<T> >&, std::vector<cuda::JIT::Node*>) [with T = float]
In file src/backend/cuda/jit.cpp:471
CU Error (701)


In function af::array& af::array::operator+=(const af::array&)
In file src/api/cpp/array.cpp:814" thrown in the test body.
[  FAILED  ] JIT.ISSUE_1646 (3129 ms)
[----------] 1 test from JIT (3130 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3130 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] JIT.ISSUE_1646

 1 FAILED TEST

So the question is why it fails on my TX1/TX2 but succeeds on yours.
Could you attach your arrayfire build CMakeCache.txt so that I can check whether an option is different?

Thanks again

Hi Honey_Patouceul,

Attached is my CMakeCache.txt for you to check.
CMakeCache.txt (56.9 KB)

Thanks for providing this. Indeed, and that’s good news, the Release build works.
You should be able to reproduce the error with CMAKE_BUILD_TYPE=Debug or the default (undefined).
Is the -O3 flag mandatory for properly using the CUDA JIT?

It seems not related to optimization, but rather linked to the option “--device-debug” being passed to nvrtcCompileProgram(). Commenting out line 286 in src/backend/cuda/jit.cpp allows a debug build to pass the test.
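For context, the failing path boils down to an NVRTC JIT compile whose option list includes --device-debug in Debug builds. Below is a minimal, hypothetical sketch of such a compile — this is not ArrayFire's actual jit.cpp code; the kernel and the option handling are made up for illustration, but the NVRTC calls are the real API:

```cuda
// Hypothetical sketch of an NVRTC JIT compile. The only difference between
// the passing (Release) and failing (Debug) builds discussed above is whether
// "--device-debug" ends up in the option list.
#include <nvrtc.h>
#include <cstdio>
#include <vector>

static const char *kSource =
    "extern \"C\" __global__ void axpy(float a, const float *x, float *y, int n) {\n"
    "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "    if (i < n) y[i] = a * x[i] + y[i];\n"
    "}\n";

int main(int argc, char **argv) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kSource, "axpy.cu", 0, nullptr, nullptr);

    std::vector<const char *> opts = {"--gpu-architecture=compute_62"};  // TX2 is sm_62
    bool debugBuild = (argc > 1);               // pass any argument to mimic the Debug build
    if (debugBuild)
        opts.push_back("--device-debug");       // the flag linked to CU Error 701 in this thread

    nvrtcResult res = nvrtcCompileProgram(prog, (int)opts.size(), opts.data());
    if (res != NVRTC_SUCCESS) {
        size_t logSize = 0;
        nvrtcGetProgramLogSize(prog, &logSize);
        std::vector<char> log(logSize);
        nvrtcGetProgramLog(prog, log.data());
        fprintf(stderr, "%s\n", log.data());
    }
    nvrtcDestroyProgram(&prog);
    return res == NVRTC_SUCCESS ? 0 : 1;
}
```

With --device-debug the generated code is unoptimized and carries debug info, which changes per-thread register and local-memory usage — the resource budget the later replies in this thread discuss.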

Hi,

Thanks for looking into this.
Here is the nvrtc document: http://docs.nvidia.com/cuda/nvrtc/index.html

It looks like the --device-debug flag only outputs some debug information.
We will ask our internal team for more information and update you.

Thanks

Using the debug build on my TX2-R28.2-DP with cuda-gdb I see this warning before failure:

cuda-gdb test/jit_cuda 
NVIDIA (R) CUDA Debugger
9.0 release
Portions Copyright (C) 2007-2017 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-elf-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from test/jit_cuda...done.
(cuda-gdb) set args --gtest_filter=JIT.ISSUE_1646
(cuda-gdb) run
Starting program: /media/nvidia/Data/arrayfire/github/build/TX2-R28.2-DP_Debug/test/jit_cuda --gtest_filter=JIT.ISSUE_1646
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN      ] JIT.ISSUE_1646
[New Thread 0x7f93b35240 (LWP 19089)]
warning: Cuda API error detected: cuLaunchKernel returned (0x2bd)

unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::evalNodes(std::vector<cuda::Param<T> >&, std::vector<cuda::JIT::Node*>) [with T = float]
In file src/backend/cuda/jit.cpp:471
CU Error (701)

In function af::array& af::array::operator+=(const af::array&)
In file src/api/cpp/array.cpp:814" thrown in the test body.
[  FAILED  ] JIT.ISSUE_1646 (57409 ms)
[----------] 1 test from JIT (57410 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (57411 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] JIT.ISSUE_1646

 1 FAILED TEST
[Thread 0x7f93b35240 (LWP 19089) exited]
[Inferior 1 (process 19078) exited with code 01]

What does the 0x2bd return value mean for cuLaunchKernel? Is it CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES = 701?

Hi, Honey

Thanks for this information. We have passed this log to the internal developer.
We are still checking this issue and will update here once we have a further suggestion.

Thanks.

Hi,

Error 701 is CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, indicating the application may have run out of resources.
Our guess is that the compiler ends up using more registers in the --device-debug path (due to un-optimized code), so if the application launches more threads the kernel launch may fail.

To give a further suggestion, could you help to provide the following logs?
1. The PTX from calling nvrtcGetPTX() in your application
2. The compute cache:

rm -rf ~/.nv/ComputeCache
export CUDA_FORCE_PTX_JIT=1
run the application
zip/tar ~/.nv/ComputeCache
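For request 1, the PTX dump can be sketched as below. This is a hedged example, not ArrayFire's code: `dumpPtx` is a made-up helper, and `prog` is assumed to be an nvrtcProgram on which nvrtcCompileProgram() has already succeeded.

```cuda
// Write the PTX of an already-compiled nvrtcProgram to a file.
#include <nvrtc.h>
#include <cstdio>
#include <vector>

void dumpPtx(nvrtcProgram prog, const char *path) {
    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);        // size includes the trailing NUL
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());

    FILE *f = fopen(path, "w");
    if (f) {
        fwrite(ptx.data(), 1, ptxSize, f);
        fclose(f);
    }
}
```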

Thanks.

Hi AastaLLL,

Thanks for your support.
Attached is the log for AF_CUDA_MAX_JIT_LEN set to 19 (Note that default value is 100).
Let me know if you need more experiments.
jit.issue_1646.log (85.6 KB)
ComputeCache.tar.gz (31.9 KB)

Thanks.

Will update information with you later.

Hi,

Please help to provide the following information:

1. nvrtcGetPTX() log of successful case
2. How many threads/block is launched?

Thanks.

Hi,

Sorry, I’ve modified the source for tracking another problem.
So here are the logs for the working (18) and failing (19) values of AF_CUDA_MAX_JIT_LEN with a new build.

The working version’s log seems to show a much bigger debug_info section than the failing one, but I’m not sure how to interpret that, since in the working case five kernels are launched.

Thanks.
jit_issue_working_18.log (393 KB)
ComputeCache_working_18.tar.gz (27.7 KB)
jit_issue_failing_19.log (85.6 KB)
ComputeCache_failing_19.tar.gz (31.9 KB)

Thanks.
Will update information with you later.

Hi,
Any update on this topic?
Will this be fixed in the next release?

Thanks.

Hi, Honey_Patouceul

We are still checking the root cause.
This issue has been fed back to our CUDA driver team and is being prioritized internally.

Thanks.

Here is our internal update:

Please ask the application writer to review launch bounds doc here:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds

If the application specifies launch bounds correctly, and every launch uses a number of threads that matches those bounds, then it should not hit the OUT_OF_RESOURCES error.
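The launch-bounds contract referenced above can be illustrated with a minimal, hypothetical kernel (this is not one of ArrayFire's JIT kernels):

```cuda
// __launch_bounds__(maxThreadsPerBlock) tells the compiler that no launch of
// this kernel will ever use more than 256 threads per block, so it may budget
// registers per thread accordingly.
__global__ void __launch_bounds__(256)
scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}
```

Usage: `scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);` respects the bound; launching with, say, 512 threads per block violates it and typically fails at launch time with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, the same error 701 seen in this thread.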

Thanks.

Hi AastaLLL,

Thanks for your update. However, I’m not sure it answers this problem.
The number of threads for ArrayFire JIT kernels is fixed (256 or 1024).

Maybe it is using too much constant memory (which is used to store the kernel parameters).
Considering that it works in release mode but not in debug mode (with --device-debug), does debug mode put additional data into constant memory, reducing the space available for kernel parameters?
Is there a way to know, before the call, how much constant memory would be available for kernel parameters?
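One way to inspect this per kernel, rather than per device, is the driver API's cuFuncGetAttribute(). The sketch below is hedged: `reportKernelResources` is a made-up helper, and `func` is assumed to come from cuModuleGetFunction() on the module built from the NVRTC PTX.

```cuda
// Query a JIT-ed kernel's actual resource usage before launching it.
#include <cuda.h>
#include <cstdio>

void reportKernelResources(CUfunction func) {
    int regs = 0, constBytes = 0, localBytes = 0, maxThreads = 0;
    cuFuncGetAttribute(&regs,       CU_FUNC_ATTRIBUTE_NUM_REGS,              func);
    cuFuncGetAttribute(&constBytes, CU_FUNC_ATTRIBUTE_CONST_SIZE_BYTES,      func);
    cuFuncGetAttribute(&localBytes, CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES,      func);
    cuFuncGetAttribute(&maxThreads, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, func);
    printf("regs/thread=%d const=%dB local/thread=%dB maxThreads/block=%d\n",
           regs, constBytes, localBytes, maxThreads);
    // A --device-debug build would typically show higher regs/localBytes, and
    // a lower maxThreads/block, than a release build of the same kernel.
}
```

In particular, comparing CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK against the fixed 256/1024 launch size before calling cuLaunchKernel would make the out-of-resources condition visible ahead of time.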

Thanks.

Hi,

As mentioned in comment #17, error 701 is CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, so most likely the code is running out of resources.
One possibility is that the compiler ends up using more registers in the debug path, and if the code launches more threads the program may fail.

Available registers can be checked with CUDA deviceQuery sample:

Total number of registers available per block: 65536

The registers used by the application can be profiled via NVVP.

Thanks.