Hi,
Could you help to provide complete failure logs?
Hi Honey_Patouceul,
I set the AF_CUDA_MAX_JIT_LEN value to 19, 20 and 30, and all runs passed.
Tested on JetPack 3.2 DP / TX2.
nvidia@tegra-ubuntu:~/arrayfire/build$ ./test/jit_cuda --gtest_filter=JIT.ISSUE_1646
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN ] JIT.ISSUE_1646
[ OK ] JIT.ISSUE_1646 (3834 ms)
[----------] 1 test from JIT (3834 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3835 ms total)
[ PASSED ] 1 test.
Hi AastaLLL and carolyu,
Thanks for your help and happy new year.
That’s interesting… in my case I get:
head -1 /etc/nv_tegra_release
# R28 (release), REVISION: 2.0, GCID: 10136452, BOARD: t186ref, EABI: aarch64, DATE: Fri Dec 1 14:20:33 UTC 2017
/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X2"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 7851 MBytes (8232407040 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1301 MHz (1.30 GHz)
Memory Clock rate: 1600 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
./test/jit_cuda --gtest_filter=JIT.ISSUE_1646
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN ] JIT.ISSUE_1646
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::evalNodes(std::vector<cuda::Param<T> >&, std::vector<cuda::JIT::Node*>) [with T = float]
In file src/backend/cuda/jit.cpp:471
CU Error (701)
In function af::array& af::array::operator+=(const af::array&)
In file src/api/cpp/array.cpp:814" thrown in the test body.
[ FAILED ] JIT.ISSUE_1646 (3129 ms)
[----------] 1 test from JIT (3130 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3130 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] JIT.ISSUE_1646
1 FAILED TEST
So the question is: why does it fail on my TX1/TX2 but succeed on yours?
Could you attach your arrayfire build’s CMakeCache.txt so that I can check whether an option differs?
Thanks again
Thanks for providing this. Indeed, and that’s good news, the Release build works.
You should be able to reproduce the error with CMAKE_BUILD_TYPE=Debug or left at its default (undefined).
Is the -O3 flag mandatory for properly using the CUDA JIT?
It seems unrelated to optimization, but rather linked to the option “--device-debug” being passed to nvrtcCompileProgram(). Commenting out line 286 in src/backend/cuda/jit.cpp allows a debug build to pass the test.
Hi,
Thanks for looking into this.
Here is the nvrtc documentation: http://docs.nvidia.com/cuda/nvrtc/index.html
It looks like the --device-debug flag only makes the compiler output some debug information.
We will ask our internal team for more information and update you.
Thanks
Using the debug build on my TX2-R28.2-DP with cuda-gdb I see this warning before failure:
cuda-gdb test/jit_cuda
NVIDIA (R) CUDA Debugger
9.0 release
Portions Copyright (C) 2007-2017 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-elf-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from test/jit_cuda...done.
(cuda-gdb) set args --gtest_filter=JIT.ISSUE_1646
(cuda-gdb) run
Starting program: /media/nvidia/Data/arrayfire/github/build/TX2-R28.2-DP_Debug/test/jit_cuda --gtest_filter=JIT.ISSUE_1646
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN ] JIT.ISSUE_1646
[New Thread 0x7f93b35240 (LWP 19089)]
warning: Cuda API error detected: cuLaunchKernel returned (0x2bd)
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::evalNodes(std::vector<cuda::Param<T> >&, std::vector<cuda::JIT::Node*>) [with T = float]
In file src/backend/cuda/jit.cpp:471
CU Error (701)
In function af::array& af::array::operator+=(const af::array&)
In file src/api/cpp/array.cpp:814" thrown in the test body.
[ FAILED ] JIT.ISSUE_1646 (57409 ms)
[----------] 1 test from JIT (57410 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (57411 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] JIT.ISSUE_1646
1 FAILED TEST
[Thread 0x7f93b35240 (LWP 19089) exited]
[Inferior 1 (process 19078) exited with code 01]
What does the 0x2bd return value from cuLaunchKernel mean? Is it CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES = 701 (0x2BD)?
Hi Honey,
Thanks for this information. We have passed this log to the internal developers.
We are still checking this issue and will update this thread once we have a further suggestion.
Thanks.
Hi,
Error 701 is CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, indicating the application may have run out of resources.
Our guess is that the compiler ends up using more registers in the --device-debug path (due to un-optimized code), so if the application launches more threads the kernel launch may fail.
To give a further suggestion, could you help provide the following logs?
1. Call nvrtcGetPTX() in your application
2. Collect the compute cache
rm -rf ~/.nv/ComputeCache
export CUDA_FORCE_PTX_JIT=1
run the application
zip/tar ~/.nv/ComputeCache
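For reference, step 1 can be done with something like the sketch below. It uses the public NVRTC API only; the kernel source, the option list, and the error handling are placeholders for illustration, not ArrayFire’s actual JIT code (ArrayFire builds its own source string and, in debug builds, adds --device-debug in src/backend/cuda/jit.cpp).

```cuda
#include <nvrtc.h>
#include <cstdio>
#include <cstdlib>

// Sketch: compile a trivial kernel with NVRTC and dump the generated PTX,
// which is the log requested above. Link against -lnvrtc.
int main() {
    const char *src =
        "extern \"C\" __global__ void k(float *out) { out[threadIdx.x] = 1.f; }";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "k.cu", 0, NULL, NULL);

    // compute_62 targets the TX2; drop "--device-debug" to mimic a Release build.
    const char *opts[] = {"--gpu-architecture=compute_62", "--device-debug"};
    if (nvrtcCompileProgram(prog, 2, opts) != NVRTC_SUCCESS) {
        /* fetch and print nvrtcGetProgramLog() here */
    }

    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    char *ptx = (char *)malloc(ptx_size);
    nvrtcGetPTX(prog, ptx);   // the PTX text to attach to this thread
    fputs(ptx, stdout);

    free(ptx);
    nvrtcDestroyProgram(&prog);
    return 0;
}
```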
Thanks.
Hi AastaLLL,
Thanks for your support.
Attached is the log for AF_CUDA_MAX_JIT_LEN set to 19 (note that the default value is 100).
Let me know if you need more experiments.
jit.issue_1646.log (85.6 KB)
ComputeCache.tar.gz (31.9 KB)
Thanks.
Will update information with you later.
Hi,
Please help to provide the following information:
1. The nvrtcGetPTX() log of a successful case
2. How many threads per block are launched?
Thanks.
Hi,
Sorry, I had modified the source while tracking another problem.
So here are the logs for the working (18) and failing (19) values of AF_CUDA_MAX_JIT_LEN with a new build.
The working version’s log seems to show a much bigger debug_info section than the failing one, but I’m not sure how to interpret that, since in the working case five kernels are launched.
Thanks.
jit_issue_working_18.log (393 KB)
ComputeCache_working_18.tar.gz (27.7 KB)
jit_issue_failing_19.log (85.6 KB)
ComputeCache_failing_19.tar.gz (31.9 KB)
Thanks.
Will update information with you later.
Hi,
Any update on this topic?
Should this be fixed in the next release?
Thanks.
Hi Honey_Patouceul,
We are still checking the root cause.
This issue has been fed back to our CUDA driver team and is prioritized internally.
Thanks.
Here is our internal update:
Please ask the application writer to review launch bounds doc here:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds
If the application specifies launch bounds correctly, and every launch uses a thread count that matches those launch bounds, then it should not hit the OUT_OF_RESOURCES error.
Thanks.
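For reference, the launch-bounds qualifier described in that doc looks like this (a minimal sketch; the kernel body and names are illustrative):

```cuda
// __launch_bounds__ promises the compiler this kernel will never be
// launched with more than 256 threads per block (and asks it to fit at
// least 2 resident blocks per SM), so it may budget registers
// accordingly. Launching with MORE threads than promised can then fail
// with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES (701).
__global__ void __launch_bounds__(256 /* maxThreadsPerBlock */,
                                  2   /* minBlocksPerMultiprocessor */)
my_kernel(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] + 1.0f;
}
```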
Hi AastaLLL,
Thanks for your update. However, I’m not sure it answers this problem.
The number of threads for arrayfire JIT kernels is fixed (256 or 1024).
Maybe it is using too much constant memory (which is used to store the kernel parameters)?
Considering that it works in Release mode but not in Debug mode (with --device-debug), does the debug path put additional data into constant memory, reducing the space available for kernel parameters?
Is there a way I can know, before the call, how much constant memory would be available for kernel parameters?
Thanks.
Hi,
As in comment #17, error 701 is CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, so most likely the code is running out of resources.
One possibility is that the compiler ends up using more registers in the debug path, so if the code launches more threads the program may fail.
The available registers can be checked with the CUDA deviceQuery sample:
Total number of registers available per block: 65536
The registers used by the application can be profiled via NVVP.
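The same register budget can also be queried programmatically instead of reading it off the deviceQuery sample output (a small sketch using the CUDA runtime API; device 0 is assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: print the register limits deviceQuery reports, straight from
// cudaGetDeviceProperties(). For a specific kernel's actual register
// usage, cudaFuncGetAttributes() can be used instead of NVVP.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
    printf("Registers per block: %d\n", prop.regsPerBlock);
    printf("Registers per SM:    %d\n", prop.regsPerMultiprocessor);
    printf("Max threads/block:   %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```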
Thanks.