ArrayFire library performance across new L4T/CUDA versions

Hi,

Sorry, I had modified the source to track another problem.
So here are the logs from a new build for the working (18) and failing (19) values of AF_CUDA_MAX_JIT_LEN.

The log for the working version shows a much bigger debug_info section than the failing one, but I'm not sure how to interpret that, since five kernels are launched in the working case.

Thanks.
jit_issue_working_18.log (393 KB)
ComputeCache_working_18.tar.gz (27.7 KB)
jit_issue_failing_19.log (85.6 KB)
ComputeCache_failing_19.tar.gz (31.9 KB)

Thanks.
We will update you with more information later.

Hi,
Any update on this topic?
Is this expected to be fixed in the next release?

Thanks.

Hi, Honey_Patouceul

We are still checking the root cause.
This issue has been reported to our CUDA driver team and is prioritized internally.

Thanks.

Here is our internal update:

Please ask the application writer to review the launch bounds documentation here:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds

If the application specifies launch bounds correctly and every launch uses a number of threads that matches those launch bounds, then it should not hit the OUT_OF_RESOURCES error.

Thanks.

Hi AastaLLL,

Thanks for your update. However, I'm not sure it answers this problem.
The number of threads for ArrayFire JIT kernels is fixed (256 or 1024).

Maybe it is using too much constant memory (which is used to store the kernel parameters).
Considering that it works in release mode but not in debug mode (with --device-debug), could debug mode be placing additional data in constant memory, reducing the space available for kernel parameters?
Is there a way to know, before the call, how much space is available for kernel parameters?

Thanks.

Hi,

As noted in comment #17, error 701 is CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, so the code is most likely running out of resources.
One possibility is that the compiler ends up using more registers in the debug path, and if the code launches many threads per block, the launch may fail.

The available registers can be checked with the CUDA deviceQuery sample:

Total number of registers available per block: 65536

The registers used by the application can be profiled with NVVP (NVIDIA Visual Profiler).
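
To make the register argument concrete, here is a rough back-of-the-envelope check in Python. It is only a sketch: the function names are made up for illustration, the 255 registers-per-thread cap is an assumption about the target GPU, and register allocation granularity is ignored, so treat the result as an estimate and not as what the driver actually computes.

```python
# Rough register-budget check for a kernel launch (illustrative only).
# Assumption: 255 registers per thread is the hardware cap on the target GPU.
MAX_REGS_PER_THREAD = 255


def max_regs_per_thread(regs_per_block: int, threads_per_block: int) -> int:
    """Largest per-thread register count that still lets the block launch."""
    return min(regs_per_block // threads_per_block, MAX_REGS_PER_THREAD)


def launch_fits(regs_per_thread: int, threads_per_block: int,
                regs_per_block: int = 65536) -> bool:
    """True if the block's total register demand fits the per-block budget."""
    return regs_per_thread * threads_per_block <= regs_per_block


if __name__ == "__main__":
    # Block sizes mentioned earlier in the thread for the JIT kernels.
    for threads in (256, 1024):
        limit = max_regs_per_thread(65536, threads)
        print(threads, "threads/block: at most", limit, "registers/thread")
```

With 65536 registers per block, a 1024-thread block leaves at most 64 registers per thread, so a debug build that needs more than that per thread would be expected to fail at launch, while the same kernel with 256 threads per block would have much more headroom.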

Thanks.