I tried setting NV_ACC_CUDA_HEAPSIZE=512MB, adding -Mlarge_arrays, and using both together, but the error remains.
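For reference, the two changes were applied roughly as below; the exact compile line and the input file name are placeholders for illustration, not my literal commands:

    nvc++ -acc -Mlarge_arrays -o ../bin/umcx umcx.cpp
    NV_ACC_CUDA_HEAPSIZE=512MB ../bin/umcx input.json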
Using compute-sanitizer, I saw the following curious error message (repeated many times):
========= Invalid __global__ write of size 1 bytes
========= at double MCX_kernel_726_gpu<(bool)0, (bool)1>(nlohmann::json_abi_v3_11_3::basic_json<nlohmann::json_abi_v3_11_3::ordered_map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_3::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char>>, void> &, const MCX_param &, MCX_volume<int> &, MCX_volume<float>&, float4 *, MCX_medium *, MCX_detect &)+0x660 in /drives/taote1/users/fangq/git/Project/github/umcx/src/umcx.cpp:726
========= by thread (3,0,0) in block (386,0,0)
========= Address 0x743c03943a30 is out of bounds
========= and is 226,865 bytes after the nearest allocation at 0x743c038fc000 of size 66,560 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x33255f]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:cuda_launch3 in ../../src/cuda_launch.c:822 [0x98537]
========= in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
========= Host Frame:__pgi_uacc_cuda_launch3 in ../../src/cuda_launch.c:1015 [0x9a789]
========= in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
========= Host Frame:__pgi_uacc_cuda_launch in ../../src/cuda_launche.c:169 [0xbbed0]
========= in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
========= Host Frame:__pgi_uacc_launch in ../../src/launch.c:54 [0xa3994]
========= in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
========= Host Frame:double MCX_kernel<false, true>(nlohmann::json_abi_v3_11_3::basic_json<nlohmann::json_abi_v3_11_3::ordered_map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_3::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> >, void>&, MCX_param const&, MCX_volume<int>&, MCX_volume<float>&, float4*, MCX_medium*, MCX_detect&) in /drives/taote1/users/fangq/git/Project/github/umcx/src/umcx.cpp:726 [0xe2da]
========= in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
========= Host Frame:MCX_run_simulation(char**, int) in /drives/taote1/users/fangq/git/Project/github/umcx/src/umcx.cpp:801 [0x8dba]
========= in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
========= Host Frame:main in /drives/taote1/users/fangq/git/Project/github/umcx/src/umcx.cpp:829 [0x14054]
========= in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
========= Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x29d8f]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:__libc_start_main in ../csu/libc-start.c:392 [0x29e3f]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0x6ca4]
========= in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
=========
Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 719 (CUDA_ERROR_LAUNCH_FAILED): Launch failed (often invalid pointer dereference)
File: /drives/taote1/users/fangq/git/Project/github/umcx/src/umcx.cpp
Function: _Z10MCX_kernelILb0ELb1EEdRN8nlohmann16json_abi_v3_11_310basic_jsonINS1_11ordered_mapESt6vectorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNS1_14adl_serializerES4_IhSaIhEEvEERK9MCX_paramR10MCX_volumeIiERSJ_IfEP6float4P10MCX_mediumR10MCX_detect:707
Line: 755
========= Target application returned an error
========= ERROR SUMMARY: 57282 errors
========= ERROR SUMMARY: 57182 errors were not printed. Use --print-limit option to adjust the number of printed errors
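For completeness, the sanitizer was invoked along these lines; --print-limit is the option the summary above refers to, and the input file name is again a placeholder:

    compute-sanitizer ../bin/umcx input.json
    compute-sanitizer --print-limit 1000 ../bin/umcx input.json   # prints more of the repeated errors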
It points to a global memory write error (1 byte) at line #726 of my code.
Interestingly, this line also came up in this previous post, where your fix was related to defining int ppathlen. I wonder whether this issue has any connection to that earlier one.
Anyway, it is a minor problem; I only bring it up here in case it is helpful for debugging the compiler.
I know I have asked a lot of questions (and learned a lot from you on nvc and acc). I have two more minor questions and would like to get your opinions:
- What is the best way to produce a portable binary that can be executed across multiple generations of GPUs for deployment? With my CUDA code, I have been using -arch=sm_30, or whatever is the lowest arch supported by the CUDA version (see the sketch after this list).
- If I also want the binary to be independent of the local GLIBC version, is there a flag to statically link libgcc and libstdc++? I've already used -static-nvidia. When I add -static to the nvc linker, it won't link with libcuda (also shown in the sketch below).
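To make the two bullets concrete, here is roughly what I have been doing; the file names are placeholders and the flag sets are from memory, not my literal build lines:

    # CUDA code: target the lowest arch supported by the toolkit
    nvcc -arch=sm_30 -o mcx mcx.cu

    # OpenACC/nvc++ code: -static-nvidia links fine,
    # but adding -static fails to link against libcuda
    nvc++ -acc -static-nvidia -o ../bin/umcx umcx.cpp
    nvc++ -acc -static-nvidia -static -o ../bin/umcx umcx.cpp   # link error with libcuda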