I am using CUDA 9.0 with bazel. I get build errors when bazel is run with --jobs 24 (or any value above 8). The build fails with an error referring to /tmp/tmp_XXXX_XXXX, and it occurs randomly. Are there any restrictions on the number of concurrent nvcc instances?
I am not aware of any limit on the number of concurrently running nvcc instances, but it seems plausible that with 24 concurrent compilations your build process runs out of some resource. Have you monitored resource usage during these runs, in particular for the /tmp file system, which may be mapped to a partition with limited space?
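One way to check the /tmp theory is to log free space on that file system while the build is running. Below is a minimal Python sketch of that idea, not a definitive tool. The function names `tmp_usage` and `monitor` are made up for illustration, and it only looks at disk space, so it would not catch other exhausted resources such as memory or open file handles.

```python
import shutil
import tempfile
import time


def tmp_usage(path=None):
    """Return (used_fraction, free_bytes) for the file system holding `path`.

    If no path is given, use the directory Python picks for temporary files,
    which honors TMPDIR and otherwise falls back to a default such as /tmp.
    """
    if path is None:
        path = tempfile.gettempdir()
    usage = shutil.disk_usage(path)
    # Guard against pseudo file systems that report a total size of zero.
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return used_fraction, usage.free


def monitor(path=None, interval=1.0, samples=60):
    """Print one line per sample; run this in a second terminal during the build."""
    for _ in range(samples):
        used_fraction, free_bytes = tmp_usage(path)
        print(f"{time.strftime('%H:%M:%S')} "
              f"used={used_fraction:.1%} free={free_bytes / 2**20:.0f} MiB")
        time.sleep(interval)
```

Calling `monitor("/tmp", interval=1.0, samples=600)` while the 24-job build runs would show whether free space drops toward zero around the time the error appears. If it stays comfortably high, disk space on /tmp is probably not the cause.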
Can you provide a verbatim and complete copy of such an error message? From the information you provided, it is not even clear that the issue is directly related to nvcc; the temporary files mentioned could be used elsewhere in the bazel build system.
The error is something like: nvcc fatal can’t access /tmp/tmpxft_00000004_00000008-4_block_prefix_sum_test.cu.cpp1.ii. It occurs randomly, and the file name varies a lot. I have tried setting TMPDIR to different directories, but the error still happens.