Concurrent NVRTC PTX compilation and runtime linking

How should I go about compiling and linking multiple CUDA kernels concurrently at runtime with NVRTC and cuLinkAddData?

I’m sharing a single CUDA context between threads, and it seems to be serializing everything: one NVRTC compile-and-link takes about 70 ms, while 100 of them issued concurrently take 7.5 s on a 24-core system.
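To put numbers on why this looks serialized, here is a quick back-of-the-envelope check in Python. It is only a sketch: it assumes every job costs the same 70 ms and that ideal scaling means one job per core at a time. The function names are made up for illustration.

```python
import math

def serial_time_s(n_jobs, job_ms):
    """Wall time in seconds if every job runs one after another."""
    return n_jobs * job_ms / 1000.0

def ideal_parallel_time_s(n_jobs, job_ms, n_cores):
    """Wall time in seconds with perfect scaling (assumed: one job per core at a time)."""
    return math.ceil(n_jobs / n_cores) * job_ms / 1000.0

# Figures taken from the timings above.
n_jobs, job_ms, n_cores = 100, 70, 24
observed_s = 7.5

print("fully serial:  ", serial_time_s(n_jobs, job_ms), "s")
print("ideal parallel:", ideal_parallel_time_s(n_jobs, job_ms, n_cores), "s")
print("observed:      ", observed_s, "s")
```

The observed 7.5 s sits right next to the fully serial estimate (7.0 s) and far from the ideal parallel one (0.35 s), which is why I suspect something is taking a global lock.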

Creating a separate context for each thread seems to be a bad idea, even with only 10 threads.

I’m doing all of this through an FFI, so I may be doing something wrong on my side, but I wanted to ask first what the recommended way to set this up is.