How should I go about compiling and linking multiple CUDA kernels concurrently at runtime with NVRTC and cuLinkAddData?
I’m sharing a single context between threads, and it seems to be serializing everything: each NVRTC compilation plus link takes about 70 ms on its own, but 100 of them run concurrently take 7.5 s on a 24-core system.
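To put those numbers in perspective, here is a rough back-of-the-envelope check. It assumes each compile-and-link job is independent and costs a fixed 70 ms; the constant and function names are only for illustration. Fully serialized, 100 jobs would take about 7 s, which is close to the 7.5 s I measured, whereas ideal scaling across 24 cores would finish in roughly 350 ms.

```cpp
// Illustrative figures taken from the measurements above (assumed constant per job).
constexpr int kJobs = 100;      // concurrent compile+link requests
constexpr int kCores = 24;      // hardware cores available
constexpr int kPerJobMs = 70;   // time for one NVRTC compile + link

// Fully serialized: every job waits for the previous one to finish.
constexpr int serial_ms() { return kJobs * kPerJobMs; }

// Ideal parallel: jobs run in waves of kCores, each wave taking kPerJobMs.
constexpr int ideal_parallel_ms() {
    return ((kJobs + kCores - 1) / kCores) * kPerJobMs;
}

static_assert(serial_ms() == 7000, "serialized estimate: 7 s");
static_assert(ideal_parallel_ms() == 350, "ideal 24-core estimate: 0.35 s");
```

So the observed time matches the serialized estimate almost exactly, which is why I suspect a lock somewhere rather than ordinary contention.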
Creating a separate context for each thread seems to be a bad idea, even with only 10 threads.
I’m doing all this through an FFI, so I might be doing something wrong, but I wanted to ask first what I should be doing operationally.