Internal error: CASK: all shaders must have unique names

TensorRT 6.0 has been released with fixes for running multiple engines across threads. Please try the newest release and let us know whether the issue has been resolved.

The expected usage pattern for an application, however, is one ExecutionContext per CPU thread.
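As a rough sketch of that pattern (not from the original post; names such as runInference and the binding setup are placeholders), one ICudaEngine can be shared while each CPU thread owns its own IExecutionContext and CUDA stream:

// Minimal sketch: one shared engine, one execution context per CPU thread.
// Engine deserialization, buffer allocation, and error handling are omitted.
#include <thread>
#include <vector>
#include <cuda_runtime.h>
#include <NvInfer.h>

void runInference(nvinfer1::ICudaEngine* engine, void** bindings, int batchSize)
{
    // Each thread owns its own context and stream; the engine itself is shared.
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    context->enqueue(batchSize, bindings, stream, nullptr); // asynchronous launch (TensorRT 5.x API)
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    context->destroy();
}

// One worker per CPU thread, all sharing the same engine:
// std::vector<std::thread> workers;
// for (int i = 0; i < numThreads; ++i)
//     workers.emplace_back(runInference, engine, perThreadBindings[i], batchSize);
// for (auto& w : workers) w.join();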


It seems that, with multi-threading, some layers in one engine (especially convolutions) are blocking the other engine. Is this related to MPS or to something else?

Can you provide the tactic IDs of the layers you believe are misbehaving, taken from the verbose log? We can have an engineer look into this in more detail.

Hi, I just profiled with nvvp and saw this behavior; it happens for almost all convolutions.
Here is more detail about my program:

  1. Create multiple threads.
  2. Create a context and an engine in each thread.
  3. Run enqueue in each thread (see the sketch below).

BTW, I am using TensorRT 5.1.5. I don’t know how to get the tactic IDs from nvvp.
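For reference, here is a minimal sketch of the structure described in steps 1–3 (hypothetical names; the engine blob, bindings, and error handling are placeholders): each thread deserializes its own engine, creates its own context, and calls enqueue on its own stream.

// Sketch of the program structure above: engine and context created per thread.
#include <thread>
#include <cuda_runtime.h>
#include <NvInfer.h>

void worker(nvinfer1::IRuntime* runtime, const void* engineBlob, size_t engineSize,
            void** bindings, int batchSize)
{
    // Step 2: create the engine and context inside the thread.
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(engineBlob, engineSize, nullptr);
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Step 3: run enqueue in the thread (TensorRT 5.x asynchronous API).
    context->enqueue(batchSize, bindings, stream, nullptr);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    context->destroy();
    engine->destroy();
}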

There are known issues in 5.1 that were fixed in 6.0; we just need to verify that you are no longer seeing those, otherwise we need to investigate the specific problem you are seeing. You can get the information by adding --verbose --exportTimes=timing.txt --exportProfile=profile.txt > verbose_output.txt to your trtexec command with TensorRT 6.0.
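For example (assuming an ONNX model named model.onnx; substitute however you normally pass your model to trtexec, and note these flags require the TensorRT 6.0 build of trtexec), the full command could look like:

trtexec --onnx=model.onnx --verbose --exportTimes=timing.txt --exportProfile=profile.txt > verbose_output.txt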

I attached screenshots of the profiling results for processes running with TensorRT 5 and TensorRT 6. Each row is the work executed by one thread. You can see that there are gaps between kernel functions in each thread; this does not happen if I run only one thread at a time. Regarding exporting the profile, I don’t know how to use the arguments you provided. When I add the flags, it shows “unknown command line flag ‘exportTimes’” and “ERROR: unknown command line flag ‘verbose’”.

(Somehow I can’t upload images, so links are provided below.)

Hello,

Engineering ran tests with an engine consisting of a single simpleTopK kernel. Two instances of the engine were run, each on a different thread.

High occupancy → Low amount of overlap
When the SM occupancy is high, the amount of overlap between kernels is low. The attached image (nvprof_b256_total_duration.JPG) shows that out of 24.158 ms, only 58 µs are overlapped, i.e. only 0.24% of the execution overlaps.

The overlap basically happens only in the tail of the kernel execution.

Looking back at the visual profiler images, we see that the kernel has a grid size of 131,072 thread blocks. During execution, every SM takes n thread blocks per wave.

number_of_waves * kernel_wave_time = total_time

ceil(grid_size / (SMs * n)) * kernel_wave_time = total_time

ceil(131,072 / (80 * n)) * kernel_wave_time = total_time

With kernel_wave_time = 0.058 ms and total_time = 24.158 ms:

ceil(131,072 / (80 * n)) = 24.158 / 0.058 ≈ 416.5

which gives 131,072 / (80 * n) ≈ 416.5, i.e. n ≈ 3.9, so n = 4.

This means we have 131,072 / (4 * 80) = 409.6 waves, or 409 full waves and 1 tail, which is where the overlap happens.
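As a quick sanity check of that arithmetic (values copied from the numbers above; the variable names are just for illustration):

// Back-of-the-envelope check of the wave math above (values from the post).
#include <cmath>
#include <cstdio>

int main()
{
    const double gridSize    = 131072.0; // thread blocks in the kernel grid
    const double numSMs      = 80.0;     // SMs on the GPU used for the test
    const double waveTimeMs  = 0.058;    // kernel_wave_time (58 us)
    const double totalTimeMs = 24.158;   // total kernel execution time

    // ceil(grid_size / (SMs * n)) = total_time / kernel_wave_time  =>  solve for n
    const double wavesNeeded = totalTimeMs / waveTimeMs;                            // ~416.5
    const int    n           = (int)std::round(gridSize / (numSMs * wavesNeeded));  // ~4

    const double waves = gridSize / (n * numSMs);                                   // ~409.6
    std::printf("blocks per SM per wave n = %d, waves = %.1f\n", n, waves);
    return 0;
}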

Low occupancy → High amount of overlap

Extreme case
The next example shows the extreme case where the SMs are severely under-occupied. The overlap can happen anywhere, depending on when the other dependencies needed to launch the kernel are met (e.g., after the necessary H→D memcopies); see nvprof_b1_overlap.JPG.

Partial overlap
In the image below we can see that the grid size is 512. Using n = 4 as calculated in the first example, and 80 SMs, we can compute that there are 512 / (4 * 80) = 1.6 waves.

This means we can expect overlap starting about halfway through the execution. Indeed, calculating the overlap percentage, we see 67 µs of overlap out of a total of 136 µs of kernel execution time, which is roughly a 50% ratio.

(Note further that, as can be seen in the image, the second kernel appears to take a little longer. This is simply because, during the overlap period, the second kernel could only use 40% of the SMs.)
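The same kind of check for this partial-overlap case (numbers taken from the text above; purely illustrative arithmetic):

// Partial-overlap case: grid of 512 blocks, n = 4 blocks per SM per wave, 80 SMs.
#include <cstdio>

int main()
{
    const double gridSize  = 512.0;
    const double n         = 4.0;
    const double numSMs    = 80.0;
    const double overlapUs = 67.0;    // measured overlap
    const double totalUs   = 136.0;   // total kernel execution time

    const double waves = gridSize / (n * numSMs);      // 1.6 waves
    const double ratio = 100.0 * overlapUs / totalUs;  // ~49% overlap
    std::printf("waves = %.1f, overlap = %.0f%%\n", waves, ratio);
    return 0;
}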

See nvprof_b16_total_duration.JPG, nvprof_b16_overlap, and nvprof_b16_second kernel.
