Internal error: CASK: all shaders must have unique names

TensorRT 6.0 has been released with fixes for running multiple engines across threads. Please try the newest release and let us know whether the issue has been resolved.

The expected usage pattern for an application, however, is one ExecutionContext per CPU thread.
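As a rough sketch of that pattern (not from the original post; names such as runInference and the binding setup are placeholders), one ICudaEngine can be shared while each CPU thread owns its own IExecutionContext and CUDA stream:

// Minimal sketch: one shared engine, one execution context per CPU thread.
// Engine deserialization, buffer allocation, and error handling are omitted.
#include <thread>
#include <vector>
#include <cuda_runtime.h>
#include <NvInfer.h>

void runInference(nvinfer1::ICudaEngine* engine, void** bindings, int batchSize)
{
    // Each thread owns its own context and stream; the engine itself is shared.
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    context->enqueue(batchSize, bindings, stream, nullptr); // asynchronous launch (TensorRT 5.x API)
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    context->destroy();
}

// One worker per CPU thread, all sharing the same engine:
// std::vector<std::thread> workers;
// for (int i = 0; i < numThreads; ++i)
//     workers.emplace_back(runInference, engine, perThreadBindings[i], batchSize);
// for (auto& w : workers) w.join();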


It seems that, with multi-threading, some layers in one engine (especially convolutions) are blocking the other engine. Is this related to MPS or to something else?

Can you provide the tactic IDs of the layers you believe are misbehaving, taken from the verbose log? We can have an engineer look into this in more detail.

Hi, I just profiled with nvvp and saw this behavior; it happens for almost all convolutions.
Here is more detail about my program:

  1. Create multiple threads.
  2. Create a context and an engine in each thread.
  3. Run enqueue in each thread (see the sketch below).

BTW, I am using TensorRT 5.1.5. I don’t know how to get the tactic IDs from nvvp.
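For reference, here is a minimal sketch of the structure described in steps 1–3 (hypothetical names; the engine blob, bindings, and error handling are placeholders): each thread deserializes its own engine, creates its own context, and calls enqueue on its own stream.

// Sketch of the program structure above: engine and context created per thread.
#include <thread>
#include <cuda_runtime.h>
#include <NvInfer.h>

void worker(nvinfer1::IRuntime* runtime, const void* engineBlob, size_t engineSize,
            void** bindings, int batchSize)
{
    // Step 2: create the engine and context inside the thread.
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(engineBlob, engineSize, nullptr);
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Step 3: run enqueue in the thread (TensorRT 5.x asynchronous API).
    context->enqueue(batchSize, bindings, stream, nullptr);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    context->destroy();
    engine->destroy();
}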

There are known issues in 5.1 that were fixed in 6.0; we just need to verify that you are no longer seeing those, otherwise we need to investigate the specific problem you are seeing. You can get the information by adding --verbose --exportTimes=timing.txt --exportProfile=profile.txt > verbose_output.txt to your trtexec command with TensorRT 6.0.
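For example (assuming an ONNX model named model.onnx; substitute however you normally pass your model to trtexec, and note these flags require the TensorRT 6.0 build of trtexec), the full command could look like:

trtexec --onnx=model.onnx --verbose --exportTimes=timing.txt --exportProfile=profile.txt > verbose_output.txt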

I attached screenshots of the profiling results for processes running with TensorRT 5 and TensorRT 6. Each row is the work executed by one thread. You can see that there are gaps between kernel functions in each thread; this does not happen if I run only one thread at a time. Regarding exporting the profile, I don’t know how to use the arguments you provided. When I add the flags, it shows “unknown command line flag ‘exportTimes’” and “ERROR: unknown command line flag ‘verbose’”.

(Somehow I can’t upload images, so links are provided below.)

Hello,

Engineering ran tests with an engine consisting of a single simpleTopK kernel. Two instances of the engine were run, each on a different thread.

High occupancy → Low amount of overlap
When the SM occupancy is high, the amount of overlap between kernels is low. The attached image (nvprof_b256_total_duration.JPG) shows that out of 24.158 ms, only 58 µs are overlapped, i.e. only 0.24% of the execution overlaps.

The overlap basically happens only in the tail of the kernel execution.

Looking back at the visual profiler images, we see that the kernel has a grid size of 131,072 thread blocks. During execution, every SM takes n thread blocks per wave.

number_of_waves * kernel_wave_time = total_time

ceil(grid_size / (SMs * n)) * kernel_wave_time = total_time

ceil(131,072 / (80 * n)) * kernel_wave_time = total_time

With kernel_wave_time = 0.058 ms and total_time = 24.158 ms:

ceil(131,072 / (80 * n)) = 24.158 / 0.058 ≈ 416.5

which gives 131,072 / (80 * n) ≈ 416.5, i.e. n ≈ 3.9, so n = 4.

This means we have 131,072 / (4 * 80) = 409.6 waves, or 409 full waves and 1 tail, which is where the overlap happens.
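As a quick sanity check of that arithmetic (values copied from the numbers above; the variable names are just for illustration):

// Back-of-the-envelope check of the wave math above (values from the post).
#include <cmath>
#include <cstdio>

int main()
{
    const double gridSize    = 131072.0; // thread blocks in the kernel grid
    const double numSMs      = 80.0;     // SMs on the GPU used for the test
    const double waveTimeMs  = 0.058;    // kernel_wave_time (58 us)
    const double totalTimeMs = 24.158;   // total kernel execution time

    // ceil(grid_size / (SMs * n)) = total_time / kernel_wave_time  =>  solve for n
    const double wavesNeeded = totalTimeMs / waveTimeMs;                            // ~416.5
    const int    n           = (int)std::round(gridSize / (numSMs * wavesNeeded));  // ~4

    const double waves = gridSize / (n * numSMs);                                   // ~409.6
    std::printf("blocks per SM per wave n = %d, waves = %.1f\n", n, waves);
    return 0;
}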

Low occupancy → High amount of overlap

Extreme case
The next example shows the extreme case where the SMs are severely under-occupied. The overlap can happen anywhere, depending on when the other dependencies needed to launch the kernel are met (e.g., after the necessary H→D memcopies); see nvprof_b1_overlap.JPG.

Partial overlap
In the image below we can see that the grid size is 512. Using n = 4 as calculated in the first example, and 80 SMs, we can compute that there are 512 / (4 * 80) = 1.6 waves.

This means we can expect overlap starting about halfway through the execution. Indeed, calculating the overlap percentage, we see 67 µs of overlap out of a total of 136 µs of kernel execution time, which is roughly a 50% ratio.

(Note further that, as can be seen in the image, the second kernel appears to take a little longer. This is simply because, during the overlap period, the second kernel could only use 40% of the SMs.)
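The same kind of check for this partial-overlap case (numbers taken from the text above; purely illustrative arithmetic):

// Partial-overlap case: grid of 512 blocks, n = 4 blocks per SM per wave, 80 SMs.
#include <cstdio>

int main()
{
    const double gridSize  = 512.0;
    const double n         = 4.0;
    const double numSMs    = 80.0;
    const double overlapUs = 67.0;    // measured overlap
    const double totalUs   = 136.0;   // total kernel execution time

    const double waves = gridSize / (n * numSMs);      // 1.6 waves
    const double ratio = 100.0 * overlapUs / totalUs;  // ~49% overlap
    std::printf("waves = %.1f, overlap = %.0f%%\n", waves, ratio);
    return 0;
}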

See nvprof_b16_total_duration.JPG, nvprof_b16_overlap, and nvprof_b16_second kernel.
