CUDA cores vs Tensor Cores

Hi,
Is there a way to force the execution of layer/s on the “regular” CUDA cores and not Tensor Cores?
The reason I’m asking is I have 2 networks and I’d like to try to run one on the CUDA cores and the other on the Tensor cores, hopefully in parallel.
Does that make sense? doable?

thanks
Eyal

Hi,

This is the low-level scheduling issue.
A better way is to let CUDA driver handle this since it is related to low-level resource allocation.

Thanks.

Thanks. But what happens if it does not do it?
I run the network/single convolution layer under nvprof and see that all the functions are using the volta_h844_xxx functions, which means that the tensor cores were used, right?
Now I would like to run another network in a different stream/thread and force it to use the regular CUDA cores, since the profiler seems to indicate it would also run on the already occupied tensor cores, I’d loose a potential overlap and additional speed boost, no?

thanks
Eyal

I have done some manual tests in order to see what is possible, please see the code below.

Blockquote
std::thread trt(& {
for (int i = 0; i < m_timing_loop_count; i++)
{
execution_engine->enqueue(m_batch_size, buffers, stream, nullptr);
}
cudaStreamSynchronize(stream);
});
std::thread blas(&{
cublasSetStream(m_blas_handle, m_blas_stream);
for (int i = 0; i < 20; i++) {
status = cublasSgemm(m_blas_handle, CUBLAS_OP_N, CUBLAS_OP_N, BLAS_N, BLAS_N, BLAS_N, &m_blas_alpha, m_blas_d_A,
BLAS_N, m_blas_d_B, BLAS_N, &m_blas_beta, m_blas_d_C, BLAS_N);
m_blas_alpha += (i * 1.f) / 5.f;
m_blas_beta += (i * 1.f) * 1.24f;
dummy += i * 10;
}
cudaStreamSynchronize(m_blas_stream);
});
trt.join();
blas.join();
cudaDeviceSynchronize();

This is what nvprof shows:

Blockquote
==28595== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities:
46.84% 104.04ms 20 5.2018ms 5.1443ms 5.2399ms volta_sgemm_128x128_nn
32.18% 71.468ms 100 714.68us 690.45us 839.03us trt_volta_h884cudnn_256x64_ldg8_relu_exp_small_nhwc_tn_v1
10.96% 24.345ms 100 243.45us 237.51us 332.71us void cuInt8::nchwTonhwc<float, int=32, int=32, int=2>(float const , __half, int, int, int, int, int, int, int, int)

So from what I understand, I am using the Tensor cores for TRT (trt_volta_h884cudnn…) and regular CUDA cores for BLAS (volta_sgemm_128x128_nn).

However the code finishes after 200-250ms, meaning it didn’t run concurrently. Replacing the BLAS code with a simple vector_add custom kernel, yields the same results - i.e. does not run concurrently.
I was able to get concurrency when the custom vector_add kernel was using a <<<1, 64>>> configuration but obviously this is not helping…

So if I understand correctly, while the Tensor cores are busy, the “regular” CUDA cores sit idle instead of doing some work. Is that true? anyway I can go around this? It seems to me that part/half of the hardware is not utilized while using the Tensor cores…

Any insights is more than welcomed.

thanks
Eyal

Hi,

CUDA cores is more like computing unit, there is no separate scheduler on it.
So if a job is waiting, the scheduler will assign to an idle unit.

It looks like your observation is something different.
Let us check your sample first and update more information with you later.

Thanks.

1 Like

Hi,
Thanks a lot! I’d be very happy to have a feedback on this matter. Hopefully this will yield a great boost to our algorithm.

thanks
Eyal

Hi,

To reproduce this issue more efficiently, would you mind to share a complete source with us?
Thanks.

Hi,
Attached is a tar file showing this issue.
You should run it like this: reset; ./a.out 1 64 512 512 64 7 7 1 1 0 1 0
It runs a 7x7 convolution via TRT and a simple test CUDA kernel.

There are 4 modes, please change the test_type variable in concurrentTest.cu line 297 and compile and run each time.

Those are the numbers I get on my Xavier:

EConvolutionOnly:
Total host : [7270.08 ms]
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 86.46% 6.31876s 500 12.638ms 12.503ms 16.393ms trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_medium_nhwc_tn_v1
7.27% 531.61ms 500 1.0632ms 1.0351ms 1.6980ms void cuInt8::nchwTonhwc<float, int=32, int=32, int=2>(float const , __half, int, int, int, int, int, int, int, int)
6.27% 458.11ms 500 916.23us 893.64us 1.5226ms void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const , float, int, int, int, int, int, int)

ECUDAOnly:
Total host : [6625.44 ms]
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 6.61508s 20 330.75ms 330.36ms 332.04ms kernel(float*, int)
API calls: 99.98% 6.61412s 4 1.65353s 20.928us 6.61399s cudaDeviceSynchronize

EConvolutionFollowedByCUDA:
Total host : [13989 ms]
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 47.34% 6.63279s 20 331.64ms 330.48ms 337.87ms kernel(float*, int)
45.48% 6.37232s 500 12.745ms 12.500ms 20.314ms trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_medium_nhwc_tn_v1
3.84% 538.37ms 500 1.0767ms 1.0350ms 2.0958ms void cuInt8::nchwTonhwc<float, int=32, int=32, int=2>(float const , __half, int, int, int, int, int, int, int, int)
3.34% 468.66ms 500 937.32us 902.31us 1.9592ms void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const , float, int, int, int, int, int, int)

EConvolutionAndCUDAConcurrently:
Total host : [14023.5 ms]
GPU activities: 47.71% 6.61691s 20 330.85ms 330.41ms 332.79ms kernel(float*, int)
45.20% 6.26890s 500 12.538ms 12.504ms 12.583ms trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_medium_nhwc_tn_v1
3.80% 526.53ms 500 1.0531ms 1.0293ms 1.1323ms void cuInt8::nchwTonhwc<float, int=32, int=32, int=2>(float const , __half, int, int, int, int, int, int, int, int)
3.29% 456.95ms 500 913.89us 896.17us 967.62us void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const , float, int, int, int, int, int, int)

concurrentTest_tar.txt (30 KB)

Thanks
Eyal

Hi,

Thanks for the sample.

We can compile and execute it in our environment now.
Will update more information with you once we got a progress.

Thanks.

Hi,
Any update on this issue? This issue is quite crucial for us in order to gain more performance.

Thanks
Eyal

Hi,

We check your code in detail.
It looks like that TensorRT already occupied all the GPU resource (Tensor Core and CUDA core) so the blas kernel need to wait for the resource.
The sequential execution is caused by the limited resource rather than a bug or issue.

We check the GPU utilization via following

EConcurrentTestType test_type = EConvolutionOnly;
$ sudo tegrastats

RAM 2714/7771MB … GR3D_FREQ 99%@905
RAM 2714/7771MB … GR3D_FREQ 99%@905
RAM 2714/7771MB … GR3D_FREQ 99%@905

Thanks.

Hi,
I’m not sure I follow. I’m not sure the tegrastats for this matter can indicate if there’s a problem or not - even if we see 99%.
Because:

  • If this was true then the trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_medium_nhwc_tn_v1 function internal implementation would use both CUDA cores and Tensor cores to the max? It seems a bit weird.
  • If you compile the code with test_type = ECUDAOnly you’d also get 99% in tegrastats. This code does not use the Tensor cores for sure, and still yields 99%.

Are you sure both Tensor and CUDA cores are being used in the EConvolutionOnly scenario? if so, than there’s no more room for performance improvements there?

thanks
Eyal

Hi,
I would appreciate if you could have another look at this. Attached is the same code with minor changes.
I’ve ran it like this: ./a.out 1 64 64 64 16 7 7 1 1 0 1 0
So effectively 64 times less Tensor core work - I’d imagine the load would not occupy the GPU and will allow for both types of cores to work at the same time.
Tegrastats still shows 99% for any type of workload.

Those are the numbers I get (for reset; ./a.out 1 64 64 64 16 7 7 1 1 0 1 0 )

EConvolutionOnly
Total host : [4327.06 ms]

ECUDAOnly
Total host : [6895.48 ms]

EConvolutionAndCUDAConcurrently
Total host : [14603 ms]

It looks like things still run serially even for a much smaller load.

Furthermore, if I run the following code, which only utilizes one thread and one block, tegrastats still shows 99% utilization, making it a somewhat problematic tool to rely on. Unless I miss here something, this still indicates that the CUDA cores and Tensor cores do not run at the same time.

Blockquote
global void kernel(float *x, int n)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
printf("%d %d\n", n, blockDim.x * gridDim.x);
for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
x[i] = sqrt(pow(3.14159,i));
}
}
float *m_dummy_data;
cudaMalloc(reinterpret_cast<void **>(&m_dummy_data), N * sizeof(m_dummy_data[0]));
for (int i = 0; i < 50; i++)
{
cout << “[” << i << “/50]” << endl;
kernel<<<1, 1>>>(m_dummy_data, N);
cudaDeviceSynchronize();
}

concurrentTest_cu.txt (16.2 KB)

Hi,

Sure.
We will give it a check and update more information with you later.

Thanks.

Hi,
I ran the above test ( ./a.out 1 64 64 64 16 7 7 1 1 0 1 0) using nvprof -f -o out.nvvp on Xavier and then opened it with nvvp on my desktop (CUDA 10.0). It seems very evident that the CUDA kernel did not run concurrently with the tensor code.

@AastaLLL please your advise.

Attached is the screen shot

Hi,

Please check the 61.8% kernel... bar, you can find the GPU is fully occupied between 2s to 9s.
Due to this, the other steam(#51) need to wait for the resource until 9-th second.

More, both tensor core and gpu core are used to finished the job attached in stream #14.

Please noticed that the 61.8% doesn’t indicates the GPU utilization.
It’s the compute ratio that an specified API occupied.

Thanks.