Latest CUDA Programming and Performance topics

Topic	Tags	Replies	Views	Activity
Using GPUdirect for video with Mellanox ConnectX		1	352	April 14, 2024
CUDA Warp primitive behaviour question		2	145	April 13, 2024
cuStreamWaitValue32 and cuStreamWriteValue32 blocking issue		8	154	April 12, 2024
Can threads from different warps access shared memory at the same time?		3	105	April 12, 2024
Dual RTX 4090 with distributed training	cuda, pytorch, deep-learning	2	197	April 12, 2024
Fast Implementation of (Small-)Table Lookup	cuda, kernel	13	279	April 12, 2024
Optimizing for many concurrent kernels		1	120	April 12, 2024
Processing image with a CUDA kernel gives me different result than a seemingly equivalent CPU function	opencv, cuda	25	748	April 12, 2024
[CUDA8.0 BUG?] Child process forked after cuInit() get CUDA_ERROR_NOT_INITIALIZED on cuInit()		7	4168	April 12, 2024
Local memory management		8	140	April 12, 2024
CUDA-context consume more GPU memory in ChildProcess(start by execl) than in ParentProcess(eg. 186MB more than 108MB) Why?		6	143	April 12, 2024
Creating texture objects globally and update the memory allocated each time when there is a change in the data	cuda	1	86	April 11, 2024
Invalid configuration argument for one kernel but works for another		3	96	April 11, 2024
Need example to disable nvlink		10	3947	April 11, 2024
How to test if tensor cores are working? (CMP 100-210)		13	225	April 11, 2024
Question the time cost of a blank kernel	cuda, kernel	3	208	April 11, 2024
The order of CTA execution		5	206	April 11, 2024
Nsight compute fail to profile L20 gpu		7	198	April 11, 2024
The configuration of GPU Time-Slice on Kubernetes	gpu, kubernetes	1	126	April 11, 2024
Does runtime API will call drive API?		2	98	April 11, 2024
Second cuCtxCreate() returns CUDA_ERROR_LAUNCH_FAILED with A2 GPU		3	154	April 10, 2024
How to understand the following asm?	cuda, kernel	5	159	April 10, 2024
Compilation Issues with CUDA 11.5 and GCC 11 on Ubuntu 22.04 - Need help		4	120	April 9, 2024
What are allreduce and bidirection bandwidth?		1	99	April 9, 2024
Shared memory dims and layout of matrix tiles loaded in	cuda	1	163	April 8, 2024
Two device pointers pointing out same memory address deallocation problem	cuda	1	233	April 8, 2024
SGEMM and SGEMV - large performance difference in cuBLAS		1	93	April 7, 2024
Overlapping CUDA Cores and Tensor Cores	kernel	2	114	April 7, 2024
16-bit vs 32-bit Integer Arithmetic Performance	cuda	2	118	April 7, 2024
Global memory access patterns - too slow	cuda, performance	6	250	April 7, 2024

