Latest CUDA Programming and Performance topics

Topic	Tags	Replies	Views	Activity
How to understand the following asm?	cuda, kernel	5	160	April 10, 2024
Compilation Issues with CUDA 11.5 and GCC 11 on Ubuntu 22.04 - Need help		4	127	April 9, 2024
What are allreduce and bidirection bandwidth?		1	107	April 9, 2024
Shared memory dims and layout of matrix tiles loaded in	cuda	1	167	April 8, 2024
Two device pointers pointing out same memory address deallocation problem	cuda	1	237	April 8, 2024
SGEMM and SGEMV - large performance difference in cuBLAS		1	94	April 7, 2024
Overlapping CUDA Cores and Tensor Cores	kernel	2	119	April 7, 2024
16-bit vs 32-bit Integer Arithmetic Performance	cuda	3	124	April 21, 2024
Global memory access patterns - too slow	cuda, performance	6	258	April 7, 2024
Reuse of L1/shared memory during execution of consecutive wavefronts		2	154	April 7, 2024
cuMemcpyHtoD CUDA ERROR INVALID VALUE	cuda, debugging-and-troubleshooting	4	113	April 6, 2024
decision tree classifier in CUDA.. some doubts		8	2416	April 6, 2024
Cudamemcpy for different datatypes	cuda	1	109	April 6, 2024
Ptxas slow	cuda, kernel	34	639	April 5, 2024
Performance drop after specifying CUDA_VISIBLE_DEVICES=0	cuda	6	152	April 5, 2024
Grid size limit of concurrent kernels		5	228	April 5, 2024
Undocumented PTX instruction `fma.rn.f16`		3	105	April 5, 2024
compilation of device_launch_parameters.h and curand_kernel.h together produces errors related to C+		3	3109	April 5, 2024
Use vector load data from global mem to shm	kernel	1	110	April 5, 2024
Are persistent kernels supported (now and in the future)?		11	192	April 4, 2024
Solving a Linear System of Equation with Very Large Sparse Coefficient Matrix Using SVD		0	91	April 4, 2024
Is it valid to concurrently read and write to disjoint segments of a single buffer allocated via cudaMallocHost	cuda	5	127	April 3, 2024
Std::cuda::atomic::load() deadlock	cuda	1	130	April 3, 2024
What happens when no arch flags passed by CMAKE		3	133	April 3, 2024
Kernel template user defined argument deduction guide		0	83	April 3, 2024
Launching multiple kernels in same context vs multiple kernels		5	3840	April 3, 2024
Using float4		5	7284	April 3, 2024
Solving `Ax=b` using pseudoinverse inside a cuda thread		6	161	April 3, 2024
GH200 Cuda not available on pytorch		4	182	April 2, 2024
DRAM Excessive Read Sectors		2	221	February 8, 2024

