Shared Memory vs. Device Memory: device memory gives better result :fear:

Neha_Patil · April 15, 2007, 8:16am

Hi all,

I’m a newbie in CUDA programming. I have just started with mean filtering of a 512 x 512 image. I understand that shared memory gives low latency (high bandwidth) and that global device memory gives high latency. I have a problem with the use of device memory and shared memory. I have two versions of the mean filtering kernel: one using device memory and the other using shared memory. The one using device memory seems to take less time than the one using shared memory. This looks really weird. Maybe the code using shared memory is not optimized, but I’m not able to find ways to optimize it. :(

I’m attaching the code: the device memory version and the shared memory version, which use different kernels (median_kernel_old, median_kernel). Could any of you look into the code and point out the flaws in my programming? I’ve only just been introduced to CUDA programming, so it’d be of great help. :D

Thanks
Neha
mean.txt (9.6 KB)
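
For readers without the attachment, here is a minimal sketch of what a shared-memory mean filter can look like. It is not the code from mean.txt: the 3 x 3 window, the 16 x 16 tile, the 8-bit grayscale layout, the border clamping, and the names mean3x3_shared and count_mismatches are all assumptions made for illustration. Each block loads its tile plus a one-pixel halo into shared memory once, so every input pixel is read from device memory roughly once per block instead of up to nine times.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16   // threads per block edge (assumed, not from the thread)
#define R 1       // filter radius: 3 x 3 window (assumed)

__global__ void mean3x3_shared(const unsigned char* in, unsigned char* out,
                               int w, int h)
{
    // Tile plus a halo of R pixels on every side.
    __shared__ unsigned char tile[TILE + 2 * R][TILE + 2 * R];

    int x0 = blockIdx.x * TILE - R;   // image coordinates of tile[0][0]
    int y0 = blockIdx.y * TILE - R;

    // Cooperative load: threads stride over the tile, clamping at the borders.
    for (int ty = threadIdx.y; ty < TILE + 2 * R; ty += TILE)
        for (int tx = threadIdx.x; tx < TILE + 2 * R; tx += TILE) {
            int gx = min(max(x0 + tx, 0), w - 1);
            int gy = min(max(y0 + ty, 0), h - 1);
            tile[ty][tx] = in[gy * w + gx];
        }
    __syncthreads();   // the whole tile must be loaded before anyone reads it

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= w || y >= h) return;

    int sum = 0;
    for (int dy = 0; dy <= 2 * R; ++dy)
        for (int dx = 0; dx <= 2 * R; ++dx)
            sum += tile[threadIdx.y + dy][threadIdx.x + dx];
    out[y * w + x] = (unsigned char)(sum / ((2 * R + 1) * (2 * R + 1)));
}

// Filters a constant image and counts output pixels that differ from the input
// value. The mean of a constant image is that constant, so 0 is expected.
int count_mismatches(int w, int h, unsigned char value)
{
    size_t bytes = (size_t)w * h;
    std::vector<unsigned char> host(bytes, value);
    unsigned char *d_in = 0, *d_out = 0;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    mean3x3_shared<<<grid, block>>>(d_in, d_out, w, h);
    cudaMemcpy(host.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    int bad = 0;
    for (size_t i = 0; i < bytes; ++i)
        if (host[i] != value) ++bad;
    cudaFree(d_in);
    cudaFree(d_out);
    return bad;
}

int main()
{
    printf("mismatching pixels: %d\n", count_mismatches(512, 512, 90));
    return 0;
}
```

With a window this small the gain over a plain device-memory kernel may be modest, so it is worth measuring both versions rather than assuming the shared-memory one wins.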

Stewie · April 16, 2007, 3:12am

For one, the switch(tx) causes the threads to diverge; all the threads run all the cases using predication.
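
Since the attached kernel is not shown in the thread, the following is only a hypothetical illustration of the point about switch(tx). The kernels scale_switch and scale_uniform and the helper compare_kernels are invented names, and the per-thread scaling is a stand-in for whatever the real cases do. The first kernel branches on the thread index, so threads of one warp take different paths; the second computes the same value arithmetically, so all threads execute one instruction stream.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Branches on the thread index: threads in the same warp take different cases.
__global__ void scale_switch(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    switch (threadIdx.x % 4) {
        case 0:  out[i] = in[i] * 1.0f; break;
        case 1:  out[i] = in[i] * 2.0f; break;
        case 2:  out[i] = in[i] * 3.0f; break;
        default: out[i] = in[i] * 4.0f; break;
    }
}

// Same result, but the factor is computed from the index, with no branching.
__global__ void scale_uniform(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = in[i] * (float)(threadIdx.x % 4 + 1);
}

// Runs both kernels on the same input and counts elements that differ.
int compare_kernels(int n)
{
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = 0, *d_a = 0, *d_b = 0;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_switch<<<blocks, threads>>>(d_in, d_a, n);
    scale_uniform<<<blocks, threads>>>(d_in, d_b, n);
    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);

    int diff = 0;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_b[i]) ++diff;
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    return diff;
}

int main()
{
    printf("differing elements: %d\n", compare_kernels(1024));
    return 0;
}
```

A compiler may already turn a switch this simple into predicated or branch-free code, so the sketch shows the pattern to look for, not a guaranteed speedup.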

Neha_Patil · April 16, 2007, 3:50am

Thanks for the reply, Stewie. I found out that even the version without shared memory runs far slower than a program that runs on the CPU. So I commented out the entire kernel body (of median_kernel_old) and executed the program. It still takes 90% of the time taken with the original code. That is, an empty kernel, when called with the following grid and block sizes, takes 1300 ms to run. It doesn’t make much sense to me. Not sure what I’m missing. :huh:

Grid dim = 64 X 64

Block dim = 8 X 8

Thanks

Neha
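
One thing that may be worth ruling out (the thread does not confirm it) is that the measured time includes one-time CUDA initialization, since the first runtime call in a process creates the context. A minimal sketch of timing only the kernel, using the grid and block sizes quoted above, could look like this. The names empty_kernel and time_empty_kernel_ms are made up for the example, and the calls are from the current runtime API, which differs in places from the 2007 toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

// Returns the elapsed time in milliseconds of one empty-kernel launch,
// measured with CUDA events after a warm-up launch.
float time_empty_kernel_ms()
{
    dim3 grid(64, 64);   // grid and block sizes taken from the post above
    dim3 block(8, 8);

    cudaFree(0);                        // forces context creation up front
    empty_kernel<<<grid, block>>>();    // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    empty_kernel<<<grid, block>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);         // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    printf("empty kernel: %.3f ms\n", time_empty_kernel_ms());
    return 0;
}
```

If the number reported here is far below 1300 ms, the remaining time is being spent outside the kernel, for example in initialization, allocation, or host-device copies.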

sicb0161 · April 16, 2007, 8:28am

Hi,

You should definitely use a bigger block size, e.g. 16 x 16 (Programming Guide, section 6.2). I don’t know if you allocate shared memory, but this might be a bottleneck if you are using small block dimensions.

But it is unusual for an empty kernel to take about 1 sec to execute. Are you sure that you do not allocate memory?

cem
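
To make the block-size arithmetic concrete: an 8 x 8 block holds 64 threads, which is 2 warps of 32, and needs a 64 x 64 grid to cover a 512 x 512 image; a 16 x 16 block holds 256 threads (8 warps) and needs a 32 x 32 grid. The sketch below computes this on the host. The helpers ceil_div and grid_for are hypothetical names, and one thread per pixel is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Integer division rounded up, so partial tiles at the edges get a block too.
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Grid dimensions needed to cover a width x height image with one thread per
// pixel (assumed mapping).
static dim3 grid_for(int width, int height, dim3 block)
{
    return dim3(ceil_div(width, (int)block.x), ceil_div(height, (int)block.y));
}

static void report(int width, int height, dim3 block)
{
    dim3 grid = grid_for(width, height, block);
    unsigned threads = block.x * block.y;
    printf("block %ux%u: %u threads (%u warps), grid %ux%u\n",
           block.x, block.y, threads, (threads + 31) / 32, grid.x, grid.y);
}

int main()
{
    report(512, 512, dim3(8, 8));     // the configuration from the thread
    report(512, 512, dim3(16, 16));   // the suggested configuration
    return 0;
}
```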
