Shared Memory vs. Device Memory: device memory gives better results :fear:

Hi all,

I’m a newbie in CUDA programming. I have just started with mean filtering of a 512 × 512 image. I understand that shared memory offers low latency (high bandwidth) while global device memory has high latency. I have run into a problem comparing the two: I have two versions of the mean filtering kernel, one using device memory and the other using shared memory, and the one using device memory seems to take less time than the one using shared memory. This looks really weird. Maybe the shared memory code is not optimized, but I’m not able to find ways to optimize it. :(

I’m attaching the code, with the device memory version and the shared memory version as separate kernels (median_kernel_old and median_kernel). Can any of you look into these and point out the flaws in my programming, as I’ve only just been introduced to CUDA? It’d be of great help. :D

mean.txt (9.6 KB)

For one, the switch(tx) causes the threads to diverge; all the threads run all the cases using predication.
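A minimal sketch of the pattern being described (not the attached code; kernel and variable names here are made up for illustration). A switch over the thread index gives every thread a different path, so the warp predicates through all the cases; folding the index into the address instead keeps every thread on the same instruction stream:

```cuda
// Diverging version: each case is a separate path, so a warp effectively
// steps through every case, with most threads predicated off.
__global__ void diverging_copy(const float *src, float *dst)
{
    int tx = threadIdx.x;
    float v = 0.0f;
    switch (tx) {
        case 0: v = src[0]; break;
        case 1: v = src[1]; break;
        case 2: v = src[2]; break;
        /* ... one case per thread ... */
    }
    dst[tx] = v;
}

// Uniform version: same result, but the thread index goes into the
// address calculation, not the control flow, so nothing diverges.
__global__ void uniform_copy(const float *src, float *dst)
{
    int tx = threadIdx.x;
    dst[tx] = src[tx];
}
```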

Thanks for the reply, stewie. I found that even the version without shared memory runs much slower than a program that runs on the CPU. So I commented out the entire kernel body (of median_kernel_old) and ran the program again. It still takes 90% of the original time. That is, an empty kernel launched with the following grid and block sizes takes 1300 ms to run. That doesn’t make much sense to me. Not sure what I’m missing :huh:

Grid dim = 64 X 64

Block dim = 8 X 8
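One thing worth ruling out is measurement overhead: the very first CUDA call in a process pays for context creation, which can easily account for hundreds of milliseconds. A hedged sketch of how the empty-kernel launch could be timed with a warm-up launch and CUDA events (the empty kernel here is a stand-in, not the attached code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main()
{
    dim3 grid(64, 64), block(8, 8);   // the launch configuration from the post

    // Warm-up launch: absorbs one-time context creation and module load
    // overhead so it is not counted against the kernel itself.
    empty_kernel<<<grid, block>>>();
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    empty_kernel<<<grid, block>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);       // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("empty kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

If the timed number drops to a fraction of a millisecond after the warm-up, the 1300 ms was startup cost, not launch cost.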

You should definitely use a bigger block size, e.g. 16 × 16 (Programming Guide 6.2). I don’t know whether you allocate shared memory, but that can become a bottleneck if you are using small block dimensions.
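For a 512 × 512 image with one thread per pixel, that suggestion works out to something like the following (illustrative only; the kernel and argument names are placeholders):

```cuda
dim3 block(16, 16);                       // 256 threads per block
dim3 grid(512 / block.x, 512 / block.y);  // 32 x 32 blocks cover the 512 x 512 image
mean_kernel<<<grid, block>>>(d_src, d_dst, 512, 512);
```

With 8 × 8 blocks you get four times as many blocks of only 64 threads each, which adds per-block overhead and limits how well each multiprocessor can hide latency.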

But it is unusual for an empty kernel to take about a second to execute. Are you sure that you are not allocating memory?