For multiplies, filters, etc., is it better to design with sizes that are a power of 2, rather than round numbers such as 100 or 20, for performance/efficiency reasons?
Also, is there a 1D convolution filter in a CUDA library?
If you intend to use FFT: yes.
Otherwise: potentially. Depends a lot on what you intend to do…
That is a vague question.
Do you mean, would it make a difference if an image was 1024 x 1024 vs. 1000 x 1000, assuming you had the choice to determine those dimensions?
In that case, if you were writing your own filter kernel, it may make the code easier to write, since 32 (the warp size) divides evenly into 1024.
It also may make it slightly faster since there is no remainder, but that difference would be very small.
Probably the best way to look at it is to try to have the workload/array size be divisible by at least 32, or by a larger power of two. If you are using commercial or open-source libraries, it is probably not worth worrying about.
I looked through some example code, and it seems block/thread sizes, address alignment, etc. often use power-of-2 values.
Also, is there any example or guide on parallelizing multiple nested loops on the GPU?
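For the nested-loop question, the common pattern is to map each loop level to one grid dimension, so one thread handles one iteration of the inner body. A minimal sketch (kernel name and scaling operation are made up for illustration):

```cuda
// The doubly nested loop
//
//   for (int y = 0; y < H; ++y)
//       for (int x = 0; x < W; ++x)
//           out[y * W + x] = 2.0f * in[y * W + x];
//
// becomes one thread per (x, y) pair:
__global__ void scale2d(const float *in, float *out, int W, int H)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H)   // guard: grid may overshoot non-multiple sizes
        out[y * W + x] = 2.0f * in[y * W + x];
}
```

The launch then uses a 2D grid sized by ceiling division so the whole image is covered, e.g. `scale2d<<<grid, block>>>(d_in, d_out, W, H);`. Deeper nests can either use the third grid dimension or fold extra loops into a flat index that the kernel decomposes.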
blockDim (threads/block) -> multiple of 32 (the warp size); prefer 256 or 512 (hardware limit is 1024 threads per block)
blockDim.x : blockDim.y -> 32:8 is better than 16:16
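A short sketch of why the 32:8 shape tends to win (sizes and kernel name are hypothetical): with `blockDim.x = 32`, each warp covers one contiguous 32-element row segment, so global memory loads coalesce into full transactions, whereas a 16 x 16 block splits every warp across two rows:

```cuda
dim3 block(32, 8);   // 256 threads; x spans a full warp along the row
dim3 grid((1024 + block.x - 1) / block.x,   // ceiling division in x
          (1024 + block.y - 1) / block.y);  // and in y
// myKernel<<<grid, block>>>(d_in, d_out, 1024, 1024);
```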