I want to optimize some C code to make it faster, which is why I want to translate it into CUDA. However, I am new to CUDA programming, so I am looking for help (consulting, proposals, suggestions; any help will be useful to me).
Here is my C code.
I took a quick look and there are a couple of issues:
You are using 64-bit doubles, which means you will want a GPU with good 64-bit (double-precision) throughput. That means a Titan Black, Tesla K40, or Quadro K6000.
The end result will most likely be a hybrid CPU/GPU implementation, because there are serial dependencies in sections such as this:
for (k = 1; k <= N; k++) {
    total = 0;
    for (m = 0; m < M; m++) {
        alpha[k][m] = alpha[k-1][to[m]] * gamma[k-1][to[m]] +
                      alpha[k-1][to[m]] * gamma[k-1][to[m]];
        total += alpha[k][m];
    }
}
The memory access reads look like they will not be coalesced (the to[m] indirection produces scattered loads), so that will be another issue.
The problem space in your example does not really seem too large (at least that was my initial impression), so there may not be a huge advantage in porting to CUDA.
If you do not really need 64-bit doubles for the calculations, then you can move to a faster, cheaper GPU such as the GTX 780 Ti ($500) or the GTX 980 ($600, if you can find one).
Because of the serial dependencies, you will also be best off using a high-clock-speed CPU, since the k loop will have to be done there.
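As a rough sketch of what the hybrid structure could look like (the names N, M, to, and the flattened row-major layout are my assumptions, since the full code was not posted): the k loop stays on the host, and each k step launches a kernel in which one thread computes one alpha[k][m]. The per-step total would come from a separate device-side reduction (e.g. cublasDasum or thrust::reduce) rather than the scalar accumulation in the original loop.

```cuda
// One thread per state m; computes row k of alpha from row k-1.
// alpha_prev / gamma_prev / alpha_cur are rows of flattened (N+1) x M
// device arrays; to is the index table. All names are illustrative.
__global__ void alpha_step(const double *alpha_prev,
                           const double *gamma_prev,
                           const int    *to,
                           double       *alpha_cur,
                           int           M)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m < M) {
        int j = to[m];                    // gathered (uncoalesced) read
        alpha_cur[m] = alpha_prev[j] * gamma_prev[j] +
                       alpha_prev[j] * gamma_prev[j];
    }
}

// Host side: the serial k loop remains on the CPU.
void run_alpha(double *d_alpha, double *d_gamma, int *d_to, int N, int M)
{
    int threads = 256;
    int blocks  = (M + threads - 1) / threads;
    for (int k = 1; k <= N; k++) {
        alpha_step<<<blocks, threads>>>(d_alpha + (k - 1) * M,
                                        d_gamma + (k - 1) * M,
                                        d_to,
                                        d_alpha + k * M,
                                        M);
        // total for row k: do a device reduction here (cublasDasum,
        // thrust::reduce) instead of copying the row back each step.
    }
}
```

Launches on the default stream execute in order, so row k is only computed after row k-1 is finished, which preserves the serial dependency without explicit synchronization between steps.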
Thank you for your answer.
If I understand your reasoning correctly, it is difficult to express this code in CUDA.
But I must do this work as part of a study project.
I knew from the very beginning that it would be a hybrid of C and CUDA, but I absolutely must translate alpha, beta, and gamma to CUDA.
Here are the features of my GPU.
Could you translate the alpha part so I can understand the concept, or otherwise explain to me what is convertible and what is not?