Matrix Row Sum in JCuda

Hi, here is my program that I'm trying to run to get the matrix row sum, but the resulting sum is 0. I have tried the matrix row sum in Visual C, and the C program works fine.

The code in Java is:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemFree;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoD;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoad;
import jcuda.runtime.JCuda;

/**
*
*
*/
public class MtrixRowSum {

/**
 * @param args the command line arguments
 */
public static void main(String[] args) {

int M = 4, N = 4, P = 16;

float[][] scores_h = new float[M][N];
float[] a = new float[] {(float)1.35};
int[][] first = new int[M][N];

int[] sum = new int[M*N*4];

int i, j;
//input in host array
for (i = 0; i<M; i++)
{
for (j = 0; j<N; j++)
{
scores_h[i][j] = 1;

}
}
//load the function
cuInit(0);
CUcontext pctx = new CUcontext();
CUdevice dev = new CUdevice();
cuDeviceGet(dev, 0);
cuCtxCreate(pctx, 0, dev);
//load the module
CUmodule module = new CUmodule();
cuModuleLoad(module, "matrixRowSum.ptx");
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "rowSum");
CUdeviceptr a_dev1 = new CUdeviceptr();

// memory allocation
CUdeviceptr[] a_dev = new CUdeviceptr[P];
for(i=0;i<P;i++){
a_dev[i]=new CUdeviceptr();
// memory allocation
cuMemAlloc(a_dev[i], Sizeof.INT*4*4);
}
for(i=0;i<M;i++){
// copy the content from host to GPU
cuMemcpyHtoD(a_dev[i], Pointer.to(scores_h[i]), Sizeof.FLOAT*4*4);
}

CUdeviceptr[] b_dev = new CUdeviceptr[M];
for(i=0;i<M;i++){
b_dev[i]=new CUdeviceptr();
// memory allocation
cuMemAlloc(b_dev[i], Sizeof.INT*4*4);
}

//Pointer object that will hold all the parameters
Pointer kernelParameters = Pointer.to(
Pointer.to(a_dev),
Pointer.to(b_dev)
);
cuLaunchKernel(function, 1, 1, 1, P, 1, 1, 0, null, kernelParameters, null);
//copy back the result from the GPU to host
for(i=0;i<M;i++){
// copy the content from GPU to host
cuMemcpyDtoH(Pointer.to(sum),b_dev[i], Sizeof.FLOAT*4*4);

}
for(i=0;i<M;i++)
{
// print the result
System.out.println("sum: "+sum[i]);
}
//free the memory…
for(i = 0; i < P; i++)
{
cuMemFree(a_dev[i]);
}
for(i = 0; i < M; i++)
{
cuMemFree(b_dev[i]);
}
}
}

The matrixRowSum.ptx file contains the function from Visual C 2013, which works fine. The code is:

extern "C"
__global__ void RowSum(float* B, float* Sum, int N, int M)
{

int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;

if (rowIdx < N) {
float sum = 0;
for (int k = 0; k < M; k++)
sum += B[rowIdx*M + k];
Sum[rowIdx] = sum;
}
}

There is no error; the resulting sum is just:
run:
sum: 0
sum: 0
sum: 0
sum: 0
BUILD SUCCESSFUL (total time: 0 seconds)

Please guide me on what changes I should make in the program.

Hi richa, I think you should ask this question on the devtalk CUDA Programming and Performance forum.
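
In the meantime, a few things in the post may be worth checking. The kernel is declared as RowSum with four parameters (B, Sum, N, M) and reads B as one flat buffer through B[rowIdx*M + k], while the Java code looks up a function named rowSum and passes two arrays of separately allocated device pointers and no N or M. Calling JCudaDriver.setExceptionsEnabled(true) before cuInit should make JCuda throw on a failed driver call instead of returning the error code silently. As a rough sketch only, here is what the kernel computes, written in plain Java on the CPU, so the GPU result can be compared against it. It assumes a rectangular matrix stored row-major; the class name RowSumReference and the helpers flatten and rowSum are made up for this example and are not part of JCuda.

```java
// CPU reference for the row-sum kernel, in plain Java (no JCuda needed).
// The class and method names are hypothetical, chosen for this sketch.
class RowSumReference {

    // Copy a rows x cols matrix into one contiguous row-major array,
    // the layout the kernel assumes when it reads B[rowIdx*M + k].
    // Assumes the matrix is rectangular and has at least one row.
    static float[] flatten(float[][] matrix) {
        int rows = matrix.length;
        int cols = matrix[0].length;
        float[] flat = new float[rows * cols];
        for (int r = 0; r < rows; r++) {
            System.arraycopy(matrix[r], 0, flat, r * cols, cols);
        }
        return flat;
    }

    // Same arithmetic as the kernel: n is the number of rows, m the row length.
    // The outer loop plays the role of the thread index rowIdx.
    static float[] rowSum(float[] b, int n, int m) {
        float[] sum = new float[n];
        for (int rowIdx = 0; rowIdx < n; rowIdx++) {
            float s = 0;
            for (int k = 0; k < m; k++) {
                s += b[rowIdx * m + k];
            }
            sum[rowIdx] = s;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 4, m = 4;
        float[][] scores = new float[n][m];
        for (float[] row : scores) {
            java.util.Arrays.fill(row, 1f); // same input as in the question
        }
        float[] sums = rowSum(flatten(scores), n, m);
        for (float s : sums) {
            System.out.println("sum: " + s); // prints "sum: 4.0" for each of the 4 rows
        }
    }
}
```

On the GPU side, the flat array returned by flatten is the kind of buffer that could be copied into a single device allocation of n*m floats, with the result read back into a float array of length n rather than an int array.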