Problem with a CUDA function

Dear all, while practicing CUDA I ran into a problem with this code.

// A and B are N x N matrices

void testCUDA(float** A, float** B, int N){
    float *d_A;
    size_t pitch_A;
    size_t col_size_A = N * sizeof(float);

    cudaMallocPitch(&d_A, &pitch_A, col_size_A, N);

    for (int i = 0; i < N; i++)
        cudaMemcpy((char*)d_A + i*pitch_A, A[i], col_size_A, cudaMemcpyHostToDevice);

    for (int i = 0; i < N; i++)
        cudaMemcpy(B[i], (char*)d_A + i*pitch_A, col_size_A, cudaMemcpyDeviceToHost);
}

After calling testCUDA(A, B, N), matrix B should be identical to matrix A.

This code works as long as N is at most 1000. When I try N > 1000, matrix B no longer matches matrix A. I don't know what is happening. Please give me some advice. Thanks, all!

Are you calling this function in a loop? The memory allocated for d_A is never released, so if the function is called repeatedly, the device may run out of memory.
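For illustration, here is a minimal sketch of what releasing the buffer could look like, assuming the rest of the function stays as posted. The name testCUDAWithFree is hypothetical and only serves to distinguish it from the original; the early return on a failed allocation is likewise my own addition.

```cuda
#include <cuda_runtime.h>

// Hypothetical variant of testCUDA that releases d_A before returning.
// A and B are N x N matrices stored as arrays of row pointers.
void testCUDAWithFree(float** A, float** B, int N) {
    float* d_A = nullptr;
    size_t pitch_A = 0;
    size_t row_bytes = N * sizeof(float);

    // If the allocation fails, bail out rather than copy through an invalid pointer.
    if (cudaMallocPitch((void**)&d_A, &pitch_A, row_bytes, N) != cudaSuccess)
        return;

    // Copy each host row into its pitched row on the device.
    for (int i = 0; i < N; i++)
        cudaMemcpy((char*)d_A + i * pitch_A, A[i], row_bytes, cudaMemcpyHostToDevice);

    // Copy each device row back into B.
    for (int i = 0; i < N; i++)
        cudaMemcpy(B[i], (char*)d_A + i * pitch_A, row_bytes, cudaMemcpyDeviceToHost);

    // Without this, every call leaks its device allocation.
    cudaFree(d_A);
}
```

With the cudaFree call in place, calling the function many times should no longer accumulate device allocations.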

It is recommended to check return values of calls such as cudaMallocPitch. If you run out of memory, it will fail and return an error code.
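As a rough sketch of such a check, the snippet below wraps the allocation so that a failure is reported instead of being silently ignored. The helper names cudaOk and allocPitched are hypothetical, made up for this example; cudaGetErrorString and cudaSuccess are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a message if a CUDA runtime call failed
// and returns true only when the call succeeded.
static bool cudaOk(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical usage: allocate a pitched N x N float buffer.
// Returns nullptr on error; the caller must cudaFree() a non-null result.
float* allocPitched(int N, size_t* pitch) {
    float* d_A = nullptr;
    if (!cudaOk(cudaMallocPitch((void**)&d_A, pitch, N * sizeof(float), N),
                "cudaMallocPitch"))
        return nullptr;
    return d_A;
}
```

The same check can be applied to each cudaMemcpy call, so that a failed copy is reported instead of showing up later as a mismatch between A and B.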

Thank you, I understand the problem now. We should release the device memory before copying new data.