Simple long operation test

Hi. I have been developing a computational library for Windows (using CUDA) under VS2010 with the CUDA runtime API.
I ran into an unexpected problem when processing large arrays. Below is the code of a simple long-operation test that I use to illustrate the problem.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

const int blocksize = 16;

__global__ void performOperation(float* A, int colsA, int rowsA, float4* B,
        int lenB) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (!(((0 <= i) && (i < colsA)) && ((0 <= j) && (j < rowsA))))
        return;
    int inx = i + j * colsA;
    for (int k = 0; k < lenB; k++) {
        A[inx] += B[k].x + B[k].y + B[k].z;
    }
}

__global__ void fillA(float* A, int colsA, int rowsA) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (!(((0 <= i) && (i < colsA)) && ((0 <= j) && (j < rowsA))))
        return;
    int inx = i + j * colsA;
    A[inx] = 0.0f;	
}

__global__ void fillB(float4* B, int lenB) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (!(((0 <= i) && (i < lenB))))
        return;
    B[i].x = 1.0f;
    B[i].y = 1.0f;
    B[i].z = 1.0f;
}

int main(void) {

    int colsA = 1000;
    int rowsA = 1000;

    size_t sizeA = colsA * rowsA * sizeof(float);
    float* dvA;
    cudaMalloc((void**) &dvA, sizeA);

    int lenB = 20000;
    size_t sizeB = lenB * sizeof(float4);
    float4* dvB;
    cudaMalloc((void**) &dvB, sizeB);

    dim3 dimBlock1(blocksize, blocksize);
    dim3 dimGrid1(ceil(float(colsA) / float(dimBlock1.x)),
            ceil(float(rowsA) / float(dimBlock1.y)));
    fillA<<<dimGrid1, dimBlock1>>>(dvA, colsA, rowsA);
    cudaThreadSynchronize();

    dim3 dimBlock2(blocksize * blocksize);
    dim3 dimGrid2(ceil(float(lenB) / float(dimBlock2.x)));
    fillB<<<dimGrid2, dimBlock2>>>(dvB, lenB);
    cudaThreadSynchronize();

    performOperation<<<dimGrid1, dimBlock1>>>(dvA, colsA, rowsA, dvB, lenB);
    cudaThreadSynchronize();

    float* a = new float[colsA * rowsA];
    cudaMemcpy(a, dvA, sizeA, cudaMemcpyDeviceToHost);

    for(int i = 0; i < 100; i++)
    {
        printf("%i %f \n", i, a[i]);
    }

    delete[] a;
    cudaFree(dvA);
    cudaFree(dvB);

    return 0;
}

The task is to process a two-dimensional array of float (cols x rows) that depends on a linear array of float4 (len). The length of the float4 array has to be between 20000 and 100000 elements; this is a significant condition. The size of the float array has to be approximately 1500 x 1500.
The maximum size of the float4 array is 100000 x 16 bytes ~ 1.5 MB.
The maximum size of the float array is 1500 x 1500 x 4 bytes ~ 8.6 MB.
To estimate what CUDA can do, I wrote the program shown in the code above.
colsA & rowsA - the size of the two-dimensional float array
dvA - device representation of the two-dimensional float array

lenB - the length of the float4 array
dvB - device representation of the float4 array

The critical section of the code is in the __global__ function performOperation:

__global__ void performOperation(float* A, int colsA, int rowsA, float4* B,
        int lenB) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (!(((0 <= i) && (i < colsA)) && ((0 <= j) && (j < rowsA))))
        return;
    int inx = i + j * colsA;
    for (int k = 0; k < lenB; k++) {
        A[inx] += B[k].x + B[k].y + B[k].z;
    }
}

The idea is that I have to use all values of array B in order to compute the correct value of each element of array A.

My system crashes when I try to run the program with the indicated parameters (1500 x 1500 x 20000). I tested this code on a GeForce 9800 GT 1GB with driver v306.94, and also under Ubuntu 12.04 on the same computer. The program stops crashing when I significantly reduce the size of the arrays. I can't understand what is wrong.

I'll be very thankful to anyone who can show me the right way to think about this; I have no idea what to do. I can only add that I tried an alternative version of the code which uses 4 linear arrays instead of one float4 array; of course, the result is the same.

When you say crash, what exactly happens? Have you checked the return values from your kernel call to see if it gives a helpful error?

At any rate, this kind of problem looks like it could be more easily solved using Thrust and a reduction.
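For reference, a minimal sketch of that kind of error checking (the checkLast helper below is just an illustration, not part of the original program); a launch/configuration problem shows up from cudaGetLastError(), and an error raised while the kernel ran shows up after synchronizing:

#include <stdio.h>
#include "cuda_runtime.h"

// Illustrative helper, not part of the original program:
// report any error from the last kernel launch and from its execution.
void checkLast(const char* where) {
    cudaError_t err = cudaGetLastError();     // launch/configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();        // errors raised while the kernel ran
    if (err != cudaSuccess)
        printf("%s failed: %s\n", where, cudaGetErrorString(err));
}

// Usage, e.g. right after the launch in the test program:
//   performOperation<<<dimGrid1, dimBlock1>>>(dvA, colsA, rowsA, dvB, lenB);
//   checkLast("performOperation");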

When I use 1500 x 1500 x 20000, the graphics driver reports that it has to be restarted and then restarts successfully. After several runs Windows dies completely and my computer reboots.
The computation also stops, and the float array A contains incorrect values; each element should be 3 times lenB (60000 in this example), since every iteration adds 1 + 1 + 1.
It can't be a hardware problem because I tested this code on another computer with a more powerful graphics card. Unfortunately I cannot give its parameters right now, but I'll do that today.

I have to note one important thing. The kernel crashes when I try to access any field of the k-th item of array B. For example, I can do float4 p = B[k], but I cannot do A[inx] = p.x.

  • At any rate, this kind of problem looks like it could be more easily solved using Thrust and a reduction.
  • Could you give an example for this task?

    This looks very much like the watchdog timer kicking in to keep your GUI responsive. For more info check my signature.

    Tera, if I understood correctly, I have to try the next two things:

    1. Turn off the watchdog timer
    2. Optimize the code so that it works faster.

    I recommend that any CUDA programmer take a look at the best practices document http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

    Pasoleatis: Yes, that is a very useful document. I'm not aware that it has information about the watchdog timer as well, though?

    BigBender: You can also share the work between multiple kernel invocations, each doing less work. Make sure to call cudaStreamQuery(0) in between kernel launches. Otherwise the Windows WDDM driver might batch them up again to reduce launch overhead, which would make the timeout apply to the whole batch again.
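    (A rough, untested sketch of that multi-launch idea; the chunked kernel performOperationChunk, the helper function, and the chunkSize value below are illustrative assumptions, not part of the original code.)

    #include "cuda_runtime.h"

    // Hypothetical variant of performOperation that only processes B[first .. first+count).
    __global__ void performOperationChunk(float* A, int colsA, int rowsA,
            float4* B, int first, int count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= colsA || j >= rowsA)
            return;
        int inx = i + j * colsA;
        float acc = 0.0f;
        for (int k = first; k < first + count; k++)
            acc += B[k].x + B[k].y + B[k].z;
        A[inx] += acc;
    }

    // Host side: split the loop over B into several launches so that each launch
    // finishes well within the display watchdog limit.
    void performOperationInChunks(float* dvA, int colsA, int rowsA,
            float4* dvB, int lenB, dim3 dimGrid, dim3 dimBlock) {
        const int chunkSize = 1000;  // illustrative value, tune as needed
        for (int first = 0; first < lenB; first += chunkSize) {
            int count = (lenB - first < chunkSize) ? (lenB - first) : chunkSize;
            performOperationChunk<<<dimGrid, dimBlock>>>(dvA, colsA, rowsA,
                    dvB, first, count);
            cudaStreamQuery(0);  // nudge WDDM to submit each launch instead of batching them
        }
        cudaDeviceSynchronize();
    }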

    Something like this, using Thrust:

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>

    thrust::device_vector<float> A(colsA * rowsA, 0.0f);

    thrust::device_vector<float> Bx(lenB, 1.0f);
    thrust::device_vector<float> By(lenB, 1.0f);
    thrust::device_vector<float> Bz(lenB, 1.0f);

    // One reduction per component of B, summed on the host.
    float sum = thrust::reduce(Bx.begin(), Bx.end())
              + thrust::reduce(By.begin(), By.end())
              + thrust::reduce(Bz.begin(), Bz.end());

    // Every element of A receives the same sum.
    for (int i = 0; i < rowsA; i++) {
        for (int j = 0; j < colsA; j++) {
            A[i * colsA + j] = sum;
        }
    }
    

    Obviously this code is untested as I wrote it in the browser.

    Thank you everybody very much for the help. It really was the watchdog timer. When I set the timer delay value to 20 seconds, everything started working normally. Moreover, I could change the size of the float array within the conditions of my task. Now I'm going to work on the code in order to optimize it.

    Pasoleatis, I had read this document as well as other information about CUDA. Like Tera, I didn't find any info about the watchdog timer there. In my opinion it is a very important thing that a programmer has to know about CUDA behavior on Windows.