Can't modify data on the GPU

Hi, I'm starting with CUDA and I have some simple code to understand how everything works here.

Code:

extern "C" void
CudaInitMemory(float* h_A ,float* h_B, float* h_C, unsigned int mem_size)
{

// allocate device memory
float* d_A;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, mem_size));
float* d_B;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_B, mem_size));

// copy host memory to device
CUDA_SAFE_CALL(cudaMemcpy(d_A, h_A, mem_size,
                          cudaMemcpyHostToDevice) );
CUDA_SAFE_CALL(cudaMemcpy(d_B, h_B, mem_size,
                          cudaMemcpyHostToDevice) );
// allocate device memory for result
float* d_C;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, mem_size));                    
 
// setup execution parameters

// execute the kernel
SUMA <<< 1 , 100 >>> ( d_A, d_C);
   
CUDA_SAFE_CALL(cudaMemcpy(h_B, d_A, mem_size,
                          cudaMemcpyDeviceToHost) );
CUDA_SAFE_CALL(cudaMemcpy(h_C, d_C, mem_size,
                          cudaMemcpyDeviceToHost) );
                          
// clean up memory

CUDA_SAFE_CALL(cudaFree(d_A));
CUDA_SAFE_CALL(cudaFree(d_B));
CUDA_SAFE_CALL(cudaFree(d_C));                          

}

__global__ void SUMA (float* A , float* C)
{
float Aux = 0;
__shared__ float s_A[100];
__shared__ float s_C[100];

// Block index
int bx = blockIdx.x;
int by = blockIdx.y;

// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
int i = threadIdx.x;
C[i]=1; // I want to write "1" in each element to see if it makes anything!
__syncthreads();

}

If I use s_A as an intermediate buffer it shows the same…

What am I doing wrong??
h_A is copied properly: after this code runs, h_B contains the same values as h_A.
But h_C contains 0 after the code runs…

Any help??
Thanks!

Are you running in Debug mode? CUDA_SAFE_CALL won’t report execution failures otherwise.
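
As a sketch, you can also check the launch and the kernel execution by hand inside CudaInitMemory, without relying on CUDA_SAFE_CALL (cudaGetLastError and cudaThreadSynchronize are standard runtime calls):

SUMA <<< 1 , 100 >>> ( d_A, d_C);
// check for configuration/launch errors:
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
// wait for the kernel to finish and catch execution errors:
err = cudaThreadSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));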

Your problem is when you call:
SUMA <<< 1 , 100 >>> ( d_A, d_C); // you have only one block, with 100 threads.

__shared__ float s_A[100]; // shared memory is only visible within the same block, but you don't use s_A in your code anyway.

int ty = threadIdx.y; // this has no effect because your threads only vary along threadIdx.x. You could launch SUMA<<<1, dim3(5,20)>>> instead; then you'd have one block with threadIdx.x (0…4) and threadIdx.y (0…19), as sketched below.

And how is h_C defined? // float* h_C = (float*) malloc(mem_size); ??

P.S.: And how do you know that h_C contains 0? Be careful: you are using float, so printing a float with printf needs %f.
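
As a sketch of the 2D launch mentioned above, using dim3 (the CUDA type for block and grid dimensions), with the matching flat index inside the kernel:

dim3 block(5, 20);                 // threadIdx.x in 0…4, threadIdx.y in 0…19
SUMA <<< 1 , block >>> (d_A, d_C);

// inside the kernel, a flat index covering all 100 threads:
int i = threadIdx.y * blockDim.x + threadIdx.x;   // 0…99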

I used int ty = threadIdx.y in other thread and grid configurations; I've tried so many configs… 100 blocks with 1 thread, 1 block with 100 threads, 10 blocks with 10 threads…

I print the vectors in the main program this way:

for (a = 0; a < SIZE*SIZE; a++)
{
  cout << h_C[a] << "  ";
}

h_A is copied properly into h_B, but h_C is 000…0 after the GPU is used…

This is how h_C is defined: float* h_C = (float*) malloc(mem_size_C);

I don't know why I keep receiving 0 back from the GPU…

The problem isn’t arising due to threadIdx.y. threadIdx.y will be 0 in each of your threads, since you’re using a one-dimensional block. No big deal there.

Not sure if this is the problem, but you need a third configuration parameter. The configuration parameters are: <<<gridDim, blockDim, shared_mem_size>>> You aren’t indicating how much shared memory you are using. It should be 200*sizeof(float) since you are allocating two 100-element float arrays.

Try copy-pasting this. Ask me if you don't understand.

#include <cutil.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

__global__ void SUMA (float* C)
{
int i = blockIdx.x * blockDim.x * blockDim.y + blockDim.x * threadIdx.y + threadIdx.x;
C[i]=C[i]+1.0;

}

int main(int argc, char** argv)
{

unsigned int size =100;
unsigned int mem_size = sizeof(float) * size;

float* h_C = (float*) malloc(mem_size);

for (int i = 0; i < size; i++)
{
h_C[i] = (float)i;
}

float* d_C;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, mem_size));
CUDA_SAFE_CALL(cudaMemcpy(d_C, h_C, mem_size, cudaMemcpyHostToDevice) );
SUMA <<< 1 , size >>> (d_C);
CUDA_SAFE_CALL(cudaMemcpy(h_C,d_C, mem_size, cudaMemcpyDeviceToHost) );
for (int g = 0; g<size; g++)
printf("g: %d, value: %f\n", g, h_C[g]);
//printf("h_C[%d]=%f\n", atoi(argv[1]), h_C[atoi(argv[1])]);
CUDA_SAFE_CALL(cudaFree(d_C));
free(h_C);

}

Wow, I appreciate so much that you want to help me.

This code shows me h_C as [0 1 2 3 … 99].

I should see h_C = [1 2 3 4 … 100] if I understood the code properly…

Other info: I'm running Vista 64-bit, compiling for 32-bit, on a 9800GX2 with 180.48 WHQL drivers.

I tried the code under XP 32-bit and it did the same (my code).

Can you modify and then compile the NVIDIA examples?

The examples modify data properly on my GPU. I have modified them to print the matrix, and it contains the proper result; if I set h_C=1 in the kernel, it shows 1 properly afterwards.

There has to be some typical mistake here that always returns "0" in the data output.

I don't know what…

Have you paid any attention at all to my original response? If you weren't specifying how much shared memory to allocate, it's possible your kernel was failing when you tried declaring some of it inside your kernel, which would result in the kernel not being executed. This would leave the memory space undefined, which (very easily) could return an array of zeros.

The shared_mem_size parameter to the kernel launch is only for allocating additional shared memory in extern __shared__ arrays (see the programming guide). It is not needed when shared arrays are statically declared and sized at compile time.
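
For comparison, a minimal sketch of the extern __shared__ case, where the third launch parameter is what sizes the array:

__global__ void SUMA (float* C)
{
    extern __shared__ float s_C[];   // sized by the launch, not here
    int i = threadIdx.x;
    s_C[i] = C[i];
    __syncthreads();
    C[i] = s_C[i] + 1.0f;
}

// the third parameter gives the dynamic shared memory size in bytes:
SUMA <<< 1, 100, 100 * sizeof(float) >>> (d_C);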

My apologies. I suppose I've always allocated it as extern __shared__ arrays, which is why I was thinking this was necessary. Thanks for the clarification.

Why the **** DOES - THIS - CODE - NOT - MODIFY - ANYTHING - ON - MY - GPU ???

I think everything is OK!!!

I've also tried this:

__global__ void SUMA (float* C)
{
__shared__ float s_C[100];

int i = blockIdx.x * blockDim.x * blockDim.y + blockDim.x * threadIdx.y + threadIdx.x;

s_C[i] = C[i];
__syncthreads();
s_C[i] = 0;
C[i] = s_C[i];
}

Could any of you pass me code which creates, for example, a vector[100] containing 1 1 1 … 1, passes it to the GPU, and afterwards contains 2 2 2 … 2? If you can make this simple code for me and try it on your GPU, I will appreciate it so, so much… I'm close to abandoning my CUDA project… I must be a complete idiot not to understand this…
I need to know whether that simple code works on your GPU and whether or not it works on mine… It would help me get a better start at understanding CUDA…
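
Here is a minimal sketch of exactly that test, adapted from the working example earlier in the thread (the kernel name AddOne is just illustrative). If everything is set up correctly it should print 2.000000 one hundred times:

#include <cutil.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void AddOne (float* C)
{
    int i = threadIdx.x;
    C[i] = C[i] + 1.0f;           // 1 1 1 … 1 becomes 2 2 2 … 2
}

int main(int argc, char** argv)
{
    unsigned int size = 100;
    unsigned int mem_size = sizeof(float) * size;

    float* h_C = (float*) malloc(mem_size);
    for (unsigned int i = 0; i < size; i++)
        h_C[i] = 1.0f;            // fill the host vector with 1s

    float* d_C;
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, mem_size));
    CUDA_SAFE_CALL(cudaMemcpy(d_C, h_C, mem_size, cudaMemcpyHostToDevice));
    AddOne <<< 1, size >>> (d_C);
    CUDA_SAFE_CALL(cudaMemcpy(h_C, d_C, mem_size, cudaMemcpyDeviceToHost));

    for (unsigned int g = 0; g < size; g++)
        printf("%f ", h_C[g]);    // expect 2.000000 everywhere
    printf("\n");

    CUDA_SAFE_CALL(cudaFree(d_C));
    free(h_C);
    return 0;
}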

There must be a problem with how you're compiling, or it might be a very silly mistake like misplacing the output file.

You said you can compile the SDK samples and modify them successfully. Take an SDK sample project and modify it step-by-step, to eventually look like the code posted here. See if that will work.

Try to backtrack to something that works, and change it little by little. There is something very simple you’re doing wrong, and you’ll figure it out eventually. Sorry about your frustration.

P.S. How exactly are you compiling this, anyway? Are you starting from the “template” SDK project?

It WORKS!!! It was my new project configuration. Can any of you explain to me how to configure the output file?

I don't have any idea of what is needed to configure a CUDA project; I've only added the CUDA directories to the project.

Thank you!

For simple one file projects, just compile on the command line:

nvcc -o test test.cu

For more complicated projects, I use CMake (www.cmake.org) with FindCUDA.cmake (search the web to find it).
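
For example, a minimal CMakeLists.txt sketch (assuming FindCUDA.cmake is on CMake's module path):

cmake_minimum_required(VERSION 2.6)
project(cuda_test)
find_package(CUDA REQUIRED)
cuda_add_executable(test test.cu)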