How to optimize my simple code

kranthi · March 30, 2010, 8:44am

Hi,
I am looking for suggestions on how to reduce the execution time of the code below. I have existing code that I want to speed up with CUDA, and I thought it would help to convert the character array (which was previously used throughout the rest of the code) into an integer array. The code takes a 125000-character array containing only the characters A, C, G, T and $ and writes 1, 2, 3, 4 and 0 respectively into an integer array of the same length.
To do this, I copy part of T (the character array) into shared memory and then convert the values. I intended the conversion to an integer array to improve performance, but instead it degraded. I also want to vary the size of the character array and measure the execution times of both the CUDA and the non-CUDA version as the size changes.
Please suggest ways to optimize the code below in terms of execution time, and give me a guideline for the number of threads per block and the number of blocks for this problem. I am a novice with CUDA, so please guide me.

__global__ void toint(char *Td,int *Ind)
{
extern __shared__ char sc_data[];

int inOffset  = blockDim.x * blockIdx.x;
int in  = inOffset + threadIdx.x;
sc_data[threadIdx.x] = Td[in];
		__syncthreads();
if(sc_data[threadIdx.x]=='A')Ind[in]=1;
if(sc_data[threadIdx.x]=='C')Ind[in]=2;
if(sc_data[threadIdx.x]=='G')Ind[in]=3;
if(sc_data[threadIdx.x]=='T')Ind[in]=4;
if(sc_data[threadIdx.x]=='$')Ind[in]=0;

}
int main()
{

struct timeval start, stop, echodelay;//for time
if((gettimeofday(&start, NULL)) == -1) {perror("gettimeofday"); exit(1);}//getting start time

char *Td;//character array on device
int *Ind;//integer array on device
char T[125000]=…125000 character;//character array on CPU
int In[125000];//integer array on CPU
int numThreadsPerBlock = 5;//number of threads per block
int numofblocks=25000;
int sharedMemSize = numThreadsPerBlock * sizeof(char);//shared memory size

cudaMalloc( (void **) &Td, 125000*sizeof(char) );//allocating memory on device for character array
cudaMalloc( (void **) &Ind, 125000*sizeof(int) );//allocating memory on device for integer array
cudaMemcpy(Td,T,125000*sizeof(char), cudaMemcpyHostToDevice);
toint<<<numofblocks,numThreadsPerBlock,sharedMemSize>>>(Td,Ind);
cudaMemcpy(In,Ind,125000*sizeof(int), cudaMemcpyDeviceToHost);

if((gettimeofday(&stop, NULL)) == -1){perror("gettimeofday");exit(1);}//getting end time
timeval_subtract(&echodelay, &stop, &start);//difference of time
printf("\n The time of execution is %d \n ",echodelay.tv_usec);

cudaFree(Td);
cudaFree(Ind);

return 0;
}
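For the CUDA versus non-CUDA comparison mentioned above, a CPU baseline and a full elapsed-time helper might look like the sketch below. Both `toint_cpu` and `elapsed_usec` are hypothetical names, not part of the posted code. Two things are worth keeping in mind when reading the numbers: the timed region in `main` covers both `cudaMalloc` calls and both `cudaMemcpy` calls, not only the kernel, and printing `tv_usec` alone drops any whole seconds in the difference.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical CPU baseline with the same mapping as the kernel:
   A,C,G,T,$ -> 1,2,3,4,0. Any other character leaves out[i] untouched,
   which mirrors the chain of ifs in the posted kernel. */
void toint_cpu(const char *in, int *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        switch (in[i]) {
        case 'A': out[i] = 1; break;
        case 'C': out[i] = 2; break;
        case 'G': out[i] = 3; break;
        case 'T': out[i] = 4; break;
        case '$': out[i] = 0; break;
        default:  break;
        }
    }
}

/* Hypothetical helper: total elapsed microseconds between two
   gettimeofday() readings, including the whole-second part. */
long long elapsed_usec(const struct timeval *start, const struct timeval *stop)
{
    return (long long)(stop->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(stop->tv_usec - start->tv_usec);
}
```

Calling `toint_cpu(T, In, 125000)` between two `gettimeofday` readings and passing them to `elapsed_usec` gives a number that can be set against the GPU timing for the same array size.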

You7878 · March 31, 2010, 12:10pm

First, there is no need for shared memory. Try block_size = 256 and grid_size = (NumberOfElements - 1 + block_size) / block_size.

I would do it this way. Does it reduce the execution time?

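[codebox]
__global__ void toint(char *Td, int *Ind, int NumberOfElements)
{
    // One thread per element; no shared memory needed.
    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // Guard the last block, which may extend past the end of the array.
    if (tid < NumberOfElements)
    {
        if (Td[tid] == 'A') Ind[tid] = 1;
        if (Td[tid] == 'C') Ind[tid] = 2;
        …
    }
}
[/codebox]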
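The grid-size formula suggested in this post is a ceiling division. A minimal host-side sketch, where `grid_size_for` is a hypothetical helper name and not something from the thread:

```c
#include <assert.h>

/* Hypothetical helper: smallest number of blocks such that
   blocks * block_size >= n (ceiling division). */
int grid_size_for(int n, int block_size)
{
    assert(n >= 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}
```

For 125000 elements and 256 threads per block this gives 489 blocks, i.e. 125184 threads in total, so the last 184 threads fall outside the array. That is why the kernel needs a bounds check before it touches `Td` or `Ind`.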

kbam · April 1, 2010, 12:01am

int numThreadsPerBlock = 5;//number of threads per block

With only 5 threads per block you will not get any coalesced reads/writes (and the other 27 threads per warp are doing nothing).

So as You7878 says change your block size, preferably to a multiple of 32.
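The arithmetic behind the "multiple of 32" advice can be sketched on the host as follows. The helper names are hypothetical, and the sketch assumes the usual warp width of 32 threads:

```c
#include <assert.h>

#define WARP_SIZE 32 /* assumed warp width */

/* Warps that must be scheduled for one block of `threads` threads. */
int warps_per_block(int threads)
{
    assert(threads > 0);
    return (threads + WARP_SIZE - 1) / WARP_SIZE;
}

/* Lanes left idle in the block's last, partially filled warp. */
int idle_lanes(int threads)
{
    return warps_per_block(threads) * WARP_SIZE - threads;
}
```

With 5 threads per block, `idle_lanes(5)` is 27, and the 25000 blocks occupy 25000 warps (800000 lanes) to process 125000 elements. With 256 threads per block each block fills exactly 8 warps and no lane is wasted except in the final, partially used block.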

Hopefully there are other parts of your code that you can move into CUDA so that they make use of the Ind array and the values it now holds.
