Performance issue with a grouping algorithm on the GPU

I need to sum the values and count the number of elements in each group.

I have about 32,000 groups, and each object belongs to about 1,000 of them.

Here is my code for the sum:

kernel void VectorAdd3(
    global const int* index,     // indexed by object
    global const float* values,  // indexed by object
    global const short* data,    // indexed by factor * 100000 + object
    global float2* mx)           // per-group accumulator: (sum, count)
{
    int factor = get_global_id(0);
    int fin = factor * 32;       // base offset into mx for this work-item

    for (int i = 0, fi = factor * 100000; i < 100000; i++, fi++) {
        int mindex = index[i] + fin + data[fi];
        mx[mindex] += (float2)(values[i], 1.0f);
    }
}

factor takes 1024 values (the global work size is 1024).
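
For comparing GPU and CPU results, the kernel above can be mirrored by a sequential reference in plain C. This is only a minimal sketch: `float2_ref`, `vector_add3_ref`, `n_factors`, `n_objects` and `mx_len` are hypothetical names that are not part of my real code, and the bounds check assumes that `mx` holds `mx_len` accumulators.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the kernel's float2 accumulator: x = running sum, y = element count. */
typedef struct { float x; float y; } float2_ref;

/* Sequential reference for VectorAdd3 (hypothetical helper).
   n_factors plays the role of the global work size (1024 in the post),
   n_objects the role of the loop bound (100000 in the post). */
static void vector_add3_ref(int n_factors, int n_objects,
                            const int *index, const float *values,
                            const short *data,
                            float2_ref *mx, size_t mx_len)
{
    for (int factor = 0; factor < n_factors; factor++) {
        int fin = factor * 32;                            /* same base offset as the kernel */
        size_t fi = (size_t)factor * (size_t)n_objects;   /* start of this factor's row in data */
        for (int i = 0; i < n_objects; i++, fi++) {
            int mindex = index[i] + fin + data[fi];
            /* Catch group indices that fall outside the accumulator array. */
            assert(mindex >= 0 && (size_t)mindex < mx_len);
            mx[mindex].x += values[i];
            mx[mindex].y += 1.0f;
        }
    }
}
```

Calling it with `n_factors = 1024` and `n_objects = 100000` on the same input buffers should reproduce what the kernel computes, up to floating-point rounding from the order of the additions.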

The code running on the GPU is much slower than on the CPU.

How can I improve the performance of my implementation, or do I need a different algorithm for this task?