python pandas how to select rows with common string values and find sum, divide each element?

I have a dataframe called Results3 that looks like:

Tag        Exp. m/z    Intensity    a
file1.xls  1000        10000        True
file1.xls  1100        20000        True
file1.xls  1200        30000        True
file2.xls  2000        11000        True
file2.xls  2100        12000        True
file2.xls  2200        13000        True
file2.xls  2300        14000        True
file3.xls  3000        31000        True
file3.xls  3100        123          True

where Tag = str, and Exp. m/z & Intensity are float64.

My goal is to create another column named norm_intensity, where each element in Intensity is divided by the sum of the intensity values for each tag in order to normalize the intensity values. So for example, for the above table Exp. m/z 1000 of file1.xls, the norm_intensity would be: 10000/(10000+20000+30000). Those in file2.xls would be divided by the sum of their intensity values just from file2.xls group.

I tried finding the sum by:

groups_sum = results3.groupby(results3['Tag'])

Intensity_sum = groups_sum.agg({'Intensity':sum})

But then I can't seem to figure out how to go from here. Eventually, I will be binning results3 by:

bins = np.arange(900, 3000, 1)
groups = results3.groupby([np.digitize(results3['Exp. m/z'], bins), 'Tag'])

I hope to have results3 in this format before I do the binning:

Tag        Exp. m/z    Intensity    Norm_Intensity
file1.xls  1000        10000        0.1666
file1.xls  1100        20000        0.3333
file1.xls  1200        30000        0.5
file2.xls  2000        11000        0.22
file2.xls  2100        12000        0.24
file2.xls  2200        13000        0.26
file2.xls  2300        14000        0.28
file3.xls  3000        31000        0.9960
file3.xls  3100        123          0.00395

1 answer

  • answered 2017-06-17 18:42 Scott Boston

    Let's use groupby, transform and sum:

    Result3.assign(Norm_Intensity=Result3.groupby('Tag')['Intensity'].transform(lambda x: x/x.sum()))
    

    Output:

             Tag  Exp. m/z  Intensity     a  Norm_Intensity
    0  file1.xls      1000      10000  True        0.166667
    1  file1.xls      1100      20000  True        0.333333
    2  file1.xls      1200      30000  True        0.500000
    3  file2.xls      2000      11000  True        0.220000
    4  file2.xls      2100      12000  True        0.240000
    5  file2.xls      2200      13000  True        0.260000
    6  file2.xls      2300      14000  True        0.280000
    7  file3.xls      3000      31000  True        0.996048
    8  file3.xls      3100        123  True        0.003952