How to deal with NaN and None in Multi-label binarization

I am working on a multi-label classification project with scikit-learn. I want to binarize the target feature, but I am having trouble transforming the data.

Here is the raw data:

107              RA37|RA41|RM153 |RWT037
108    DA35|DA47|DWT030|DA35|DA47|DWT030
109                                  NaN
110                        PI001 |PI040 
111                        PI001 |PI040 
112                     RA37|RA41|RWT037
113    DA35|DA47|DWT030|DA35|DA47|DWT030
114                                  NaN
Name: exclusions, dtype: object

Then I split it into multiple columns with str.split('|', expand=True) and got the following output:

        0   1   2   3   4   5   6   7   8   9   ... 18  19  20  21  22  23  24  25  26  27
107 RA37    RA41    RM153   RWT037  None    None    None    None    None    None    ... None    None    None    None    None    None    None    None    None    None
108 DA35    DA47    DWT030  DA35    DA47    DWT030  None    None    None    None    ... None    None    None    None    None    None    None    None    None    None
109 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
110 PI001   PI040   None    None    None    None    None    None    None    None    ... None    None    None    None    None    None    None    None    None    None
111 PI001   PI040   None    None    None    None    None    None    None    None    ... None    None    None    None    None    None    None    None    None    None
112 RA37    RA41    RWT037  None    None    None    None    None    None    None    ... None    None    None    None    None    None    None    None    None    None
113 DA35    DA47    DWT030  DA35    DA47    DWT030  None    None    None    None    ... None    None    None    None    None    None    None    None    None    None
114 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

As you can see, because the column already contained many NaN values before processing, the result is a mix of NaN and None. That means I cannot pass it directly to MultiLabelBinarizer. How do I fix this problem? Thanks in advance!

1 answer

  • answered 2018-04-17 04:21 Vivek Kumar

    Assuming the following list to be your multi-label targets:

    107              RA37|RA41|RM153 |RWT037
    108    DA35|DA47|DWT030|DA35|DA47|DWT030
    109                                  NaN
    110                         PI001 |PI040 
    111                         PI001 |PI040 
    112                     RA37|RA41|RWT037
    113    DA35|DA47|DWT030|DA35|DA47|DWT030
    114                                  NaN
    

    Part 1: Handling the NaN:

    There are multiple ways to handle the NaNs:

    1) NaN as a target doesn't make sense. If you don't know what the target is, how will you train the model on that sample, and how will you compare the target to the model's output? So the solution here is to remove every sample (row) whose target is NaN. The resulting targets will look like this:

    107              RA37|RA41|RM153 |RWT037
    108    DA35|DA47|DWT030|DA35|DA47|DWT030
    110                         PI001 |PI040 
    111                         PI001 |PI040 
    112                     RA37|RA41|RWT037
    113    DA35|DA47|DWT030|DA35|DA47|DWT030
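
    As a rough sketch of option 1, assuming the targets live in a pandas Series like the one in the question (the `exclusions` series below is rebuilt by hand for illustration, and `known` is a hypothetical name), the NaN rows could be dropped with `dropna`:

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the `exclusions` column shown in the question.
exclusions = pd.Series(
    [
        "RA37|RA41|RM153 |RWT037",
        "DA35|DA47|DWT030|DA35|DA47|DWT030",
        np.nan,
        "PI001 |PI040 ",
        "PI001 |PI040 ",
        "RA37|RA41|RWT037",
        "DA35|DA47|DWT030|DA35|DA47|DWT030",
        np.nan,
    ],
    index=range(107, 115),
    name="exclusions",
)

# Keep only the rows whose target is known; the original index is preserved.
known = exclusions.dropna()

# If a feature table X shares this index, the same rows would have to be
# selected there too (for example X.loc[known.index]) to keep X and y aligned.
```

    Because the index is preserved, `known.index` can be used to pick the matching feature rows.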
    

    2) Replace the NaN with a new label, something like UNKNOWN or UNCLASSIFIED.

    107              RA37|RA41|RM153 |RWT037
    108    DA35|DA47|DWT030|DA35|DA47|DWT030
    109                              UNKNOWN
    110                         PI001 |PI040 
    111                         PI001 |PI040 
    112                     RA37|RA41|RWT037
    113    DA35|DA47|DWT030|DA35|DA47|DWT030
    114                              UNKNOWN
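
    A minimal sketch of option 2, again assuming a pandas Series of targets (shortened and rebuilt by hand here; `filled` is a hypothetical name). The placeholder string is an arbitrary choice and should not collide with any real label:

```python
import numpy as np
import pandas as pd

# Shortened, hand-built stand-in for the `exclusions` column in the question.
exclusions = pd.Series(
    ["RA37|RA41|RM153 |RWT037", "DA35|DA47|DWT030", np.nan, "PI001 |PI040 ", np.nan],
    name="exclusions",
)

# Replace every missing target with a single placeholder label.
filled = exclusions.fillna("UNKNOWN")
```

    After this step every row is a plain string, so the later split on '|' works uniformly.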
    

    Part 2: Using MultiLabelBinarizer:

    With either solution, you end up with a list of target strings something like this (shown here for option 2, with the stray spaces around the labels removed):

    y = ['RA37|RA41|RM153|RWT037', 'DA35|DA47|DWT030|DA35|DA47|DWT030', 'UNKNOWN', 'PI001|PI040', 'PI001|PI040', 'RA37|RA41|RWT037', 'DA35|DA47|DWT030|DA35|DA47|DWT030', 'UNKNOWN']
    

    But MultiLabelBinarizer expects a list of lists (one collection of labels per sample), so we need to split the above strings as you were doing:

    y = [y_val.split('|') for y_val in y]
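
    Note that the raw data in the question has trailing spaces in some labels ('RM153 ', 'PI001 '), which a plain split would keep as distinct labels. A small sketch of a split that also strips whitespace, using a hypothetical helper `split_labels` and hypothetical names `y_raw` and `y_split`:

```python
# A few raw target strings, including the stray spaces seen in the question.
y_raw = ["RA37|RA41|RM153 |RWT037", "PI001 |PI040 ", "UNKNOWN"]


def split_labels(value, sep="|"):
    """Split one delimited string into a list of stripped, non-empty labels."""
    return [part.strip() for part in value.split(sep) if part.strip()]


# One list of clean labels per sample.
y_split = [split_labels(value) for value in y_raw]
```

    Repeated labels inside one sample do not need special handling here, since the binarizer only records whether a label is present.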
    

    Now y is in the correct format, so we can use MultiLabelBinarizer:

    from sklearn.preprocessing import MultiLabelBinarizer
    mlb = MultiLabelBinarizer()
    y_encoded = mlb.fit_transform(y)
    
    # Output:
    array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
           [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
           [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 1, 1, 0, 1, 0],
           [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
    

    This can now be used as y in a model of your choice (which should support the indicator matrix format above).
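
    To see which label each column of the indicator matrix stands for, the fitted binarizer's `classes_` attribute and `inverse_transform` method can help. A short sketch on a reduced version of the targets above (the variable names `column_labels` and `decoded` are illustrative):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Reduced version of the split targets from the answer.
y = [
    ["RA37", "RA41", "RM153", "RWT037"],
    ["DA35", "DA47", "DWT030", "DA35", "DA47", "DWT030"],
    ["UNKNOWN"],
    ["PI001", "PI040"],
]

mlb = MultiLabelBinarizer()
y_encoded = mlb.fit_transform(y)

# classes_ lists the label behind each column, in sorted order.
column_labels = list(mlb.classes_)

# inverse_transform turns indicator rows back into tuples of labels,
# which is handy for reading a model's predictions.
decoded = mlb.inverse_transform(y_encoded)
```

    The same fitted `mlb` should be reused to decode predictions, so that the column order matches the one used during training.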