Replace pandas DataFrame column values with a unique id

I have a DataFrame with millions of rows, and one of its columns, 'TYPE', holds strings. This column has 400 distinct values in total, and I want to replace them with integer ids from 1 to 400. I also want to export the 'TYPE' => id dictionary for future reference. I tried to_dict, but it did not help. Can anyone help with this?

1 answer

  • answered 2018-01-14 10:04 MaxU

    Option 1: you can use pd.factorize, which numbers the values in order of first appearance:

    df['new'] = pd.factorize(df['str_col'])[0]+1
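
    pd.factorize also returns the unique values, so the 'TYPE' => id dictionary asked for in the question can be built in the same step. A minimal sketch, assuming a small made-up sample in place of the real data (the column name TYPE_ID and the variable type_to_id are illustrative names, not from the question):

```python
import pandas as pd

# Hypothetical sample standing in for the real 'TYPE' column.
df = pd.DataFrame({"TYPE": ["car", "bike", "car", "truck", "bike"]})

# factorize returns (codes, uniques); codes are 0-based,
# numbered in order of first appearance.
codes, uniques = pd.factorize(df["TYPE"])
df["TYPE_ID"] = codes + 1

# Build the 'TYPE' => id dictionary from the unique values.
type_to_id = {value: i for i, value in enumerate(uniques, start=1)}
print(type_to_id)
```

    Note that missing values get code -1 from factorize, so they end up as id 0 after adding 1.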
    

    Option 2: using category dtype:

    df['new'] = df['str_col'].astype('category').cat.codes+1
    

    Or, even better, just convert the column to categorical dtype:

    df['str_col'] = df['str_col'].astype('category')
    

    and when you need numbers instead, use the category codes (they are 0-based, so add 1 for ids starting at 1):

    df['str_col'].cat.codes
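
    One difference from pd.factorize is worth seeing on a small example: when a string column is converted with astype('category'), the categories are sorted, so the ids follow sorted order rather than order of first appearance. A sketch with made-up data (the variable ids is an illustrative name):

```python
import pandas as pd

# Hypothetical sample standing in for the real 'TYPE' column.
df = pd.DataFrame({"TYPE": ["car", "bike", "car", "truck", "bike"]})
df["TYPE"] = df["TYPE"].astype("category")

# Categories are stored in sorted order: bike, car, truck.
print(list(df["TYPE"].cat.categories))

# Codes are 0-based positions into the categories; add 1 for ids from 1.
ids = df["TYPE"].cat.codes + 1
print(ids.tolist())
```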
    

    Thanks to @jezrael for extending the answer. To create the dictionary (str_col must already be categorical, as above; the ids match cat.codes + 1):

    cats = df['str_col'].cat.categories
    d = dict(zip(cats, range(1, len(cats) + 1)))
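
    To keep the dictionary for future reference, one option is to write it to a JSON file and map it back onto raw strings later. This is only a sketch: the file name type_ids.json and the sample data are assumptions, not part of the original answer.

```python
import json
import pandas as pd

# Hypothetical sample standing in for the real 'TYPE' column.
df = pd.DataFrame({"TYPE": ["car", "bike", "car", "truck", "bike"]})
df["TYPE"] = df["TYPE"].astype("category")

cats = df["TYPE"].cat.categories
d = dict(zip(cats, range(1, len(cats) + 1)))

# 'type_ids.json' is a made-up file name for this example.
with open("type_ids.json", "w") as f:
    json.dump(d, f, indent=2)

# Later: reload the mapping and apply it to raw strings.
with open("type_ids.json") as f:
    loaded = json.load(f)

new_ids = pd.Series(["truck", "car"]).map(loaded)
print(new_ids.tolist())
```

    Values that are not in the saved dictionary come back as NaN from map, which makes unseen types easy to spot.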
    

    PS: the category dtype is also very memory-efficient, since each distinct string is stored only once and the rows hold small integer codes.
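
    The saving can be checked with memory_usage(deep=True). A rough sketch, assuming 400 made-up labels repeated to 400,000 rows; the exact byte counts depend on the pandas version and the string dtype in use, so none are quoted here:

```python
import pandas as pd

# Hypothetical data: 400 distinct labels, 400,000 rows in total.
labels = [f"type_{i}" for i in range(400)]
s_str = pd.Series(labels * 1000)
s_cat = s_str.astype("category")

# deep=True also counts the memory held by the strings themselves.
str_bytes = s_str.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(str_bytes, cat_bytes)

# With 400 categories the per-row codes fit in int16.
print(s_cat.cat.codes.dtype)
```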