Numpy np.newaxis

saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'][:, np.newaxis])

Can anyone please explain what's happening in this line? Why is newaxis being used here? I know what newaxis does in general, but I can't figure out its use in this particular situation.

Thanks in advance.

1 answer

  • answered 2017-08-12 09:45 MaxU

    df_train['SalePrice'] is a pandas Series (a vector / 1D array) of shape (N,), where N is the number of elements.

    Since version 0.17, scikit-learn methods have deprecated 1D arrays (vectors) as input data; they expect 2D arrays of shape (n_samples, n_features).

    df_train['SalePrice'][:,np.newaxis]
    

    transforms the 1D array (shape: (N,)) into a 2D array (shape: (N, 1), i.e. N rows and 1 column).
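
As a minimal sketch of the same reshaping on a plain NumPy array (the variable names and data here are illustrative, not from the question), `[:, np.newaxis]`, `[:, None]` and `reshape(-1, 1)` all produce the same single-column 2D array:

```python
import numpy as np

# A 1D array standing in for the values of a single column (illustrative data).
prices = np.array([4, 7, 1, 7, 7])

col_newaxis = prices[:, np.newaxis]   # insert a new axis of length 1 at position 1
col_none = prices[:, None]            # np.newaxis is an alias for None
col_reshape = prices.reshape(-1, 1)   # -1 lets NumPy infer the number of rows

print(prices.shape)       # (5,)
print(col_newaxis.shape)  # (5, 1)
print(np.array_equal(col_newaxis, col_reshape))  # True
```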

    Demo:

    In [21]: df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=list('abc'))
    
    In [22]: df
    Out[22]:
       a  b  c
    0  4  3  8
    1  7  5  6
    2  1  3  9
    3  7  5  7
    4  7  0  6
    
    In [23]: from sklearn.preprocessing import StandardScaler
    
    In [24]: df['a'].shape
    Out[24]: (5,)      # <--- 1D array
    
    In [25]: df['a'][:, np.newaxis].shape
    Out[25]: (5, 1)    # <--- 2D array
    

    There is also a Pandas way to do the same: indexing with a list of column names returns a one-column DataFrame, which is already 2D:

    In [26]: df[['a']].shape
    Out[26]: (5, 1)    # <--- 2D array
    
    In [27]: StandardScaler().fit_transform(df[['a']])
    Out[27]:
    array([[-0.5 ],
           [ 0.75],
           [-1.75],
           [ 0.75],
           [ 0.75]])
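
For reference, the numbers above can be reproduced by hand. This sketch assumes StandardScaler's default settings (subtract the column mean, then divide by the population standard deviation, i.e. ddof=0) and hard-codes the values of column a from the demo:

```python
import numpy as np

# Column 'a' from the demo, already shaped as (5, 1): 5 samples, 1 feature.
a = np.array([4, 7, 1, 7, 7], dtype=float).reshape(-1, 1)

mean = a.mean(axis=0)       # per-column mean
std = a.std(axis=0)         # per-column population standard deviation (ddof=0)
scaled = (a - mean) / std   # broadcasting applies the column statistics to every row

print(scaled.ravel())       # approximately -0.5, 0.75, -1.75, 0.75, 0.75
```

The axis=0 reductions are why the 2D shape matters: each column is treated as one feature and each row as one sample.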
    

    What happens if we pass a 1D array:

    In [28]: StandardScaler().fit_transform(df['a'])
    C:\Users\Max\Anaconda4\lib\site-packages\sklearn\utils\validation.py:429: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.
      warnings.warn(msg, _DataConversionWarning)
    C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
      warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
    C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
      warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
    Out[28]: array([-0.5 ,  0.75, -1.75,  0.75,  0.75])
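
Following the hint in the deprecation message, here is a sketch of the same call with an explicitly reshaped column. It assumes reasonably recent pandas and scikit-learn versions; `Series.to_numpy()` is used first because newer pandas releases may reject multi-dimensional indexing such as `df['a'][:, np.newaxis]` directly on a Series:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Column 'a' from the demo above, rebuilt with fixed values.
df = pd.DataFrame({"a": [4, 7, 1, 7, 7]})

# Convert the Series to a NumPy array, then reshape 1D (5,) -> 2D (5, 1).
X = df["a"].to_numpy().reshape(-1, 1)
scaled = StandardScaler().fit_transform(X)

print(X.shape)       # (5, 1)
print(scaled.shape)  # (5, 1)
```

Passing `df[['a']]` instead of `X` should give the same result, since a one-column DataFrame is already 2D.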