Numpy np.newaxis
saleprice_scaled = \
    StandardScaler().fit_transform(df_train['SalePrice'][:, np.newaxis])
Can anyone please explain what's happening in this line? Why is np.newaxis being used here? I know the general use of np.newaxis, but I can't figure out its use in this particular situation.
Thanks in advance.
1 answer

df_train['SalePrice']
is a pandas.Series (a vector / 1D array) of shape (N,). Modern (0.17+) scikit-learn methods don't accept 1D arrays (vectors); they expect 2D arrays.
df_train['SalePrice'][:,np.newaxis]
transforms the 1D array (shape: (N,)) into a 2D array (shape: (N, 1)).
Demo:
In [21]: df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=list('abc'))

In [22]: df
Out[22]:
   a  b  c
0  4  3  8
1  7  5  6
2  1  3  9
3  7  5  7
4  7  0  6

In [23]: from sklearn.preprocessing import StandardScaler

In [24]: df['a'].shape
Out[24]: (5,)       # < 1D array

In [25]: df['a'][:, np.newaxis].shape
Out[25]: (5, 1)     # < 2D array
There is Pandas way to do the same:
In [26]: df[['a']].shape
Out[26]: (5, 1)     # < 2D array

In [27]: StandardScaler().fit_transform(df[['a']])
Out[27]:
array([[-0.5 ],
       [ 0.75],
       [-1.75],
       [ 0.75],
       [ 0.75]])
What happens if we pass a 1D array:
In [28]: StandardScaler().fit_transform(df['a'])
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\utils\validation.py:429: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
Out[28]: array([-0.5 ,  0.75, -1.75,  0.75,  0.75])
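For reference, there are several equivalent ways to get the 2D column shape that sklearn wants; a quick sketch with plain numpy:

```python
import numpy as np

a = np.arange(5)            # 1D array, shape (5,)

# Three equivalent ways to get a (5, 1) column:
col1 = a[:, np.newaxis]     # insert a new axis
col2 = a[:, None]           # np.newaxis is just an alias for None
col3 = a.reshape(-1, 1)     # -1 lets numpy infer the number of rows

print(col1.shape, col2.shape, col3.shape)  # (5, 1) (5, 1) (5, 1)
```

With a DataFrame, selecting with a list of columns (`df[['a']]`) gives the same 2D result directly, as shown above.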
See also questions close to this topic

How to read an excel file directly from a Server with Python
Scenario: I am trying to read an Excel file from a server folder, then read each worksheet of that file into a DataFrame and perform some operations.
Issue: I have tried multiple approaches but keep hitting different problems: either the file is read but treated as a str (so the operations cannot be performed), or the file is not read at all.
What I tried so far:
# first attempt
os.path(r'\\X\str\Db\C\Source\selection\Date\Test', 'r')

# second attempt
directory = os.getcwd() + "\\C\\Source\\selection\\Date\\Test"

# third attempt
f = os.getcwd() + "\\C\\Source\\selection\\Date\\Test\\12.xlsx"

# fourth attempt
f = open(r'\\X\str\Db\C\Source\selection\Date\Test\12.xlsx', 'r')

db1 = pd.DataFrame()
db2 = pd.DataFrame()
db3 = pd.DataFrame()
bte = pd.DataFrame()
fnl = pd.DataFrame()

wb = load_workbook(f)
for sheet in wb.worksheets:
    if sheet.title == "db1":
        db1 = pd.read_excel(f, "db1")
Obs: I also researched the pandas reading documentation and some similar questions on SO, but still could not solve this problem. Ex: "Python: how to read path file/folder from server", "Using Python, how can I access a shared folder on windows network?", https://docs.python.org/release/2.5.2/tut/node9.html#SECTION009200000000000000000
Question: What is the proper way to achieve this?
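One pattern that usually avoids both problems is to hand pandas the path itself (not a file object opened in text mode) and let it enumerate the worksheets; a minimal sketch, assuming openpyxl is installed and the UNC share is reachable:

```python
import pandas as pd

def read_all_sheets(path):
    """Return a dict {sheet name -> DataFrame} with every worksheet.

    Passing the path lets pandas open the file in binary mode itself,
    which avoids the 'read as str' problem described above.
    """
    return pd.read_excel(path, sheet_name=None)

# Usage with the UNC path from the question:
# sheets = read_all_sheets(r'\\X\str\Db\C\Source\selection\Date\Test\12.xlsx')
# db1 = sheets.get('db1')  # None if there is no 'db1' worksheet
```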

Extracting the max, min or std from a DF for a particular column in pandas
I have a df with columns X1, Y1, Z3; df.describe() shows the stats for each column.
I would like to extract the min, max and std for, say, column Z3. df[df.z3].idxmax() doesn't seem to work.
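For a single column these are plain Series reductions; a small sketch (the toy values are made up):

```python
import pandas as pd

df = pd.DataFrame({'X1': [1, 2, 3], 'Y1': [4, 5, 6], 'Z3': [7, 2, 9]})

z3_min = df['Z3'].min()         # 2
z3_max = df['Z3'].max()         # 9
z3_std = df['Z3'].std()         # sample std (ddof=1)
row_of_max = df['Z3'].idxmax()  # index label of the max row

# or all three at once, as a Series:
stats = df['Z3'].agg(['min', 'max', 'std'])
```

`df[df.z3].idxmax()` fails because `df[...]` tries to use the column's values as column labels; call `idxmax` on the Series itself instead.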

GeoPandas plot function not working
I have a shapefile that looks like this on mapshaper:
But when I tried to plot it in pandas with the following code
police = gpd.read_file('srilanka_policestations')
police.plot()
jupyter notebook gives me an error message saying "AttributeError: 'str' object has no attribute 'type'".
I'm not sure what's wrong. I tried to plot the GeoPandas dataset "naturalearth_cities", and it works fine. See below:
The geodataframe reads fine in pandas, but it wouldn't plot:
Any help is much, much appreciated. Thank you all!

Python flatten array inside numpy array
I have a pretty stupid question, but for some reason, I just can't figure out what to do. I have a multidimensional numpy array, that should have the following shape:
(345138, 30, 300)
However, it actually has this shape:
(345138, 1)
Inside each one-element entry is an array with the shape
(30, 300)
So how do I "move" the inside array, so that the shape is correct?
At the moment it looks like this:
[[array([[0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],
         ...,
         [0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0]], dtype=int32)]
 [array([[0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],
         ...,
but I want this without the array(...), dtype=int32 wrappers, with the inner arrays merged into the outer one so that the shape is (345138, 30, 300) and it looks like this:
[[[0, 0, 0, ..., 0, 0, 0],
  [0, 0, 0, ..., 0, 0, 0],
  [0, 0, 0, ..., 0, 0, 0],
  ...,
  [0, 0, 0, ..., 0, 0, 0],
  [0, 0, 0, ..., 0, 0, 0],
  [0, 0, 0, ..., 0, 0, 0]],
 [[0, 0, 0, ..., 0, 0, 0],
  [0, 0, 0, ..., 0, 0, 0],
  [0, 0, 0, ..., 0, 0, 0],
  ...,
Any ideas?
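Assuming the outer array has dtype object (each (row, 0) cell holding a (30, 300) array), np.stack over the first column builds exactly that 3D array; a sketch with shrunken sizes:

```python
import numpy as np

# Simulate the nested layout: an (N, 1) object array whose cells
# each hold a (30, 300) int32 array (N shrunk to 4 for the demo).
inner = [np.zeros((30, 300), dtype=np.int32) for _ in range(4)]
nested = np.empty((4, 1), dtype=object)
for i, arr in enumerate(inner):
    nested[i, 0] = arr

# np.stack iterates over the per-row arrays and stacks them along
# a new leading axis, giving a proper 3D array:
flat = np.stack(nested[:, 0])
print(flat.shape)  # (4, 30, 300)
```

For the real data the same call should give (345138, 30, 300), provided every inner array really has shape (30, 300).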

Creating a numpy array for .imshow plotting for any curve
I am trying to create a numpy array out of two vectors, X and Y. These are unequally spaced coordinates (a geological horizon, if that matters). To be more precise, X is offset and Y is depth.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = np.linspace(0, 100, 100)
y = np.linspace(0, 100, 100)
If I plot it with
plt.plot(x,y)
or plt.scatter(x, y),
it works fine (plt.plot output), but what I need is to create a numpy array out of it, so I can use plt.imshow to plot it. I need it for any type of line given by x, y coordinates, not necessarily linearly spaced. The line should be shown in a single color. Please, help me out.
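One way to get an imshow-able array from arbitrary x, y coordinates is to bin the points onto a 2D grid with np.histogram2d, so the nonzero cells trace the curve in a single color; a minimal sketch (grid size is an arbitrary choice):

```python
import numpy as np

x = np.linspace(0, 100, 100)
y = np.linspace(0, 100, 100)

# Bin the (x, y) points onto a 200x200 grid; nonzero cells lie on the curve.
grid, xedges, yedges = np.histogram2d(x, y, bins=200)
img = (grid > 0).astype(float).T  # transpose so rows correspond to y

# import matplotlib.pyplot as plt
# plt.imshow(img, origin='lower', cmap='gray',
#            extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
```

The same works for any x, y arrays, evenly spaced or not; denser curves may need more bins or a line-drawing step between consecutive points.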

How to get interpolated array values in numpy / scipy
I was wondering how I could get the interpolated value of a 3D array. I am trying to get the value at, for example, position (1.4, 2.3, 4.2) of a 3D array. How can I get the interpolated value?
counterX = 1.5
counterY = 1.5
counterZ = 1.5
for x in range(0, length):
    for y in range(0, length):
        for z in range(0, length):
            value = img[counterX, counterY, counterZ]
        counterZ = 0
    counterY = 0
counterX, counterY and counterZ are float values rather than integers. However, I cannot cast them with int(...), since my results need to be very exact. Therefore I thought interpolation would be the best solution.
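Rather than hand-rolling it, scipy.interpolate.RegularGridInterpolator does (by default trilinear) interpolation at fractional positions; a sketch on a toy volume (img and its size are stand-ins for your data):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

length = 8
# Toy volume: the value at (x, y, z) is x + y + z, so the linear
# interpolation at any point is easy to check by hand.
xs = ys = zs = np.arange(length)
X, Y, Z = np.meshgrid(xs, ys, zs, indexing='ij')
img = (X + Y + Z).astype(float)

interp = RegularGridInterpolator((xs, ys, zs), img)
value = interp([1.4, 2.3, 4.2])[0]  # trilinear by default
print(value)  # 7.9 for this linear volume
```

The interpolator accepts an (M, 3) array of query points, so all fractional counter positions can be evaluated in one vectorized call instead of a triple loop.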

What's the best programming language for machine learning?
Which language offers the best tools and libraries for machine learning?

Can Linear SVM be used to build a lexicon for a specific set of categories?
I want to build a lexicon for a specific set of educational categories. For example: Math, Medical and Economics. So I've manually collected 150 websites per category.
I used the Python library scikit-learn to implement the algorithm. I basically just followed the tutorial from scikit-learn, but instead of using their data, I used the websites that I collected. This is a supervised task, so after preprocessing (removing stop words, using TfidfVectorizer), I fed it to the LinearSVC model of scikit-learn.
What I realized is that LinearSVC weighted each word in the input vector for each category. Here is an example of what it looks like:
So my question is, can this be considered a lexicon for Math, Medical and Economics?

PyTorch: How to get around the RuntimeError: inplace operations can be only used on variables that don't share storage with any other variables
With PyTorch I'm having a problem doing an operation with two Variables:
sub_patch  : [torch.FloatTensor of size 9x9x32]
pred_patch : [torch.FloatTensor of size 5x5x32]
sub_patch is a Variable made by torch.zeros. pred_patch is a Variable of which I index each of the 25 nodes with a nested for-loop, and each node I multiply with its corresponding unique filter (sub_filt_patch) of size [5, 5, 32]. The result is added to its respective place in sub_patch.
This is a piece of my code:
for i in range(filter_sz):
    for j in range(filter_sz):
        # index correct filter from filter tensor
        sub_filt_col = (patch_col + j) * filter_sz
        sub_filt_row = (patch_row + i) * filter_sz
        sub_filt_patch = sub_filt[sub_filt_row:(sub_filt_row + filter_sz),
                                  sub_filt_col:(sub_filt_col + filter_sz), :]

        # multiply filter and pred_patch and sum onto sub patch
        sub_patch[i:(i + filter_sz), j:(j + filter_sz), :] += (sub_filt_patch * pred_patch[i, j]).sum(dim=3)
The error I get from the bottom line of the piece of code here is
RuntimeError: inplace operations can be only used on variables that don't share storage with any other variables, but detected that there are 2 objects sharing it
I get why it happens, since sub_patch is a Variable, and pred_patch is a Variable too, but how can I get around this error? Any help would be greatly appreciated!
Thank you!

How to plot loss vs epoch in Sklearn
I'm using a random forest classifier from the sklearn library. Could someone help me figure out how to save the loss vs epoch number during training?
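Note that a random forest is not trained in epochs (every tree fits independently on a bootstrap sample), so sklearn records no per-epoch loss for it. If a loss-versus-iterations curve is what you are after, gradient boosting exposes one through staged predictions; a hedged sketch that uses training log loss per boosting stage as the "loss per epoch" (the dataset is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=200, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Training-set loss after each of the 50 boosting stages:
losses = [log_loss(y, proba) for proba in clf.staged_predict_proba(X)]

# import matplotlib.pyplot as plt
# plt.plot(range(1, len(losses) + 1), losses)
# plt.xlabel('boosting stage'); plt.ylabel('log loss')
```

Computing the same losses on a held-out set instead of X would give the validation curve, which is usually the more useful one.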

Sklearn DBscan Cannot Fit CSR Sparse Data
I have some sparse data which I transform into CSR sparse vector:
from scipy.sparse import coo_matrix

# maximum index of the news in the data
num_news = indexed.agg(max(indexed["newsIndex"])).take(1)[0][0] + 1

def get_matrix(news):
    row = [0 for i in news]
    data = [1 for i in news]
    return coo_matrix((data, (row, news)), shape=(1, num_news)).tocsr()

d['feature'] = d['newsArr'].apply(get_matrix)
Then I show it using d.head():

              uuid                                            newsArr                                           feature
0  014324000050581                                     [300.0, 274.0]                       (0, 274)\t1\n  (0, 300)\t1
1  014379002854034  [3539.0, 1720.0, 402.0, 1787.0, 2854.0, 2500.0...  (0, 402)\t1\n  (0, 492)\t1\n  (0, 493)\t1\n ...
2  014379004874618                                            [346.0]                                      (0, 346)\t1
3  014379004904357   [592.0, 1586.0, 20.0, 4165.0, 19.0, 165.0, 12.0]  (0, 12)\t1\n  (0, 19)\t1\n  (0, 20)\t1\n  (0...
4  014379004920072                        [1658.0, 283.0, 7.0, 492.0]  (0, 7)\t1\n  (0, 283)\t1\n  (0, 492)\t1\n  (...
The output of d['feature'][:1].tolist() is the following:

[<1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>]
Then I want to use DBscan:
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=10).fit_predict(d['feature'])
However, I receive the following error:
ValueError: setting an array element with a sequence.
I believe this is not reasonable, since each of my vectors is 1 x num_news. Then I try to use tolist():

db = DBSCAN(eps=0.3, min_samples=10).fit_predict(d['feature'].tolist())
The following error pops up:
ValueError: Expected 2D array, got 1D array instead:
array=[ <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>
 <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse Row format>
 <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>
 ...,
 <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
 <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>
 <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 15 stored elements in Compressed Sparse Row format>].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I know that sklearn can take a CSR sparse matrix as input; how can I do that?

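sklearn does accept a CSR matrix, but as one (n_samples x n_features) matrix, not as a pandas column of 1-row matrices. Stacking the rows with scipy.sparse.vstack before fitting avoids both errors; a sketch with stand-in data (in your case, X = vstack(d['feature'].tolist())):

```python
from scipy.sparse import csr_matrix, vstack
from sklearn.cluster import DBSCAN

# Stand-ins for d['feature']: one 1 x num_news CSR row per sample.
num_news = 1000
rows = [csr_matrix(([1, 1], ([0, 0], [3, 70])), shape=(1, num_news)),
        csr_matrix(([1], ([0], [3])), shape=(1, num_news)),
        csr_matrix(([1], ([0], [999])), shape=(1, num_news))]

X = vstack(rows).tocsr()   # one 3 x num_news sparse matrix
labels = DBSCAN(eps=0.3, min_samples=2).fit_predict(X)
```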
How to plot multiple data sets one by one from one single file in python?
I have a couple hundred data sets in one file that are separated by two lines of text:
Text line 1
Text line 2
1 3 5
2 4 7
3 7 9
4 8 10
5 2 4
Text line 1
Text line 2
1 2 2
1 9 7
3 7 9
3 3 0
5 6 4
3 5 9
I'd like to fit the values from two of the columns of each set separately and have the text that precedes the data written on the plot as well (to know the properties of the model). As there are so many sets, it would take too much time to do them manually. My current solution is to hard-code from where to where each set should be plotted, but the number of lines per set is not the same, so that also takes a lot of time.
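Since the separators are text (non-numeric) lines, the file can be split generically by trying to parse each line as numbers, with no hard-coded line ranges; a sketch of the parsing (plotting left as comments):

```python
import io

def parse_blocks(fh):
    """Yield (header_lines, rows) for each data set in the file."""
    header, rows = [], []
    for line in fh:
        line = line.strip()
        if not line:
            continue
        try:
            rows.append([float(v) for v in line.split()])
        except ValueError:             # a text line starts a new block
            if rows:
                yield header, rows
                header, rows = [], []
            header.append(line)
    if rows:
        yield header, rows

sample = """Text line 1
Text line 2
1 3 5
2 4 7
Text line 1
Text line 2
1 2 2
1 9 7
3 7 9
"""
blocks = list(parse_blocks(io.StringIO(sample)))
# for header, rows in blocks:
#     xs = [r[0] for r in rows]; ys = [r[1] for r in rows]
#     plt.plot(xs, ys, label=' '.join(header))   # fit/plot per data set
```

For the real file, pass `open('data.txt')` instead of the StringIO; each yielded block carries its own header text for the plot annotation.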

How does a classification algorithm work if a user is tagged with more than one class of the predictor variable?
I am new to data analytics.
I have data set where I need to predict what is the best and most relevant product group for the user.
Product groups are a set of values, hence I decided to use a classification method.
Now the problem is how to prepare the training data set: in our data model one user is labeled with more than one product group, i.e. an array of product groups. How do I prepare the data set in this case?
I have a user table and another table with details of the products purchased by each user. User to order is a 1-to-many relationship, and a product belongs to multiple product groups. The goal is to predict the single most relevant product group, but a user is tagged with multiple classes (as explained above, a user purchases products from more than one group). When preparing the training data as a CSV file, do I need multiple rows for the same user, one per product group?
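With each user belonging to several product groups this is a multi-label problem, so rather than duplicating user rows you can encode the group array with MultiLabelBinarizer and train a one-vs-rest classifier; a sketch on made-up data (features, group names and the choice of LogisticRegression are all assumptions):

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# One row per user; y is the *array* of product groups per user.
X = np.array([[25, 3], [40, 1], [33, 7], [52, 2]])   # e.g. age, order count
y = [['books', 'toys'], ['books'], ['garden'], ['toys', 'garden']]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)        # shape (4, 3): one indicator column per group

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
probs = clf.predict_proba(X)    # per-group relevance scores
best = mlb.classes_[probs.argmax(axis=1)]  # single most relevant group per user
```

Taking the argmax of the per-group scores gives exactly the "one most relevant group" prediction, while training still uses every group a user belongs to.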

Default parameter on python function not always working
I'm reading Programming Collective Intelligence and writing some of the code in a more pythonic way than it's written in the book, just for the sake of learning.
The first chapter is about recommendation systems. Based on the next dictionary, some similarity measures are proposed.
critics = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
                  'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5,
                     'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
    'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
                         'Superman Returns': 3.5, 'The Night Listener': 4.0},
    'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'The Night Listener': 4.5,
                     'Superman Returns': 4.0, 'You, Me and Dupree': 2.5},
    'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'Just My Luck': 2.0,
                     'Superman Returns': 3.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 2.0},
    'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'The Night Listener': 3.0,
                      'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
    'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0, 'Superman Returns': 4.0}}
Given that unique_pairs is a list of tuples containing the different possible pairs of people,
unique_pairs = list(itertools.combinations(people, 2))

unique_pairs
[('Michael Phillips', 'Mick LaSalle'),
 ('Michael Phillips', 'Lisa Rose'),
 ('Michael Phillips', 'Toby'),
 ('Michael Phillips', 'Jack Matthews'),
 ('Michael Phillips', 'Gene Seymour'),
 ('Michael Phillips', 'Claudia Puig'),
 ('Mick LaSalle', 'Lisa Rose'),
 ('Mick LaSalle', 'Toby'),
 ('Mick LaSalle', 'Jack Matthews'),
 ('Mick LaSalle', 'Gene Seymour'),
 ('Mick LaSalle', 'Claudia Puig'),
 ('Lisa Rose', 'Toby'),
 ('Lisa Rose', 'Jack Matthews'),
 ('Lisa Rose', 'Gene Seymour'),
 ('Lisa Rose', 'Claudia Puig'),
 ('Toby', 'Jack Matthews'),
 ('Toby', 'Gene Seymour'),
 ('Toby', 'Claudia Puig'),
 ('Jack Matthews', 'Gene Seymour'),
 ('Jack Matthews', 'Claudia Puig'),
 ('Gene Seymour', 'Claudia Puig')]
I tried to improve the Pearson Correlation similarity function suggested in the book by adding a pvalue to the result of the function, only outputted if the parameter p_value of the function is true. The function is defined this way:
def sim_pearson(prefs, p1, p2, p_value=False):
    """Returns the pearson correlation coefficient and the p-value (optional)
    of the ratings of the movies that both p1 and p2 have rated"""
    # Creates a list with the movies that both p1 and p2 have rated
    movies = [movie for movie in prefs[p1] if movie in prefs[p2]]

    # List of the scores that both p1 and p2 have given to the movies in common
    scores_p1 = [prefs[p1][movie] for movie in movies]
    scores_p2 = [prefs[p2][movie] for movie in movies]

    corr, p_value = scipy.stats.pearsonr(scores_p1, scores_p2)

    if p_value:
        return (corr, p_value)
    else:
        return corr
My problem is that the function doesn't work as expected: it doesn't return the (correlation coefficient, p-value) tuple every time p_value is True, and it produces the same results when p_value is True as when it is False. Why is this happening and how could I fix it?
Here is a list with the result of applying the function to each possible pair of people, to show what I mean. The result is the same with p_value=True as with p_value=False; I'll just paste the former case.
pearson_results = [(pair[0][:5], pair[1][:5], sim_pearson(critics, pair[0], pair[1], p_value=True))
                   for pair in unique_pairs]

pearson_results
[('Micha', 'Mick ', (0.2581988897471611, 0.74180111025283857)),
 ('Micha', 'Lisa ', (0.40451991747794525, 0.59548008252205464)),
 ('Micha', 'Toby', 1.0),
 ('Micha', 'Jack ', (0.13483997249264842, 0.8651600275073511)),
 ('Micha', 'Gene ', (0.20459830184114206, 0.79540169815885797)),
 ('Micha', 'Claud', 1.0),
 ('Mick ', 'Lisa ', (0.59408852578600457, 0.21370636293028805)),
 ('Mick ', 'Toby', (0.92447345164190498, 0.24901011701138964)),
 ('Mick ', 'Jack ', (0.21128856368212914, 0.73299431171284912)),
 ('Mick ', 'Gene ', (0.41176470588235292, 0.41726032973743138)),
 ('Mick ', 'Claud', (0.56694670951384085, 0.3189317919127756)),
 ('Lisa ', 'Toby', (0.99124070716193036, 0.084323216321943714)),
 ('Lisa ', 'Jack ', (0.74701788083399601, 0.14681146067336839)),
 ('Lisa ', 'Gene ', (0.39605901719066977, 0.43697492654267506)),
 ('Lisa ', 'Claud', (0.56694670951384085, 0.3189317919127756)),
 ('Toby', 'Jack ', (0.66284898035987017, 0.53869426797895403)),
 ('Toby', 'Gene ', (0.38124642583151169, 0.75098988298861025)),
 ('Toby', 'Claud', (0.89340514744156441, 0.29661883133160016)),
 ('Jack ', 'Gene ', (0.96379568187563314, 0.0082243534847899202)),
 ('Jack ', 'Claud', (0.028571428571428571, 0.9714285714285712)),
 ('Gene ', 'Claud', (0.31497039417435602, 0.60570041941160946))]
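The flag "doesn't work" because the line corr, p_value = scipy.stats.pearsonr(...) overwrites the p_value parameter, so the following if p_value: tests the computed p-value (truthy unless it is exactly 0.0), not the flag you passed in. Renaming the local variable fixes it; a sketch:

```python
import scipy.stats

def sim_pearson(prefs, p1, p2, p_value=False):
    """Pearson correlation of the movies both p1 and p2 rated;
    also returns the p-value when p_value=True."""
    movies = [movie for movie in prefs[p1] if movie in prefs[p2]]
    scores_p1 = [prefs[p1][movie] for movie in movies]
    scores_p2 = [prefs[p2][movie] for movie in movies]

    # Distinct name: the parameter p_value keeps its caller-supplied value.
    corr, pval = scipy.stats.pearsonr(scores_p1, scores_p2)
    return (corr, pval) if p_value else corr
```

The scalar results in your list (e.g. ('Micha', 'Toby', 1.0)) are exactly the pairs where the shadowed p_value happened to be falsy, which is why the output looked inconsistent.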