Numpy np.newaxis
saleprice_scaled = \
    StandardScaler().fit_transform(df_train['SalePrice'][:, np.newaxis]);
Can anyone please explain what's happening in this line? Why is newaxis being used here? I know what newaxis does in general, but I can't figure out its use in this particular situation.
Thanks in advance
1 answer

df_train['SalePrice']
is a pandas.Series (a vector, i.e. a 1D array) of shape (N,). Modern (0.17+) scikit-learn methods don't accept 1D arrays (vectors); they expect 2D arrays.
df_train['SalePrice'][:,np.newaxis]
transforms the 1D array (shape: (N,)) into a 2D array (shape: (N, 1), i.e. N rows, one column).
Demo:
In [21]: df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=list('abc'))

In [22]: df
Out[22]:
   a  b  c
0  4  3  8
1  7  5  6
2  1  3  9
3  7  5  7
4  7  0  6

In [23]: from sklearn.preprocessing import StandardScaler

In [24]: df['a'].shape
Out[24]: (5,)        # <- 1D array

In [25]: df['a'][:, np.newaxis].shape
Out[25]: (5, 1)      # <- 2D array
There is Pandas way to do the same:
In [26]: df[['a']].shape
Out[26]: (5, 1)      # <- 2D array

In [27]: StandardScaler().fit_transform(df[['a']])
Out[27]:
array([[-0.5 ],
       [ 0.75],
       [-1.75],
       [ 0.75],
       [ 0.75]])
What happens if we pass a 1D array:
In [28]: StandardScaler().fit_transform(df['a'])
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\utils\validation.py:429: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
Out[28]: array([-0.5 ,  0.75, -1.75,  0.75,  0.75])
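For completeness, np.newaxis, reshape(-1, 1), and the pandas to_frame() method all produce the same (N, 1) shape; a quick sketch using the same column as the demo above:

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, 1, 7, 7], name='a')   # same values as column 'a' above

col_newaxis = s.values[:, np.newaxis]   # (5, 1) via np.newaxis
col_reshape = s.values.reshape(-1, 1)   # (5, 1) via reshape; -1 infers the row count
col_frame   = s.to_frame()              # (5, 1) DataFrame, the pandas way
```

Any of the three can be passed to StandardScaler without triggering the 1D deprecation warning.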
See also questions close to this topic

Is it possible to remove / replace arrays with 0 samples from the data?
I have code that goes through CSV files and tries to make predictions from them. Whenever I run the program I hit an error about arrays with 0 samples; is it possible to replace or remove those arrays so the rest of the data can be processed?
This is the stack trace:
Traceback (most recent call last):
  File "main.py", line 138, in <module>
    stock_list = Analysis()
  File "p26.py", line 98, in Analysis
    X, y, Z = Build_Data_Set()
  File "p26.py", line 85, in Build_Data_Set
    X = preprocessing.scale(X)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 133, in scale
    dtype=FLOAT_DTYPES)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 431, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 35)) while a minimum of 1 is required by the scale function.
This is the code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, preprocessing
import pandas as pd
from matplotlib import style
import statistics
from collections import Counter

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(1)
    return df[indices_to_keep].astype(np.float64)

style.use("ggplot")

how_much_better = 5

FEATURES = ['DE Ratio', 'Trailing P/E', 'Price/Sales', 'Price/Book',
            'Profit Margin', 'Operating Margin', 'Return on Assets',
            'Return on Equity', 'Revenue Per Share', 'Market Cap',
            'Enterprise Value', 'Forward P/E', 'PEG Ratio',
            'Enterprise Value/Revenue', 'Enterprise Value/EBITDA',
            'Revenue', 'Gross Profit', 'EBITDA',
            'Net Income Avl to Common ', 'Diluted EPS', 'Earnings Growth',
            'Revenue Growth', 'Total Cash', 'Total Cash Per Share',
            'Total Debt', 'Current Ratio', 'Book Value Per Share',
            'Cash Flow', 'Beta', 'Held by Insiders', 'Held by Institutions',
            'Shares Short (as of', 'Short Ratio', 'Short % of Float',
            'Shares Short (prior ']

def Status_Calc(stock, sp500):
    difference = stock - sp500
    if difference > how_much_better:
        return 1
    else:
        return 0

def Build_Data_Set():
    # data_df = pd.DataFrame.from_csv("key_stats_acc_perf_WITH_NA_enhanced.csv")
    data_df = pd.DataFrame.from_csv("key_stats_acc_perf_NO_NA_enhanced.csv")

    # shuffle data:
    data_df = data_df.reindex(np.random.permutation(data_df.index))
    data_df = clean_dataset(data_df.replace("NaN", 0).replace("N/A", 0))
    # data_df = data_df.replace("NaN", -999).replace("N/A", -999)

    data_df["Status2"] = list(map(Status_Calc,
                                  data_df["stock_p_change"],
                                  data_df["sp500_p_change"]))

    X = np.array(data_df[FEATURES].values)  # .tolist())
    y = (data_df["Status2"]
         .replace("underperform", 0)
         .replace("outperform", 1)
         .values.tolist())
    X = preprocessing.scale(X)
    Z = np.array(data_df[["stock_p_change", "sp500_p_change"]])
    return X, y, Z

def Analysis():
    test_size = 1
    invest_amount = 10000  # dollars
    total_invests = 0
    if_market = 0
    if_strat = 0

    X, y, Z = Build_Data_Set()
    print(len(X))

    clf = svm.SVC(kernel="linear", C=1.0)
    clf.fit(X[:-test_size], y[:-test_size])  # train data

    correct_count = 0
    for x in range(1, test_size + 1):
        invest_return = 0
        market_return = 0
        if clf.predict(X[-x])[0] == y[-x]:  # test data
            correct_count += 1
        if clf.predict(X[-x])[0] == 1:
            invest_return = invest_amount + (invest_amount * (Z[-x][0] / 100.0))
            market_return = invest_amount + (invest_amount * (Z[-x][1] / 100.0))
            total_invests += 1
            if_market += market_return
            if_strat += invest_return

    data_df = pd.DataFrame.from_csv("forward_sample_NO_NA.csv")
    # data_df = pd.DataFrame.from_csv("forward_sample_WITH_NA.csv")
    data_df = clean_dataset(data_df.replace("NaN", 0).replace("N/A", 0))
    X = np.array(data_df[FEATURES].values)
    X = preprocessing.scale(X)
    Z = data_df["Ticker"].values.tolist()

    invest_list = []
    for i in range(len(X)):
        p = clf.predict(X[i])[0]
        if p == 1:
            # print(Z[i])
            invest_list.append(Z[i])

    # print(len(invest_list))
    # print(invest_list)
    return invest_list

# Analysis()
final_list = []
loops = 8

for x in range(loops):
    stock_list = Analysis()
    for e in stock_list:
        final_list.append(e)

x = Counter(final_list)
print('_' * 120)
for each in x:
    if x[each] > loops - (loops / 3):
        print(each)
I already wrote the clean_dataset function to replace NaN, infinity and overly long values, but I'm not sure how to deal with these empty arrays.
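One pragmatic option is to guard the scaling step so empty inputs are skipped instead of crashing. A minimal sketch (safe_scale is a hypothetical helper; the standardization here is a NumPy re-implementation for illustration, and in the real code the guard would wrap preprocessing.scale):

```python
import numpy as np

def safe_scale(X):
    """Skip scaling when there are no samples (the 'Found array with
    0 sample(s)' case); otherwise standardize each column."""
    X = np.asarray(X, dtype=np.float64)
    if X.shape[0] == 0:
        return X  # nothing to scale; the caller can skip this file
    std = X.std(axis=0)
    std[std == 0] = 1.0          # avoid division by zero for constant columns
    return (X - X.mean(axis=0)) / std
```

An alternative is to check `len(data_df) == 0` right after clean_dataset and `continue` to the next CSV file.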

How to merge two datasets by specific column in pandas
I'm playing around with the Kaggle dataset "European Soccer Database" and want to combine it with another FIFA18 dataset.
My problem is that the name column in these two datasets uses different formats.
For example: "lionel messi" in one dataset and in the other it is "L. Messi"
I would like to convert "L. Messi" to the lowercase version "lionel messi" for all rows in the dataset.
What would be the most intelligent way to go about this?
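One hedged sketch of an approach: since "L. Messi" cannot be expanded back to "lionel messi" from the string alone, a crude join key (first initial plus last token) can line the two formats up. All names below (normalize_name, match_key, the toy frames) are hypothetical, and real rosters would need extra care for collisions (e.g. two players sharing an initial and surname):

```python
import pandas as pd

def normalize_name(name):
    # lower-case and trim; abbreviations are left as-is
    return name.strip().lower()

def match_key(name):
    """Crude join key: first initial + last token, so both
    'lionel messi' and 'L. Messi' map to 'l messi'."""
    parts = normalize_name(name).replace('.', '').split()
    return parts[0][0] + ' ' + parts[-1]

df1 = pd.DataFrame({'name': ['lionel messi'], 'rating': [94]})
df2 = pd.DataFrame({'name': ['L. Messi'], 'club': ['Barcelona']})
for df in (df1, df2):
    df['key'] = df['name'].map(match_key)

merged = df1.merge(df2, on='key', suffixes=('_a', '_b'))
```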

How to index into a data frame using another data frame's indices?
I have a dataframe,
num_buys_per_day
        Date  count
0 2011-01-13      1
1 2011-02-02      1
2 2011-03-03      2
3 2011-06-03      1
4 2011-08-01      1
I have another data frame
commissions_buy
which I'll give a small subset of:

            num_orders
2011-01-10           0
2011-01-11           0
2011-01-12           0
2011-01-13           0
2011-01-14           0
2011-01-18           0
I want to apply the following command
commissions_buy.loc[num_buys_per_day.index, :] = num_buys_per_day.values * commission
where commission is a scalar. Note that all indices in num_buys_per_day exist in commissions_buy.
I get the following error:
TypeError: unsupported operand type(s) for *: 'Timestamp' and 'float'
How should I do the correct command?
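A likely cause: Date is a regular column in num_buys_per_day, so .values mixes Timestamps with counts, and the Timestamps get multiplied by the float commission. A sketch of one fix, assuming Date becomes the index so only the counts are multiplied (the frames below are toy reconstructions of the ones in the question):

```python
import pandas as pd

commission = 1.5  # hypothetical per-order commission

num_buys_per_day = pd.DataFrame({
    'Date': pd.to_datetime(['2011-01-13', '2011-02-02']),
    'count': [1, 1],
}).set_index('Date')  # index by date so .values holds only the counts

commissions_buy = pd.DataFrame(
    {'num_orders': 0.0},
    index=pd.to_datetime(['2011-01-12', '2011-01-13', '2011-02-02']),
)

# counts are now plain integers, so multiplying by a float works
commissions_buy.loc[num_buys_per_day.index, 'num_orders'] = \
    num_buys_per_day['count'].values * commission
```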

Reading binary data on bit level
I have a binary file in which the data is organised in 16 bit integer blocks like so:
- bit 15: digital bit 1
- bit 14: digital bit 2
- bits 13 to 0: 14-bit signed integer
The only way I found to extract the data from the file into 3 arrays is:
data = np.fromfile("test1.bin", dtype=np.uint16)
digbit1 = data >= 2**15
data = np.array([x - 2**15 if x >= 2**15 else x for x in data], dtype=np.uint16)
digbit2 = data >= 2**14
data = np.array([x - 2**14 if x >= 2**14 else x for x in data])
data = np.array([x - 2**14 if x >= 2**13 else x for x in data], dtype=np.int16)
Now I know that I could do the same with a for loop over the original data and fill out 3 separate arrays, but this would still be ugly. What I would like to know is how to do this more efficiently, in the style of
dtype=[('db', [('1', bit), ('2', bit)]), ('temp', 14-bit-signed-int)]
so that it would be easy to access, like data['db']['1'] being an array of ones and zeros.
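Even without a bit-level dtype, the three passes can be replaced by vectorized bitwise operations; a sketch on hypothetical words standing in for the result of np.fromfile:

```python
import numpy as np

# hypothetical raw 16-bit words standing in for np.fromfile(..., dtype=np.uint16)
data = np.array([0b1000000000000001,   # db1=1, db2=0, value=+1
                 0b0100000000000001,   # db1=0, db2=1, value=+1
                 0b0011111111111111],  # db1=0, db2=0, value=-1 (14-bit signed)
                dtype=np.uint16)

digbit1 = (data >> 15) & 1           # bit 15
digbit2 = (data >> 14) & 1           # bit 14
raw14 = data & 0x3FFF                # low 14 bits, still unsigned

# sign-extend the 14-bit value: subtract 2**14 when sign bit 13 is set
value = raw14.astype(np.int16) - ((raw14 >> 13) & 1).astype(np.int16) * (1 << 14)
```

This touches each array element only in compiled NumPy loops, so it avoids the Python-level list comprehensions entirely.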
Get SystemError: Parent module '' not loaded, cannot perform relative import when trying to import numpy in a Cython Extension
I have a cython extension inside a package which is structured like so:
packagename
├── MANIFEST.in
├── packagename
│   ├── __init__.py
│   ├── packagename.py
│   ├── subpackage1
│   │   ├── __init__.py
│   │   ├── subpackage1.py
│   │   └── cythonExt1.pyx
│   ├── subpackage2
│   │   ├── __init__.py
│   │   ├── subpackage2.py
│   │   └── cythonExt2.pyx
│   └── VERSION
├── requirements.txt
└── setup.py
When I try to add a line in cythonExt2.pyx which imports numpy I get the following error:
SystemError                               Traceback (most recent call last)
<ipython-input-5-5c4ef4d8efd3> in <module>()
      1 # Calling Functions of Interest
----> 2 import pacakage.subpackage2 as thingy
      3 import numpy as np
      4 import matplotlib.pyplot as plt

/home/user/anaconda2/envs/python3/lib/python3.5/site-packages/pacakage-0.0.3-py3.5-linux-x86_64.egg/package/__init__.py in <module>()
     18
     19 # the following line imports all the functions from package.py
---> 20 from .package import *
     21 import package.subpackage1
     22 import package.subpackage2

/home/user/anaconda2/envs/python3/lib/python3.5/site-packages/package-0.0.3-py3.5-linux-x86_64.egg/package/package.py in <module>()
      1 from package.subpackage1 import thingy1
----> 2 from package.subpackage2 import thingy2
      3 import numpy as _np
      4 from multiprocessing import Pool as _Pool

/home/user/anaconda2/envs/python3/lib/python3.5/site-packages/package-0.0.3-py3.5-linux-x86_64.egg/package/subpackage2/subpackage2.py in <module>()
      3 import os
----> 4 from cythonExt2 import solve as solve_cython
      5 from frange import frange
      6

/home/user/anaconda2/envs/python3/lib/python3.5/site-packages/package-0.0.3-py3.5-linux-x86_64.egg/package/subpackage2/cythonExt2.pyx in init package.subpackge2.cythonExt2 (package/subpackage2/cythonExt2.c:6158)()
      1 cimport numpy
----> 2 import numpy
      3 cimport cython
      4
      5 def get_z_n(n, z):

SystemError: Parent module '' not loaded, cannot perform relative import
If I just cimport numpy, this works and I have access to the NumPy C API, but I cannot import the NumPy Python functions, which I need to solve a particular problem.
Why is this and how might I fix it?
I'm wondering if there is an issue with my setup file that is causing this not to work. The Cython parts of my setup.py file look like this:
import numpy  # needed for numpy.get_include() below
from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize
from Cython.Build import build_ext

extensions = [
    Extension(
        name="cythonExt1",
        sources=["package/subpackage1/cythonExt1.pyx"],
        include_dirs=[numpy.get_include()],
    ),
    Extension(
        name="cythonExt2",
        sources=["package/subpackage2/cythonExt2.pyx"],
        include_dirs=[numpy.get_include()],
    ),
]

setup(name='package',
      ...
      include_package_data=True,
      packages=['package',
                'package.subpackage1',
                'package.subpackage2',
                ],
      ext_modules=cythonize(extensions),
      install_requires=requirements,
      )

Python - Addition of numpy arrays with different shapes
I have two numpy arrays of shapes (4, 1) and (4,) respectively. I want to add these two arrays in the conventional way to get an array with four elements. For example,
A = np.array([[3],[5],[4],[2]])
B = np.array([3,5,4,2])
I want C = A + B to be np.array([6, 10, 8, 4]) (preferably of shape (4,)). But when I add these two arrays (A and B), I get a 4x4 matrix.
print A+B
[[ 6.  8.  7.  5.]
 [ 8. 10.  9.  7.]
 [ 7.  9.  8.  6.]
 [ 5.  7.  6.  4.]]
What am I missing and how can I achieve the functionality that I am aiming for?
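For reference, NumPy broadcasts (4, 1) against (4,) to a (4, 4) grid of pairwise sums; flattening one operand gives the intended elementwise result. A small sketch:

```python
import numpy as np

A = np.array([[3], [5], [4], [2]])   # shape (4, 1), a column
B = np.array([3, 5, 4, 2])           # shape (4,), a flat vector

# (4, 1) + (4,) broadcasts to (4, 4); flatten A first for elementwise sums
C = A.ravel() + B                    # shape (4,)
# equivalently: (A + B[:, np.newaxis]).ravel()
```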

Empty factor levels were dropped for columns when using MLR package
I have a question here: when I try to use makeClassifTask from the MLR package to do an SVM, a warning says "Empty factor levels were dropped for columns". My code is:
install.packages("mlr")
library(mlr)
set.seed(1)
sample=sample(2,nrow(cleaned_caravan_train),replace=T)
train=cleaned_caravan_train[sample==1,]
test=cleaned_caravan_train[sample==2,]
makeClassifTask(data=train,target = "CARAVAN")
An example from the MLR package works very well:
install.packages("mlbench")
library(mlbench)
data("BostonHousing")
data("Ionosphere")
makeClassifTask(data=iris,target="Species")
I don't understand what the difference between these two is.

Dialogflow: selecting specific value for the action while training
Is it possible to select a specific value for the action in the training or intent tab?
For example, I have an entity PLACES and there are a lot of places in the city; I try to keep tons of synonyms for each.
Let's say there is a place called City Museum and synonyms are "museum, city museum, cit mus, meseum" and so on, with mistakes or other aliases.
Currently, I have to add them manually, as there is no way to select a specific value for the entity while training: I select the proper intent, then the entity, but Dialogflow creates a new value for words it doesn't yet know, rather than adding them to an existing value's synonym list.
Is there any way to do this?

what is wrong with my cosine similarity? Tensorflow
I want to use cosine similarity in my Neural network, instead of the standard dot product.
I've had a look at the dot product and at the cosine similarity.
In the example above they use
a = tf.placeholder(tf.float32, shape=[None], name="input_placeholder_a")
b = tf.placeholder(tf.float32, shape=[None], name="input_placeholder_b")
normalize_a = tf.nn.l2_normalize(a, 0)
normalize_b = tf.nn.l2_normalize(b, 0)
cos_similarity = tf.reduce_sum(tf.multiply(normalize_a, normalize_b))
sess = tf.Session()
cos_sim = sess.run(cos_similarity, feed_dict={a: [1, 2, 3], b: [2, 4, 6]})
However, I tried doing it my own way
x = tf.placeholder(tf.float32, [None, 3], name='x')    # input has 3 features
w1 = tf.placeholder(tf.float32, [10, 3], name='w1')    # 10 nodes in the first hidden layer
cos_sim = tf.divide(tf.matmul(x, w1),
                    tf.multiply(tf.norm(x), tf.norm(w1)))

with tf.Session() as sess:
    sess.run(cos_sim, feed_dict={x: np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                                 w1: np.random.uniform(0, 1, size=(10, 3))})
Is my way wrong? Also, what is going on in the matrix multiplication? Are we actually multiplying the weights of one node by the inputs of different samples (within one feature)?
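One observation worth checking: tf.norm with no axis argument returns a single global norm, while batch cosine similarity needs a separate norm per sample row and per node row. A NumPy sketch of the intended (samples x nodes) result, with toy values (the arrays here are illustrative, not from the question):

```python
import numpy as np

x = np.array([[1., 2., 3.],
              [4., 5., 6.]])           # 2 samples, 3 features each
w1 = np.array([[1., 0., 0.],
               [2., 4., 6.]])          # 2 "nodes", 3 weights each

# normalize each row independently, then take all pairwise dot products
x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
w_n = w1 / np.linalg.norm(w1, axis=1, keepdims=True)
cos_sim = x_n @ w_n.T   # shape (samples, nodes); [i, j] = cos(x_i, w1_j)
```

Note that row 1 of w1 is parallel to sample 0, so their cosine similarity is exactly 1; a global-norm version would not give that.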

Getting positional index out of bounds error
cross_val_score(model,trainX,targets,cv=5)
In scikit-learn I used the above line. I am getting a "positional index out of bounds" error. The shape of trainX is (891, 66) and targets is (891,).
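Without seeing the full code, a common cause of positional-index errors in cross-validation is a DataFrame whose index is no longer 0..N-1 (e.g. after dropna or shuffling); passing plain arrays, or resetting the index, sidesteps the mismatch. A hypothetical sketch:

```python
import numpy as np
import pandas as pd

trainX = pd.DataFrame(np.arange(12).reshape(4, 3))
trainX.index = [10, 20, 30, 40]   # non-default index, as left behind by dropna/shuffle

# plain arrays are indexed purely positionally, so CV splitting cannot go out of bounds
X = trainX.values                 # or: trainX.reset_index(drop=True)
y = np.array([0, 1, 0, 1])
```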

Time series regression - RandomForest
Apologies for the dumb questions - total n00b here.
Let's say I have the following dataset.
date,site,category,locale,type,rank,sessions,logins
01/01/2017,google.com,search,US,free,1,3393093,50000
01/01/2017,google.com,overall,US,free,1,3393093,50000
01/01/2017,yahoo.com,search,US,free,3,393093,40000
01/01/2017,yahoo.com,news,US,free,9,393093,40000
01/01/2017,yahoo.com,overall,US,free,23,393093,40000
01/01/2017,wsj.com,news,US,free,21,200000,180000
01/01/2017,wsj.com,news,US,subscription,21,200000,180000
01/01/2017,wsj.com,overall,US,free,93,200000,180000
where rank is the Alexa rank of that site. There are several categories possible (search, email, ecommerce etc) and the rank corresponds to the rank within that category.
I am trying to predict the number of sessions and logins a particular site/locale/rank would have for a particular day, essentially boiling this down to a multivariate time series regression problem and I am using sklearn's RandomForestRegressor.
Right now I don't treat this as a time series problem at all. For training, I remove the date and site columns; encode the category, locale and rank columns; use them as inputs; and train my model to predict sessions and logins. The results look decent, but I wanted to know:

How could this be converted into a proper time series prediction? I saw some examples by Jason Brownlee where the problem was reframed as a supervised learning problem, but this wouldn't work as I have potentially millions of rows of training data. I could group the training data by category/locale/type, sort by date and, for testing at day T for a particular category/locale/type combination, use data up to day T-1 for training, but this approach would be very expensive as there are potentially thousands of such category/locale/type combinations.
I've read about using moving averages to boost performance. Calculating the moving averages of sessions and logins in the training set would be trivial, but since these are dependent variables, how would I capture this in the test set?

Is there a better tool than RF for this task?
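On the moving-average point, one common trick is to build the feature from lagged values only, so each row sees just the past and the same recipe works at test time. A sketch with toy data (sessions_ma3 is a hypothetical feature name):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2017-01-01', periods=6, freq='D'),
    'sessions': [100, 120, 90, 110, 130, 95],
})

# 3-day moving average of *past* sessions: shift(1) ensures a row
# never includes its own target value in the feature
df['sessions_ma3'] = df['sessions'].shift(1).rolling(3).mean()
```

At prediction time for day T the same feature is computed from the observed sessions up to day T-1, so no future information leaks in.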

Sklearn: using TfidfVectorizer
I have data of class scipy.sparse.csr.csr_matrix, which looks like:

  (0, 55)    1
  (0, 54)    1
  (1, 55)    1
  (1, 54)    1
  (1, 55)    1
  (1, 54)    1
  (2, 945)   1
  (2, 945)   1
  (2, 950)   1
  ...
I need to transform it. First I tried sklearn.feature_extraction.text.TfidfTransformer, but it doesn't improve the roc_auc value. Next I tried:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=13, max_df=0.5, min_df=0.1, ngram_range=(1, 2))
data_tfidf = tfidf.fit_transform(data)
But it returns an error
AttributeError: lower not found
How can I fix that?
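For context: TfidfVectorizer expects raw text documents (it tries to call lower() on each one, hence "lower not found" when handed a sparse matrix), whereas TfidfTransformer accepts an existing count matrix. A minimal sketch of the transformer path:

```python
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# an already-vectorized term-count matrix, like the csr_matrix in the question
counts = csr_matrix([[1, 0, 1],
                     [0, 1, 1]])

# TfidfTransformer reweights counts; no raw strings needed
tfidf = TfidfTransformer().fit_transform(counts)
```

Parameters like max_features or ngram_range only make sense at the vectorizing stage, so they would have to be applied to the original text before the counts were built.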

Data Analysis and Visualisation Web Application
I am creating a web application for a project for Data Analysis & Visualisation.
I am just looking for advice on the best way to approach the project; I haven't started development yet, so any suggestions will be a great help.
Ideally I would like to have a clean user interface which is simple to use.
I would like the users to have the ability to upload a .csv file, and the web application would take the data from the .csv file, parse it, and display the data in graphs: bar charts, scatter plots, etc.
When the user has created a graph/chart, this should then be displayed on a dashboard which the user will be able to customise with different charts they have previously created.
At the minute I am thinking about using:
- JavaScript/HTML/CSS/jQuery for the UI
- Firebase for user authentication/login
- I have seen the Google Charts library, which has a cool range of charts/graphs available, but I am not sure how I would get the data from the .csv file to populate these charts
I am looking into things like Angular and D3.js at the minute.
I know that I have probably missed out a lot of the tools that I will need for this project so any help at all would be greatly appreciated.

spotfire multiple over statements in one custom expression
I have a table of travel expenses for analysis.
I would like to create a calculated column with a value for the maximum count of records with a certain category for each employee on any given day.
For example, if the category being reviewed is "dinner", we would like to know what is the maximum number of dinner transactions charged on any given day.
The following custom expression was able to count how many dinner expenses per employee:
count(If([Expense Type]="Dinner",[Expense Type],null)) over ([Employee])
But when trying to get the max count over days, I can't seem to get it to work. Here is the expression I used:
Max(count(If([Expense Type]="Dinner",[Expense Type],null)) over ([Employee])) over (Intersect([Employee],[Transaction Date]))
This seems to give the same answer as the first expression. Any idea how to get this expression to identify the value on the date with the most expenses for each employee?
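For what it's worth, the intended calculation can be sanity-checked outside Spotfire; a pandas sketch with toy data (count the Dinner records per employee per day, then take each employee's maximum):

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['A', 'A', 'A', 'B', 'B'],
    'Transaction Date': ['2017-01-01', '2017-01-01', '2017-01-02',
                         '2017-01-01', '2017-01-01'],
    'Expense Type': ['Dinner', 'Dinner', 'Dinner', 'Dinner', 'Lunch'],
})

dinners = df[df['Expense Type'] == 'Dinner']
per_day = dinners.groupby(['Employee', 'Transaction Date']).size()
max_per_employee = per_day.groupby('Employee').max()
```

In Spotfire terms, the inner count should be computed over Intersect([Employee],[Transaction Date]) first, and only then aggregated with Max over [Employee]; reversing that nesting reproduces the first expression.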

Machine learning regression spline
I understand degrees of freedom in other areas of statistics and regression, but when it comes to regression splines I don't understand what degrees of freedom means.