Not a duplicate: How can I search for the first occurrence of a number less than a threshold in a 1D NumPy array?
This question was incorrectly marked as a duplicate.
I have an n x 1 numpy array. I want to find the first occurrence of an entry in the array that is less than a threshold.
My code is as follows:
import numpy as np
aa = np.array([4,3,5,7])
print(aa)
np.argmin(aa<3)
output:
[ 4 3 5 7]
0
I expect argmin to return 2 but I'm getting 0. How can I make this work?
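For reference, a common working approach (a sketch, not part of the original post): `aa < 3` is False for every element of `[4, 3, 5, 7]`, and `argmin` of an all-False boolean array returns the index of the first False, i.e. 0. To find the first index below a threshold you want the first True of the mask, with an explicit guard for the no-match case (the threshold of 4 below is hypothetical, chosen so at least one element matches):

```python
import numpy as np

aa = np.array([4, 3, 5, 7])
threshold = 4                       # hypothetical threshold with a match

mask = aa < threshold               # [False,  True, False, False]
hits = np.flatnonzero(mask)         # indices where the condition holds
first = int(hits[0]) if hits.size else None
print(first)                        # 1

# np.argmax(mask) also returns the first True, but it returns 0 for an
# all-False mask too, so guard it with mask.any() before trusting it.
```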
See also questions close to this topic

python: Not enough memory to perform factorization
I am using Python's scipy.sparse module to solve an eigenvalue problem. It is a very big sparse matrix, which ends up with a large memory requirement. The strange thing is that I am using a cluster with 256 GB of memory, which should definitely be enough for my problem, yet I get the not-enough-memory error below. I am wondering if anyone could give me a hint on how to work around this issue?
Not enough memory to perform factorization.
Traceback (most recent call last):
  File "init_Re620eta1_40X_2Z_omega10.py", line 158, in <module>
    exec_stabDiagBatchFFfollow_2D(geometry,baseFlowFolder,baseFlowVarb,baseFlowMethod,h,y_max_factor_EVP,Ny_EVP,Nz_EVP,x_p_stabDiag,x_p_orig,eigSolver,noEigs2solv,noEigs2save,SIGMA0,arnoldiTol,OmegaTol,disc_y,disc_z,y_i_factor_EVP,z_i_factor_EVP,periodicZ,BC_top,customComment,BETA,ALPHA_min,ALPHA_max,noALPHA,ALPHA_start,xp_start,u_0,nu_0,y_cut,ParallelFlowA,comm,rank,RESTART,saveJobStep,saveResultFormat)
  File "/lustre/cray/ws8/ws/iagyonwuRe620eta1/omega10DNS_wTS_icoPerbFoam/LST_functions_linstab2D_Temperal_mpi_multiTracking4.py", line 3893, in exec_stabDiagBatchFFfollow_2D
    OMEGA, eigVecs = sp.linalg.eigs(L0, k=noEigs2solv, sigma=SIGMA, v0=options_v0, tol=arnoldiTol)
  File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1288, in eigs
    symmetric=False, tol=tol)
  File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1046, in get_OPinv_matvec
    return SpLuInv(A.tocsc()).matvec
  File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 907, in __init__
    self.M_lu = splu(M)
  File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 267, in splu
    ilu=False, options=_options)
MemoryError
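The traceback shows the memory is exhausted inside splu: eigs with a sigma shift LU-factorizes (A - sigma*I), and for large 2D/3D operators that factorization can need far more memory than the matrix itself. One workaround is to supply the shift-invert solve yourself via the OPinv parameter, e.g. wrapping an iterative solver that needs no factorization. A tiny self-contained sketch (the diagonal stand-in matrix and sigma are made up; the real L0 would use gmres/bicgstab inside solve):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigs, LinearOperator

# Tiny diagonal stand-in for the huge matrix L0 in the question.
n = 200
diag = np.arange(1, n + 1, dtype=float)
A = sp.diags([diag], [0], format="csc")
sigma = 10.3

# eigs(..., sigma=...) normally calls splu on (A - sigma*I); that is the
# factorization that raises MemoryError.  Passing OPinv instead lets you
# provide the shift-invert solve yourself.  Here the matrix is diagonal,
# so the solve is exact and trivial; in practice this would be an
# iterative solver applied to (A - sigma*I).
shifted_diag = diag - sigma

def solve(b):
    return b / shifted_diag

OPinv = LinearOperator((n, n), matvec=solve, dtype=float)
vals, vecs = eigs(A, k=3, sigma=sigma, OPinv=OPinv)
print(np.sort(vals.real))   # the three eigenvalues nearest 10.3
```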

Python 2.7 Probability Histogram with Line of Best Fit Won't Plot
I am trying to plot a histogram from a long list of probabilities using this code:
# create a histogram showing the probability distribution for the triangles
n, bins, patches = plt.hist([prob_long],
                            bins=[0.36, 0.365, 0.37, 0.375, 0.38, 0.385, 0.39,
                                  0.395, 0.40, 0.405, 0.41, 0.415, 0.42, 0.425,
                                  0.43, 0.435, 0.44, 0.445, 0.45, 0.455, 0.46,
                                  0.465, 0.47, 0.475, 0.48, 0.485, 0.49, 0.495,
                                  0.5],
                            normed=True, facecolor='orange', alpha=0.75, ec='black')
# plot a line of best fit
(mu, sigma) = norm.fit(prob_long)
y = mlab.normpdf(bins, mu, sigma)
l = plt.plot(bins, y, 'r', linewidth=1)
plt.xlabel('Probabilities')
plt.ylabel('Frequency')
plt.title("""Histogram of the probability distribution of
the longest side of Pythagorean triples""")
plt.xlim([0.35, 0.50])
plt.grid(True)
plt.show()
I first used a set value for the bins, but it just gave me one bar on the histogram, so I entered the values myself. Now it gives me a variety of probabilities, but the frequencies are all at 200, for reasons unknown to me. Any help would be appreciated. Thanks. I have attached a picture of the histogram I get here.
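A likely explanation for the flat 200 (a guess, not from the original thread): with `normed=True` (now `density=True`) the bar heights are probability densities, not frequencies, and with 0.005-wide bins a bin holding essentially all the mass has height 1/0.005 = 200. A numpy-only sketch with stand-in data:

```python
import numpy as np

prob_long = np.full(1000, 0.432)          # stand-in: all mass in one bin

bins = np.arange(0.36, 0.5001, 0.005)     # same edges without typing them out
density, edges = np.histogram(prob_long, bins=bins, density=True)

# Densities integrate to 1 over the bins, so the single occupied
# 0.005-wide bin has height 1 / 0.005 = 200 -- the "frequency" of 200
# in the question is a density, not a count.
print(density.max())                      # approximately 200.0
```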

Linux Python - character after function output
I am trying to place a "%" after my function output, but I'm met with a syntax error.
print("Free memory:")
print(free_mem) + "% "))
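The stray `)`s and the attempt to concatenate onto `print`'s return value (which is `None`) are what trigger the error. A minimal sketch of the fix, with `free_mem` as a stand-in value (the question computes it elsewhere):

```python
free_mem = 512  # stand-in value; the question obtains this from a function

# Build the string first, then print it -- print() returns None, so
# `print(free_mem) + "%"` can never work.
msg = str(free_mem) + "%"
print("Free memory:")
print(msg)

# Or in one call:
print(f"Free memory: {free_mem}%")
```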

global variables across files and numpy behavior
I have three files: bar.py, foo.py, and main.py.

# bar.py
import numpy as np
global y
x = 0
y = 0
z = np.array([0])
# foo.py
from bar import *

def foo():
    x = 1
    y = 1
    z[0] = 1
# main.py
from foo import *
from bar import *

print(x, y, z)  # 0 0 [0]
foo()
print(x, y, z)  # 0 0 [1]
Question: Why did x and y not change their values, while z did change the value of its element? And how should I write this so that I can change the values of x and y in a way that is also accessible from other files?

Normally I'd never write in this fashion, but it was forced on me when translating an archaic FORTRAN77 program into Python. The original code heavily uses common blocks and includes, so I basically cannot trace the declarations of all variables. Still, I wanted to preserve the original style of the code, so I tried to make a "global variables module" whose variables can be modified from any part of the program.
Back to my question: my guess is that numpy.ndarray is just a pointer, and since we do not change the value of the pointer itself, z has changed. But even then the behavior of z seems very dangerous; I cannot trust z to be shared as a global variable with the same value across all files. Who can guarantee that z in main and foo point to the same memory?

Moreover, how can I make some variables truly global? When I tried to translate that FORTRAN program, I first tried making a class, creating an instance of it, and passing the instance through the function arguments, but then I realized that would require modifying the code tremendously.

What can I do?
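A sketch of why this happens and one way out (the `bar` module is faked in-process so the example runs as a single file): `x = 1` inside `foo()` creates a new local variable, and `from bar import *` copies the current values into the importer's namespace, so rebinding `x`/`y` is never seen elsewhere. `z[0] = 1` mutates the one shared array object that every copied reference points at, which is why it "works". Rebinding through the module object itself does propagate:

```python
import sys
import types

# Fake "bar" module so this runs as one file; in real code this is bar.py.
bar = types.ModuleType("bar")
bar.x, bar.y, bar.z = 0, 0, [0]    # the list stands in for np.array([0])
sys.modules["bar"] = bar

def foo():
    import bar                      # the one shared module object
    bar.x = 1                       # rebinds the attribute everyone reads
    bar.y = 1
    bar.z[0] = 1                    # in-place mutation, shared either way

import bar
foo()
print(bar.x, bar.y, bar.z)          # 1 1 [1]
```

The rule of thumb this illustrates: use `import bar` and `bar.x = ...` everywhere instead of `from bar import *`, and the module acts like a FORTRAN common block.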

x-axis in matplotlib is overcrowded
I have a csv file that I converted into a pandas dataframe to perform some analysis, but I am facing a problem.
The pandas dataframe looks like this:
print(df.head(10))
    cusid      date  value_limit
0   10173  20110612          455
1   95062  20110911          455
2  171081  20110705          212
3  122867  20110818          123
4  107186  20111123          334
5  171085  20110902          376
6  169767  20110703           34
7   80170  20110323           34
8  154178  20111002           34
9    3494  20110101           34
I am trying to plot the x-axis ticks as dates; the minimum date in the date column is 20110101 and the maximum is 20120420.
I tried something like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import matplotlib.dates as mdates

df = pd.read_csv('rio_data.csv', delimiter=',')
print(df.head(10))
date_freq = df.date.value_counts(dropna=False)
plt.plot(date_freq, marker='.', linestyle='none')
The plot looks like this; it is clear that the x-axis is really overpopulated.
I am trying to set dates as the matplotlib x-axis, but since I have a few hundred days it becomes overcrowded and I can't analyse it properly. I need the x-axis ticks in months instead of days, but I am new to matplotlib and spent a whole day on this without getting it right. Any help will really be appreciated.
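A sketch of the usual approach (column name from the question, the data values made up): parse the integer yyyymmdd dates with pd.to_datetime so matplotlib sees real dates, then put the tick marks on month boundaries with matplotlib.dates locators:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")              # headless backend, for this sketch only
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Stand-in for df.date (integer yyyymmdd values, as in the question).
df = pd.DataFrame({"date": [20110612, 20110911, 20110705, 20110101, 20120420]})
dates = pd.to_datetime(df["date"], format="%Y%m%d")
date_freq = dates.value_counts().sort_index()

fig, ax = plt.subplots()
ax.plot(date_freq.index, date_freq.values, marker=".", linestyle="none")

# One tick per month; interval=3 keeps ~16 months of data readable.
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=3))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))
fig.autofmt_xdate()                # rotate labels so they don't collide
```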

How to compute a Relative Error Reduction function in Python
As a beginner in Machine Learning I try to practice, but I am struggling to write a function that computes Relative Error Reduction with numpy and/or scikit-learn.
I want to compute the error rate reduction between predicted labels A and predicted labels B, with reference to the true labels T; in other words, based on which labels are predicted wrongly.
I have three inputs:
1D numpy array with predicted labels A
A = np.array(['red', 'red', 'red', 'red'])
1D numpy array with predicted labels B
B = np.array(['red', 'red', 'blue', 'red'])
1D numpy array with true labels T
T = np.array(['red', 'green', 'blue', 'red'])
I feel pretty silly that I cannot come up with a solution. Does someone know how to write such a function?
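One way to write it (a sketch; "relative error reduction" has no single canonical formula, so the definition below, (err_A - err_B) / err_A, is an assumption):

```python
import numpy as np

def relative_error_reduction(a, b, t):
    """Fraction of A's error rate that B eliminates, w.r.t. true labels t."""
    err_a = np.mean(a != t)        # error rate of predictor A
    err_b = np.mean(b != t)        # error rate of predictor B
    return (err_a - err_b) / err_a

A = np.array(['red', 'red', 'red', 'red'])
B = np.array(['red', 'red', 'blue', 'red'])
T = np.array(['red', 'green', 'blue', 'red'])

# A is wrong on 2 of 4 labels (0.5), B on 1 of 4 (0.25):
print(relative_error_reduction(A, B, T))   # 0.5
```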

return max distance with scipy.ndimage.distance_transform_edt
We know that the scipy.ndimage.distance_transform_edt function (documentation) computes the distance from nonzero (i.e. non-background) points to the nearest zero (i.e. background) point.
In my particular context, my aim is to compute the distance from nonzero points to the farthest zero point.
How can I do this with this function? I searched for a specific flag, but it seems that one doesn't exist, since a distance transform always returns the smallest distance.
Can anyone propose a trick to solve this issue?
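distance_transform_edt indeed has no such flag, but for modest image sizes a brute-force broadcast over all (nonzero, zero) point pairs gives the farthest-zero distance directly. A sketch (O(n*m) time and memory, so not suitable for very large images):

```python
import numpy as np

img = np.array([[0, 1, 1],
                [0, 1, 0],
                [0, 0, 0]])

nz = np.argwhere(img != 0).astype(float)   # foreground (nonzero) points
z = np.argwhere(img == 0).astype(float)    # background (zero) points

# Pairwise Euclidean distances via broadcasting, then, for each foreground
# point, the distance to its *farthest* background point.
d = np.sqrt(((nz[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
farthest = d.max(axis=1)
print(farthest)   # one value per foreground point (0,1), (0,2), (1,1)
```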

Why would SciPy's interp1d take over a minute to build an interpolator?
I'd like to quadratic- or cubic-spline interpolate a long series of floats (or vectors) in 1D, where long could be 1E+05 or 1E+06 points (or more). For some reason SciPy's handy interp1d()'s time overhead to prepare the interpolators scales as almost n^3 for both quadratic and cubic splines, taking over a minute for a few thousand points. According to comments here (a question I will delete; I'm keeping it there temporarily for comment access) it takes a millisecond on other computers, so something is obviously pathologically wrong here.
My installation is a bit old, but SciPy's interp1d() has been around for quite a while:
np.__version__     '1.13.0'
scipy.__version__  '0.17.0'
What can I do to try to figure out this incredible slowness of the interpolation?
import time
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d

times = []
for n in np.logspace(1, 3.5, 6).astype(int):
    x = np.arange(n, dtype=float)
    y = np.vstack((np.cos(x), np.sin(x)))
    start = time.clock()
    bob = interp1d(x, y, kind='quadratic', assume_sorted=True)
    times.append((n, time.clock() - start))

n, tim = zip(*times)

plt.figure()
plt.plot(n, tim)
plt.xscale('log')
plt.yscale('log')
plt.show()
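For what it's worth, the near-n^3 construction cost is consistent with the old spline code path in SciPy 0.17; later releases (0.19+) reportedly build `interp1d`'s quadratic/cubic interpolants from banded solves, so upgrading SciPy is the usual fix. A quick timing sanity check on a current install (a sketch):

```python
import time
import numpy as np
from scipy.interpolate import interp1d

n = 3000
x = np.arange(n, dtype=float)
y = np.vstack((np.cos(x), np.sin(x)))

start = time.perf_counter()
f = interp1d(x, y, kind='cubic', assume_sorted=True)
elapsed = time.perf_counter() - start
print(f"built interpolator for {n} points in {elapsed:.3f}s")

# Sanity check: the spline reproduces the data exactly at the nodes.
print(np.allclose(f(x), y))
```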

Converting an image matrix with decimal values imported from Matlab to an image, in Python
I have data in the form of a matrix that was imported from Matlab .mat format using scipy.io.loadmat(), and I have been trying to get an image out of this matrix of decimal values. I used the Image.fromarray() method from PIL with mode = 'L' and have not been getting the desired image. I have tried converting the image to uint8 with .astype(), but no luck there either.
What could possibly be wrong here? Thanks
A random sample of the matrix looks like:
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
 0.00000000e+00, 6.20939216e-18, 6.72618320e-04, 1.13151411e-02,
 3.54641066e-02, 3.88214912e-02, 3.71077412e-02, 1.33524928e-02,
 9.90964718e-04, 4.89176960e-05, 0.00000000e+00, 0.00000000e+00,
 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]
My snippet to display the image out of it :
matIn = spio.loadmat(r'ex3data1.m', squeeze_me=True)
X, y = matIn['X'], matIn['y']
data = X[0]
img = Image.fromarray(data[0].astype('uint8').reshape(20,20))
img.show()

Here X is a (5000, 400) matrix and data is a (400,) vector.
PS: I am currently implementing the exercises from Andrew Ng's ML course.
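A frequent cause, guessing from the sample values: the data lives in roughly [-0.04, 0.04], so `astype('uint8')` truncates nearly everything to 0 and the image comes out black. Rescaling to the full 0-255 range first usually fixes it (stand-in data below; the real values come from loadmat):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(-0.04, 0.04, 400)        # stand-in for one row of X

# Map [min, max] -> [0, 255] before the uint8 cast.
lo, hi = data.min(), data.max()
img8 = ((data - lo) / (hi - lo) * 255).astype(np.uint8).reshape(20, 20)
print(img8.dtype, img8.min(), img8.max())   # uint8 0 255

# Then: Image.fromarray(img8, mode='L').show()
# Note: reshape `data` itself, not `data[0]` -- data[0] is a single scalar.
```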

Why does increasing array alignment degrade performance?
I am trying to increase the alignment of an array in a synthetic test from 16 to 32, and performance degrades from ~4100 ms to ~4600 ms. How can higher alignment harm performance?
Below is the code I use for testing (I am trying to utilize AVX instructions here). Build with
g++ test.cpp -O2 -ftree-vectorize -mavx2
(I have no AVX-512 support.)

#include <chrono>
#include <iostream>
#include <memory>
#include <cassert>
#include <cstring>
#include <cstdlib>

using Time = std::chrono::time_point<std::chrono::system_clock>;
using Clock = std::chrono::system_clock;

template <typename Duration>
auto as_ms(Duration const& duration) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(duration);
}

static const int repeats = 10000;

struct I {
    static const int size = 524288;
    int* pos;
    I() : pos(new int[size]) {
        for (int i = 0; i != size; ++i) {
            pos[i] = i;
        }
    }
    ~I() { delete pos; }
};

static const int align = 16; // try to change here 16 (4100 ms) / 32 (4600 ms)

struct S {
    static const int size = I::size;
    alignas(align) float data[size];
    S() {
        for (int i = 0; i != size; ++i) {
            data[i] = (i * 7 + 11) % 2;
        }
    }
};

void foo(const I& p, S& a, S& b) {
    const int chunk = 32;
    alignas(align) float aprev[chunk];
    alignas(align) float anext[chunk];
    alignas(align) float bprev[chunk];
    alignas(align) float bnext[chunk];
    const int N = S::size / chunk;
    for (int j = 0; j != repeats; ++j) {
        for (int i = 1; i != N - 1; i++) {
            int ind = p.pos[i] * chunk;
            std::memcpy(aprev, &a.data[ind - 1], sizeof(float) * chunk);
            std::memcpy(anext, &a.data[ind + 1], sizeof(float) * chunk);
            std::memcpy(bprev, &b.data[ind - 1], sizeof(float) * chunk);
            std::memcpy(bnext, &b.data[ind + 1], sizeof(float) * chunk);
            for (int k = 0; k < chunk; ++k) {
                int ind0 = ind + k;
                a.data[ind0] = (b.data[ind0] - 1.0f) * aprev[k] * a.data[ind0] * bnext[k]
                             + a.data[ind0] * anext[k] * (bprev[k] - 1.0f);
            }
        }
    }
}

int main() {
    S a, b;
    I p;
    Time start = Clock::now();
    foo(p, a, b);
    Time end = Clock::now();
    std::cout << as_ms(end - start).count() << std::endl;
    float sum = 0;
    for (int i = 0; i != S::size; ++i) {
        sum += a.data[i];
    }
    return sum;
}
Checking whether the cache causes the problem:
valgrind --tool=cachegrind ./a.out
alignment = 16:
==4352== I refs:        3,905,614,100
==4352== I1 misses:             1,626
==4352== LLi misses:            1,579
==4352== I1 miss rate:          0.00%
==4352== LLi miss rate:         0.00%
==4352==
==4352== D refs:        2,049,454,623 (1,393,712,296 rd + 655,742,327 wr)
==4352== D1 misses:        66,707,929 (   66,606,998 rd +     100,931 wr)
==4352== LLd misses:       66,681,897 (   66,581,942 rd +      99,955 wr)
==4352== D1 miss rate:           3.3% (          4.8%   +        0.0%  )
==4352== LLd miss rate:          3.3% (          4.8%   +        0.0%  )
==4352==
==4352== LL refs:          66,709,555 (   66,608,624 rd +     100,931 wr)
==4352== LL misses:        66,683,476 (   66,583,521 rd +      99,955 wr)
==4352== LL miss rate:           1.1% (          1.3%   +        0.0%  )
alignment = 32:
==4426== I refs:        2,857,165,049
==4426== I1 misses:             1,604
==4426== LLi misses:            1,560
==4426== I1 miss rate:          0.00%
==4426== LLi miss rate:         0.00%
==4426==
==4426== D refs:        1,558,058,149 (967,779,295 rd + 590,278,854 wr)
==4426== D1 misses:        66,706,930 ( 66,605,998 rd +     100,932 wr)
==4426== LLd misses:       66,680,898 ( 66,580,942 rd +      99,956 wr)
==4426== D1 miss rate:           4.3% (        6.9%   +        0.0%  )
==4426== LLd miss rate:          4.3% (        6.9%   +        0.0%  )
==4426==
==4426== LL refs:          66,708,534 ( 66,607,602 rd +     100,932 wr)
==4426== LL misses:        66,682,458 ( 66,582,502 rd +      99,956 wr)
==4426== LL miss rate:           1.5% (        1.7%   +        0.0%  )
It seems like the problem is not in the cache.

Efficient sampling of factor variable from dataframe subsets
I have a dataframe df1 which contains 6 columns, two of which (var1 & var3) I am using to split df1 by, resulting in a list of dataframes ls1.
For each sub-dataframe x in ls1 I want to sample() x$var2, x$num times, with probabilities x$probs, as follows.
Create data:
var1 <- rep(LETTERS[seq(from = 1, to = 3)], each = 6)
var2 <- rep(LETTERS[seq(from = 1, to = 3)], 6)
var3 <- rep(1:2, 3, each = 3)
num <- rep(c(10, 11, 13, 8, 20, 5), each = 3)
probs <- round(runif(18), 2)
df1 <- as.data.frame(cbind(var1, var2, var3, num, probs))
ls1 <- split(df1, list(df1$var1, df1$var3))
Have a look at the first couple of list elements:
$A.1
  var1 var2 var3 num probs
1    A    A    1  10  0.06
2    A    B    1  10  0.27
3    A    C    1  10  0.23

$B.1
  var1 var2 var3 num probs
7    B    A    1  13  0.93
8    B    B    1  13  0.36
9    B    C    1  13  0.04
lapply over ls1:

ls1 <- lapply(ls1, function(x) {
  res <- table(sample(x$var2,
                      size = as.numeric(as.character(x$num)),
                      replace = TRUE,
                      prob = as.numeric(as.character(x$probs))))
  res <- as.data.frame(res)
  cbind(x, res = res$Freq)
})
df2 <- do.call("rbind", ls1)
df2
Have a look at the first couple list elements of the result:
$A.1
  var1 var2 var3 num probs res
1    A    A    1  10  0.06   2
2    A    B    1  10  0.27   4
3    A    C    1  10  0.23   4

$B.1
  var1 var2 var3 num probs res
7    B    A    1  13  0.93  10
8    B    B    1  13  0.36   3
9    B    C    1  13  0.04   0
So for each dataframe a new variable res is created; the sum of res equals num, and the elements of var2 are represented in res in proportions relating to probs. This does what I want, but it becomes very slow when there is a lot of data.
My question: is there a way to replace the lapply piece of code with something more efficient/faster?
I am just beginning to learn about vectorization and am guessing this could be vectorized, but I am unsure of how to achieve it. ls1 is eventually returned to a dataframe structure, so if it doesn't need to become a list to begin with, all the better (although it doesn't really matter how the data is structured for this step).
Any help would be much appreciated.

Python Pandas - mapping the values in two data frames
I have two dataframes, df1 and df2.
I am trying to figure out the best way to perform a mapping such that, for each row in df1, I search for a match on (id, time_by_hour) in df2 and then fill the corresponding value from df2 back into df1.
Below is df1_final as I would like it to look in the end.
Thank you in advance!
df1
Out[100]:
  id              time_by_min         time_by_hour  value
0  a  2017-06-30 01:25:00.000  2017-06-30 02:00:00    NaN
1  a  2017-06-30 01:36:32.308  2017-06-30 02:00:00    NaN
2  a  2017-06-30 02:25:00.000  2017-06-30 03:00:00    NaN
3  a  2017-06-30 02:36:32.308  2017-06-30 03:00:00    NaN
4  b  2017-06-30 01:25:00.000  2017-06-30 02:00:00    NaN
5  b  2017-06-30 01:36:32.308  2017-06-30 02:00:00    NaN
6  b  2017-06-30 02:25:00.000  2017-06-30 03:00:00    NaN

df2
Out[101]:
  id         time_by_hour  value
0  a  2017-06-30 02:00:00    100
1  a  2017-06-30 03:00:00    200
2  b  2017-06-30 02:00:00    150
3  b  2017-06-30 03:00:00     30
4  c  2017-06-30 02:00:00     80
5  c  2017-06-30 03:00:00    900

df1_final
Out[102]:
  id              time_by_min         time_by_hour  value
0  a  2017-06-30 01:25:00.000  2017-06-30 02:00:00    100
1  a  2017-06-30 01:36:32.308  2017-06-30 02:00:00    100
2  a  2017-06-30 02:25:00.000  2017-06-30 03:00:00    200
3  a  2017-06-30 02:36:32.308  2017-06-30 03:00:00    200
4  b  2017-06-30 01:25:00.000  2017-06-30 02:00:00    150
5  b  2017-06-30 01:36:32.308  2017-06-30 02:00:00    150
6  b  2017-06-30 02:25:00.000  2017-06-30 03:00:00     30
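For reference, the usual approach here (a sketch on a trimmed-down version of the frames) is a left merge on the (id, time_by_hour) pair; dropping df1's placeholder value column first avoids duplicate column names in the result:

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": ["a", "a", "b"],
    "time_by_hour": pd.to_datetime(["2017-06-30 02:00:00",
                                    "2017-06-30 03:00:00",
                                    "2017-06-30 02:00:00"]),
    "value": [float("nan")] * 3,
})
df2 = pd.DataFrame({
    "id": ["a", "a", "b", "c"],
    "time_by_hour": pd.to_datetime(["2017-06-30 02:00:00",
                                    "2017-06-30 03:00:00",
                                    "2017-06-30 02:00:00",
                                    "2017-06-30 02:00:00"]),
    "value": [100, 200, 150, 80],
})

# Left-merge fills df2's value into every matching df1 row; df2 rows with
# ids absent from df1 (like "c") simply don't match.
df1_final = df1.drop(columns="value").merge(
    df2, on=["id", "time_by_hour"], how="left")
print(df1_final["value"].tolist())   # [100, 200, 150]
```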