Extract columns from an HTML page

I am trying to parse information from an HTML page that contains a table laid out like this:

Column 1 | Column 2 | Column 3 ....

This is the code I have so far:

from bs4 import BeautifulSoup as BS
import urllib.request

# url holds the address of the page to parse (set elsewhere)
html = urllib.request.urlopen(url)
soup = BS(html, "lxml")

But I can't figure out how to extract a single column, say Column 1, from that HTML page and put it into a DataFrame in Python.

2 answers

  • answered 2018-03-22 18:57 krflol

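    I would recommend looking into pandas. Once you have your HTML in memory, you can try read_html, which returns a list of DataFrames, one per table found on the page:

    import pandas as pd

    # read_html returns a list; take the first table on the page
    tables = pd.read_html(myHtml)
    df = tables[0]
    

    It works pretty well.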
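
    To make the read_html route concrete, here is a minimal sketch that also pulls out one column by name. The sample_html string is a made-up stand-in for the downloaded page (its values are borrowed from the sample output elsewhere in this thread), and it assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: one small table.
sample_html = """
<table>
  <tr><th>Day</th><th>Month</th><th>Year</th><th>Close</th></tr>
  <tr><td>02</td><td>01</td><td>2013</td><td>35.360001</td></tr>
  <tr><td>03</td><td>01</td><td>2013</td><td>34.77</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
# Wrapping the string in StringIO passes it as a file-like object.
tables = pd.read_html(StringIO(sample_html))
df = tables[0]

# Select a single column by its header name; the result is a Series.
close = df["Close"]
print(close)
```

    Note that read_html tries to infer numeric types, so a cell such as 02 will likely come back as the integer 2 rather than the string "02".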

  • answered 2018-03-22 18:57 Ajax1234

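    You can scrape the table data and then add it to a DataFrame:

    from bs4 import BeautifulSoup as soup
    import urllib.request
    import pandas as pd

    # Download the page (in Python 3, urlopen lives in urllib.request)
    page_data = urllib.request.urlopen('http://mlg.ucd.ie/modules/COMP30760/stocks/tlsa.html').read()
    # Collect the text of every table cell, in document order
    final_data = [i.text for i in soup(page_data, 'html.parser').find_all('td')]
    # The table has 7 columns, so split the flat list into rows of 7
    last_data = [final_data[i:i+7] for i in range(0, len(final_data), 7)]
    # The first row holds the column names
    df = pd.DataFrame(last_data[1:], columns = last_data[0])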
    

    Output (sample)

         Day Month  Year        Open        High         Low       Close
    0     02    01  2013          35   35.450001   34.709999   35.360001
    1     03    01  2013       35.18   35.450001       34.75       34.77
    2     04    01  2013   34.799999   34.799999   33.919998   34.400002
    3     07    01  2013   34.799999   34.799999   33.900002       34.34
    4     08    01  2013        34.5        34.5   33.110001       33.68
    5     09    01  2013   34.009998   34.189999   33.400002   33.639999
    6     10    01  2013   33.869999   33.990002   33.380001   33.529999
    7     11    01  2013   34.040001   34.040001   32.110001       32.91
    8     14    01  2013   33.080002   33.380001   32.849998   33.259998
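
    If the number of columns is not known in advance, a variant of the approach above is to walk the table row by row instead of slicing a flat list into groups of 7. This is only a sketch: sample_html is a made-up stand-in for the real page (values taken from the sample output), and it assumes the first row holds the column names.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the downloaded page.
sample_html = """
<table>
  <tr><td>Day</td><td>Month</td><td>Year</td><td>Close</td></tr>
  <tr><td>02</td><td>01</td><td>2013</td><td>35.360001</td></tr>
  <tr><td>03</td><td>01</td><td>2013</td><td>34.77</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Build one list of cell texts per <tr>, so the row width is
# taken from the page rather than hard-coded.
rows = []
for tr in soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    if cells:
        rows.append(cells)

# Assumption: the first row is the header.
df = pd.DataFrame(rows[1:], columns=rows[0])

# Every value is still a string; convert the columns you need.
df["Close"] = pd.to_numeric(df["Close"])

# Extract the first column, by position or by name.
first_column = df.iloc[:, 0]
print(first_column)
```

    Selecting by position with df.iloc[:, 0] and by name with df["Day"] give the same Series here; use whichever is more stable for the page being scraped.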