BeautifulSoup: remove br in table and add full link

I am getting closer to extracting a table from a specific website, but I cannot figure out how to:

  1. display the full link to the download file
  2. remove the <br /> in certain rows

The HTML is as follows:

<table border="1" cellpadding="5" cellspacing="0">
<tr class="bg">
<td><strong>Reference</strong></td>
<td stytle="width:100px"><strong>Description</strong></td>
<td><strong>Download Documents</strong></td>
<td stytle="width:50px"><strong>Closing Date</strong></td>
<td stytle="width:50px"><strong>Contact Details</strong></td>
<td><strong>Briefing</strong></td>
<!--<td><strong>PUBLISHED</strong></td>-->
</tr>
<tr>
<td>123456</td>
<td>text 123</td>
<td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
<td>2 weeks</td>
<td>me<br />
  you</td>
<td>next week</td>
</tr>
<tr>
<td>123456</td>
<td>text 123</td>
<td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
<td>2 weeks</td>
<td>me<br />
  you</td>
<td>next week</td>
</tr>
<tr>
<td>123456</td>
<td>text 123</td>
<td><a href="/downloads/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
<td>2 weeks</td>
<td>me<br />
  you</td><td>next week</td>
</tr>
<tr>
<td>123456</td>
<td>text 123</td>
<td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
<td>2 weeks</td>
<td>me<br />
  you</td><td>next week</td>
</tr>
<tr>
<td>123456</td>
<td>text 123</td>
<td><a href="/downloads/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
<td>2 weeks</td>
<td>me</td>
<td>next week</td>
</tr>
<tr>
<td>123456</td>
<td>text 123</td>
<td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
<td>2 weeks</td>
<td>me</td>
<td>next week</td>
</tr>
</table>

I want the <br /> in the contact details removed, and the full link displayed instead of "Documents click here".

Please note that this is an example table, rebuilt from the original project.

My Python code works, except that it puts the content after the <br /> on a new line, so the entire output.csv is mixed up.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv
import requests
import os
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from bs4 import Tag 

testwebsite = 'https://example.com'

uClient = uReq(testwebsite)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

testwebsitetendersaved=""
# The table is badly formatted: it sits in a span tag, with tables within tables
testwebsite_container = page_soup.find("span", id="MainContent2_ctl00_lblContent").findAll("table")[1]

for record in testwebsite_container.findAll('tr'):
    testwebsitetender=""
    for data in record.findAll('td'):
        testwebsitetender=testwebsitetender+","+data.text
    testwebsitetendersaved = testwebsitetendersaved + "\n" + testwebsitetender[1:]


header="Tender Number, Description, Documents Link, Closing Date, Contact Details, Briefing"+"\n"
file = open(os.path.expanduser("output.csv"), "wb")
file.write(bytes(header, encoding="ascii",errors='ignore'))
file.write(bytes(testwebsitetendersaved, encoding="ascii",errors='ignore'))

print(testwebsitetendersaved)
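
The mixed-up rows can be reproduced without the website. This is a minimal sketch, assuming the text of a Contact Details cell comes back as "me", a newline, two spaces, then "you", which is what the markup above suggests. The shortened row and the variable names are only for illustration.

```python
# Text of a "Contact Details" cell as .text would return it for
# <td>me<br />\n  you</td>: the <br /> tag is gone, the newline after it stays.
cell_text = "me\n  you"

# Build one CSV row by joining the cells with commas, as the loop above does.
row = ",".join(["123456", "text 123", "2 weeks", cell_text, "next week"])

# The embedded newline splits what should be one record into two lines.
lines = row.split("\n")
print(lines)
# → ['123456,text 123,2 weeks,me', '  you,next week']
```

So the <br /> itself is not written to the file; the line break that follows it in the source is.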

1 answer

  • answered 2018-03-22 11:22 Junhee Shin

    I hope this is what you want.

    # The table is badly formatted: it sits in a span tag, with tables within tables
    testwebsite_container = page_soup.find("span", id="MainContent2_ctl00_lblContent").findAll("table")[1]

    header = "Tender Number, Description, Documents Link, Closing Date, Contact Details, Briefing\n"
    file = open(os.path.expanduser("output.csv"), "wb")
    file.write(bytes(header, encoding="ascii", errors='ignore'))

    # [1:] skips the first tr block, which holds the column headings
    for record in testwebsite_container.findAll('tr')[1:]:
        cells = record('td')
        tnum = cells[0].text
        desc = cells[1].text
        doclink = cells[2].text
        alink = cells[2].find("a")
        # Empty string when the cell has no link, so doclinkurl is always defined
        doclinkurl = (testwebsite + alink['href']) if alink else ''
        closingdate = cells[3].text
        detail = cells[4].text.replace('\n', '')
        brief = cells[5].text.replace('\n', '')
        print(tnum, desc, doclink, doclinkurl, closingdate, detail, brief)
        # Six fields to match the six header columns; the URL replaces the link text
        testwebsitetendersaved = "{},{},{},{},{},{}\n".format(tnum, desc, doclinkurl, closingdate, detail, brief)
        file.write(bytes(testwebsitetendersaved, encoding="ascii", errors='ignore'))
    file.close()
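
As an alternative to replacing '\n' by hand, BeautifulSoup's get_text accepts a separator and a strip flag. This is a minimal sketch, assuming the html.parser backend and one row copied from the example table in the question. The helper name cell_values is hypothetical.

```python
from bs4 import BeautifulSoup

# One row copied from the example table in the question.
ROW_HTML = """
<table><tr>
<td>123456</td>
<td>text 123</td>
<td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
<td>2 weeks</td>
<td>me<br />
  you</td>
<td>next week</td>
</tr></table>
"""


def cell_values(row):
    """Return the cell texts of one <tr>, with a link cell replaced by its href."""
    values = []
    for cell in row.find_all("td"):
        link = cell.find("a")
        if link is not None and link.has_attr("href"):
            values.append(link["href"])
        else:
            # separator=" " puts a space where <br /> split the text;
            # strip=True trims each piece, dropping the newline and indentation.
            values.append(cell.get_text(" ", strip=True))
    return values


row = BeautifulSoup(ROW_HTML, "html.parser").find("tr")
print(cell_values(row))
# → ['123456', 'text 123', '/downloads/linktofile.zip', '2 weeks', 'me you', 'next week']
```

The href is still relative here, so it needs the site URL in front of it, as the loop above does with testwebsite.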
    

    My output is:

    123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me  you next week
    123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me  you next week
    123456 text 123 Documents click here  https://example.com/downloads/downloads/linktofile.zip 2 weeks me  you next week
    123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me  you next week
    123456 text 123 Documents click here  https://example.com/downloads/downloads/linktofile.zip 2 weeks me next week
    123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me next week
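
Two details may still need care when writing these rows to output.csv. The URL is built by plain string concatenation, and a comma inside a cell would shift the columns of a row that is assembled by hand. This is a hedged sketch using only the standard library (urllib.parse.urljoin and csv.writer). BASE_URL, full_link and to_csv are illustrative names, and the "me, you" cell is made-up sample data.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "https://example.com"  # same placeholder as testwebsite in the question


def full_link(href, base=BASE_URL):
    """Resolve a relative href such as /downloads/linktofile.zip against the site URL."""
    return urljoin(base, href)


def to_csv(rows):
    """Serialise rows with csv.writer, which quotes cells containing commas or newlines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    ["Tender Number", "Documents Link", "Contact Details"],
    ["123456", full_link("/downloads/linktofile.zip"), "me, you"],
]
print(to_csv(rows))
# Tender Number,Documents Link,Contact Details
# 123456,https://example.com/downloads/linktofile.zip,"me, you"
```

urljoin also leaves an href that is already absolute unchanged, which simple concatenation would not.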