Iterating through HTML tags by class with BeautifulSoup

I'm saving some specific tags from a webpage to an Excel file, so I have this code:

`import requests
from bs4 import BeautifulSoup
import openpyxl

url = "http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")

wb = openpyxl.Workbook()
ws = wb.active

tagiterator = soup.h2

row, col = 1, 1
ws.cell(row=row, column=col, value=tagiterator.getText())
tagiterator = tagiterator.find_next()

while tagiterator.find_next():
    if tagiterator.name == 'h2':
        row += 1
        col = 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    elif tagiterator.name == 'span':
        col += 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    tagiterator = tagiterator.find_next()

wb.save('DG3test.xlsx')`

It works, but I want to exclude some tags. I want only the h2 tags that have the 'product-name' class and the span tags that have the 'attribute-value' class. I tried to do this with:

tagiterator['class'] == 'product-name'

tagiterator.hasClass('product-name')

tagiterator.get

And a few more, none of which worked.

The values I want are shown in this rough image I made: https://ibb.co/eWLsoQ (the URL of the page is in the code).

1 answer

  • answered 2017-06-17 18:26 Elvir Muslic

    I did not include writing to an Excel file; hopefully that's something you can do. If not, just leave a comment and I'll add the code for it. The same logic applies as in your code: write the product information, then increment the row and reset the column so that each product stays within its own row.

    import requests
    from bs4 import BeautifulSoup

    # Some sites reject requests that lack a browser-like User-Agent.
    header = {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}

    html = requests.get("http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml", headers=header).text
    soup = BeautifulSoup(html, 'lxml')  # requires the lxml package

    find_products = soup.find_all('div', {'class': 'product-row'})

    for item in find_products:
        # Title / name of the product
        title_text = item.find('div', {'class': 'product-header'}).h2.a.text.strip()
        # print(title_text)
        # First attribute value, for example "49 cali, Full HD, 1920 x 1080"
        display = item.find('span', {'class': 'attribute-value'}).text.strip()
        # print(display)
        # Second attribute value holds the functions ('Funkcje')
        functions_item = item.find_all('span', {'class': 'attribute-value'})[1]
        # Each function (e.g. Wi-Fi) is a link inside that span
        list_of_funcs = functions_item.find_all('a')
        # Now you can store them or process them further

        for funcs in list_of_funcs:
            print(funcs.text.strip())
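
If you would rather keep the single pass over h2 and span tags from the question, one option is to pass a filter function to `find_all` and do the class check there. BeautifulSoup stores `class` as a list of names (or nothing at all when the attribute is missing), so a membership test is needed rather than `==`. This is only a sketch: the HTML fragment, `WANTED`, `is_wanted` and `collect_rows` are made-up names for illustration, not the real page or part of any library.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the real page (an assumption).
SAMPLE_HTML = """
<div class="product-row">
  <h2 class="product-name"><a href="#">TV One</a></h2>
  <span class="attribute-name">Ekran:</span>
  <span class="attribute-value">49 cali, Full HD</span>
  <span class="attribute-value">Wi-Fi, USB</span>
</div>
<h2>Unrelated heading</h2>
<div class="product-row">
  <h2 class="product-name"><a href="#">TV Two</a></h2>
  <span class="attribute-value">55 cali, 4K</span>
</div>
"""

# Tag name -> class that the tag must carry to be kept.
WANTED = {"h2": "product-name", "span": "attribute-value"}


def is_wanted(tag):
    """Return True for h2.product-name and span.attribute-value tags."""
    wanted_class = WANTED.get(tag.name)
    # tag.get("class") is a list of class names, or None when absent.
    return wanted_class is not None and wanted_class in (tag.get("class") or [])


def collect_rows(html):
    """Group matching tags into rows: each h2 starts a new row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.find_all(is_wanted):  # matches come in document order
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            rows.append([text])
        elif rows:  # ignore spans that appear before the first product
            rows[-1].append(text)
    return rows


rows = collect_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Each inner list corresponds to one worksheet row, so the row/column bookkeeping from the question reduces to starting a new list whenever a product heading appears.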
    

    Algorithm:

    1. We find each product (every div with the 'product-row' class).
    2. We find the tags within each product and extract the relevant information.
    3. We use .text to keep only the text portion of a tag.
    4. We use for loops to iterate through each product and then through its functions, i.e. the links inside the tag that lists the product's capabilities.
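
The steps above, together with the Excel part that I left out, could be sketched roughly as follows. This is a variation rather than the exact code above: it collects every 'attribute-value' span of a product instead of only the first two, and it runs on a made-up HTML fragment so that it works without a network connection. `SAMPLE_HTML`, `product_rows` and `save_rows` are hypothetical names, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the structure used above (an assumption).
SAMPLE_HTML = """
<div class="product-row">
  <div class="product-header"><h2 class="product-name"><a href="#"> TV One </a></h2></div>
  <span class="attribute-value">49 cali, Full HD</span>
  <span class="attribute-value"><a>Wi-Fi</a> <a>USB</a></span>
</div>
<div class="product-row">
  <div class="product-header"><h2 class="product-name"><a href="#">TV Two</a></h2></div>
  <span class="attribute-value">55 cali, 4K</span>
</div>
"""


def product_rows(html):
    """Return one list per product: [name, attribute value, ...]."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Step 1: find each product.
    for item in soup.find_all("div", class_="product-row"):
        # Step 2: find the tags of interest within the product.
        name = item.find("h2", class_="product-name")
        if name is None:
            continue  # skip blocks without a product heading
        # Step 3: keep only the text portion.
        row = [name.get_text(strip=True)]
        # Step 4: iterate through the product's attribute values.
        for value in item.find_all("span", class_="attribute-value"):
            row.append(value.get_text(" ", strip=True))
        rows.append(row)
    return rows


def save_rows(rows, path):
    """Write each product to its own worksheet row with openpyxl."""
    import openpyxl  # imported here so the parsing part runs without it

    wb = openpyxl.Workbook()
    ws = wb.active
    for row in rows:
        ws.append(row)  # append() adds a new row, one cell per list item
    wb.save(path)


rows = product_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Calling `save_rows(rows, 'DG3test.xlsx')` would then write the workbook. Using `ws.append` means there are no row and column counters to maintain by hand.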