Python Regex for html tags

Question

I'm trying to get rid of some elements of the HTML code before using an HTML parser. I'm pretty new to regex, which is why I have problems understanding the syntax.

Parts of my HTML code look like this:

<div class="footer" id="footer">
 <other tags> ... bla ... </other tags>
</div>

But it appears that the same "part" of the page can be written differently on a certain sub-page, like this:

<div id="footer" class="footer">
 <other tags> ... bla ... </other tags>
</div>

What I have achieved so far is getting rid of one specific case:

footer = re.sub('<div class="footer" id="footer">.*?</div>','',html)

But what I want is a regex that is more general, so that it gets rid of every part that contains, e.g., id="footer", no matter what is in front of or behind it:

<div ... id="footer" ...> 
<other tags> ... bla ... </other tags>    
</div>

EDIT: before getting "hated": I'm pretty new to HTML parsers too.

Thanks for the help!

MG

Change .*? to [\s\S]*? or use flags=re.DOTALL. Of course, if there is a </div> inside the footer it won't work; use an HTML parser instead.
– YOU, Commented Jan 3, 2017 at 12:56
I just realized that I can find the respective part with soup.findAll('div',{'id':'footer'}). Can I also get rid of these parts with the HTML parser?
– bootica, Commented Jan 3, 2017 at 13:12
Why do you want to get rid of it? With soup, just select the divs you need.
– Bhavesh Ghodasara, Commented Jan 3, 2017 at 13:15
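
To make the first comment concrete, here is a minimal sketch of a more general pattern combined with re.DOTALL. The sample markup and the name FOOTER_RE are made up for illustration, and the pattern assumes the attribute is always written exactly as id="footer" with double quotes. It also shows the nested </div> problem the comment warns about.

```python
import re

# Hypothetical sample page: the same footer written with both attribute orders.
html = """
<div class="footer" id="footer">
 <p>first footer</p>
</div>
<p>keep me</p>
<div id="footer" class="footer">
 <p>second footer</p>
</div>
"""

# [^>]* allows any attributes before and after id="footer";
# re.DOTALL lets the lazy .*? run across line breaks.
FOOTER_RE = re.compile(r'<div\b[^>]*\bid="footer"[^>]*>.*?</div>', re.DOTALL)

cleaned = FOOTER_RE.sub('', html)
print(cleaned)

# Caveat: the lazy .*? stops at the FIRST </div>, so a nested <div>
# leaves part of the footer behind.
nested = '<div id="footer"><div>inner</div>tail</div>'
leftover = FOOTER_RE.sub('', nested)
print(leftover)  # prints: tail</div>
```

This works for flat footers like the ones in the question, but the second case is the reason an HTML parser is the safer tool here.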

Mohammad Yusuf · Accepted Answer · 2017-01-03 13:33:16Z


Why would you want to remove it? As Bhavesh said, just select the ones you want. But if you want to know whether they can be removed: yes, you can get rid of them with decompose().

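from bs4 import BeautifulSoup

a = """
<div class="footer" id="footer">
 <p>lskjdf</p>
</div>

<div id="not_footer" class="footer">
<p>lskjdf</p>
</div>
"""
# The "lxml" parser wraps the fragment in <html><body>, as seen in the output
b = BeautifulSoup(a, "lxml")
print(b)
print('---------------------')
print('---------------------')
# Remove every <div> whose id is exactly "footer"
for c in b.select('div#footer'):
    c.decompose()
print(b)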

Output:

<html><body><div class="footer" id="footer">
<p>lskjdf</p>
</div>
<div class="footer" id="not_footer">
<p>lskjdf</p>
</div>
</body></html>
---------------------
---------------------
<html><body>
<div class="footer" id="not_footer">
<p>lskjdf</p>
</div>
</body></html>
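
For the two attribute orders from the question, the same idea can be wrapped in a small helper. This is only a sketch: strip_footer, page_a and page_b are hypothetical names and sample data, and it uses the built-in "html.parser" backend, so no <html><body> wrapper is added.

```python
from bs4 import BeautifulSoup


def strip_footer(html):
    """Return html with every <div> whose id is "footer" removed."""
    soup = BeautifulSoup(html, "html.parser")
    # The parser matches on the id attribute itself, so attribute order
    # and nested <div> elements inside the footer do not matter.
    for tag in soup.find_all("div", id="footer"):
        tag.decompose()
    return str(soup)


# Hypothetical sample pages: same footer, different attribute order.
page_a = '<div class="footer" id="footer"><div>nested</div><p>x</p></div><p>body a</p>'
page_b = '<div id="footer" class="footer"><p>y</p></div><p>body b</p>'

out_a = strip_footer(page_a)
out_b = strip_footer(page_b)
print(out_a)
print(out_b)
```

Unlike the regex approach, the nested <div> in page_a is removed together with its parent, because decompose() deletes the whole subtree.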

edited Jan 3, 2017 at 13:33

answered Jan 3, 2017 at 13:26


bootica Over a year ago

This is very useful for me and basically exactly what I want. I just want to use everything except the parts that are called "footer".
