I'm trying to remove some elements from the HTML code before passing it to an HTML parser. I'm pretty new to regex, which is why I have trouble understanding the syntax.

Parts of my HTML code look like this:

<div class="footer" id="footer">
 <other tags> ... bla ... </other tags>
</div>

But it appears that the same "part" of the page can be written differently on a certain sub-page, like this:

<div id="footer" class="footer">
 <other tags> ... bla ... </other tags>
</div>

So far I have only managed to remove specific cases:

footer = re.sub('<div class="footer" id="footer">.*?</div>','',html)

But what I want is a more general regex, one that removes every such block whenever the opening tag contains, e.g., id="footer", no matter what comes before or after it:

<div ... id="footer" ...> 
<other tags> ... bla ... </other tags>    
</div> 

EDIT: before getting "hated", I'm pretty new to HTML parsers too.

Thanks for the help!

MG

  • Why can't you use an HTML parser for this problem as well? Commented Jan 3, 2017 at 12:55
  • Change .*? to [\s\S]*? or use flags=re.DOTALL. Of course, if there is a </div> inside the block it won't work; use an HTML parser instead. Commented Jan 3, 2017 at 12:56
  • stackoverflow.com/a/1732454/4954037 Commented Jan 3, 2017 at 12:56
  • I just realized that I can find the respective part with soup.findAll('div',{'id':'footer'}). Can I also get rid of these parts with the HTML parser? Commented Jan 3, 2017 at 13:12
  • Why do you want to get rid of it? With soup, just select the div you need. Commented Jan 3, 2017 at 13:15
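To illustrate the comment about re.DOTALL, here is a minimal sketch of a more general pattern and of the nested </div> problem it runs into. The sample strings and the names pattern, cleaned, and leftover are made up for this illustration and are not from the question's real page.

```python
import re

# Hypothetical sample page: attribute order differs from the hard-coded regex
html = '''<div id="footer" class="footer">
 <p>bla</p>
</div>
<p>keep me</p>'''

# [^>]* allows any attributes before and after id="footer";
# re.DOTALL lets "." match newlines so the block can span several lines
pattern = re.compile(r'<div\b[^>]*\sid="footer"[^>]*>.*?</div>', re.DOTALL)

cleaned = pattern.sub('', html)
print(repr(cleaned))

# With a nested <div>, the lazy .*? stops at the FIRST </div>,
# so part of the footer is left behind
nested = '<div class="footer" id="footer"><div>inner</div> tail</div><p>keep me</p>'
leftover = pattern.sub('', nested)
print(repr(leftover))
```

The second call shows why the comments recommend an HTML parser: a regex cannot tell which </div> closes the footer once divs are nested.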

1 Answer

Why would you want to remove it? As Bhavesh said, you can just select the elements you want. But if you want to know whether they can be removed: yes, you can get rid of them with decompose().

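from bs4 import BeautifulSoup

a = """
<div class="footer" id="footer">
 <p>lskjdf</p>
</div>

<div id="not_footer" class="footer">
<p>lskjdf</p>
</div>
"""
# "lxml" is assumed here; it adds the <html><body> wrapper seen in the output
b = BeautifulSoup(a, "lxml")
print(b)
print('---------------------')
print('---------------------')
# Select every <div> whose id is "footer" and remove it from the tree
for c in b.select('div#footer'):
    c.decompose()
print(b)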

Output:

<html><body><div class="footer" id="footer">
<p>lskjdf</p>
</div>
<div class="footer" id="not_footer">
<p>lskjdf</p>
</div>
</body></html>
---------------------
---------------------
<html><body>
<div class="footer" id="not_footer">
<p>lskjdf</p>
</div>
</body></html>
1 Comment

This is very useful for me and basically exactly what I want. I just want to use everything else except the parts that are called "footer".
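For that use case, a minimal sketch might look like the following. It assumes the built-in "html.parser" backend and a made-up sample page; the div with id="content" and the name remaining_text are hypothetical, only find_all(), decompose(), and get_text() are real BeautifulSoup calls.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with one footer block and one content block
html = """
<div id="footer" class="footer">
 <p>footer text</p>
</div>
<div id="content">
 <p>main text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove every <div id="footer">, regardless of attribute order
for footer in soup.find_all("div", id="footer"):
    footer.decompose()

# Everything that is left can now be used as usual
remaining_text = soup.get_text(" ", strip=True)
print(remaining_text)
```

Because the parser matches on the id attribute itself, it does not matter whether class or id comes first in the tag.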