I'm trying to get rid of some elements of the HTML code before using an HTML parser. I'm pretty new to regex, and that's why I have problems understanding the syntax.
Parts of my HTML code look like this:
<div class="footer" id="footer">
<other tags> ... bla ... </other tags>
</div>
But it appears that the same "part" of the page can be written differently on a certain sub-page, like this:
<div id="footer" class="footer">
<other tags> ... bla ... </other tags>
</div>
What I have managed so far is to get rid of one specific case:
footer = re.sub('<div class="footer" id="footer">.*?</div>','',html)
But what I want is a more general regex, one that gets rid of every such block whenever the opening tag contains, e.g., id="footer", no matter what's in front of or behind it:
<div ... id="footer" ...>
<other tags> ... bla ... </other tags>
</div>
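A minimal sketch of a pattern with that shape, assuming the id value is always written with double quotes and the footer contains no nested div. The names FOOTER_RE and remove_footer are made up for illustration. The re.DOTALL flag is needed because the block spans several lines and "." does not match newlines by default.

```python
import re

# Matches <div ...id="footer"...> ... </div>, whatever the attribute order.
# \s before "id" avoids matching attributes such as data-id="footer".
# Limitation: the lazy .*? stops at the FIRST </div>, so a nested <div>
# inside the footer would leave the rest of the footer behind.
FOOTER_RE = re.compile(
    r'<div\b[^>]*\sid="footer"[^>]*>.*?</div>',
    flags=re.DOTALL,
)


def remove_footer(html):
    """Return html with every matching footer block removed."""
    return FOOTER_RE.sub("", html)
```

Both attribute orders shown above go through the same pattern, so remove_footer(html) could replace the hard-coded re.sub call.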
EDIT: before getting "hated", I'm pretty new to HTML parsers too.
Thanks for the help!
MG
Change .*? to [\s\S]*? or use flags=re.DOTALL. And of course, if there is a </div> inside the block it won't work; use an HTML parser instead.
soup.findAll('div',{'id':'footer'})
Can I also get rid of these parts with the HTML parser?
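On removing the block with a parser: with BeautifulSoup, the usual route is to call decompose() on each tag that the search returns. The sketch below does not use BeautifulSoup. It shows the same idea with only the standard library's html.parser, so it runs without extra packages. FooterStripper and strip_footer are hypothetical names, and the code assumes reasonably well-formed HTML. Unlike the regex, it counts nested div tags, so an inner </div> does not end the removal early.

```python
from html.parser import HTMLParser


class FooterStripper(HTMLParser):
    """Re-emits the input HTML, skipping any <div id="..."> subtree."""

    def __init__(self, target_id="footer"):
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.out = []
        self.skip_depth = 0  # > 0 while inside the div being removed

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1  # nested div inside the footer
            return
        if tag == "div" and dict(attrs).get("id") == self.target_id:
            self.skip_depth = 1  # start skipping, drop the opening tag
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags never change the nesting depth.
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        if not self.skip_depth:
            self.out.append(f"<!{decl}>")


def strip_footer(html, target_id="footer"):
    """Return html without the div whose id equals target_id."""
    parser = FooterStripper(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling strip_footer(html) returns the page as a string, which can then be handed to whichever parser is used for the rest of the processing.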