How to remove HTML elements from a string in Python?

Question

I have a string like the one below, which contains Chinese text:

'<span class=H>宜家</span><span class=H>同款</span> 世纪宝贝儿童餐椅婴儿餐椅宝宝餐椅婴儿吃饭椅'

I would like to delete all HTML elements from this string, so that the result is:

'宜家同款世纪宝贝儿童餐椅婴儿餐椅宝宝餐椅婴儿吃饭椅'

How can I do this with Python and re? Thanks a lot!

alecxe · Accepted Answer · 2015-09-09 17:23:18Z

5

This is trivial to solve with the BeautifulSoup HTML parser:

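>>> from bs4 import BeautifulSoup
>>>
>>> data = '<span class=H>宜家</span><span class=H>同款</span> 世纪宝贝儿童餐椅婴儿餐椅宝宝餐椅婴儿吃饭椅'
>>> # Name the parser explicitly; omitting it makes bs4 pick one and emit a warning.
>>> soup = BeautifulSoup(data, 'html.parser')
>>> soup.text
'宜家同款 世纪宝贝儿童餐椅婴儿餐椅宝宝餐椅婴儿吃饭椅'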

answered Sep 9, 2015 at 17:23

1 Comment

Coeus Wang Over a year ago

It looks like a good solution. I had only thought of using regex and didn't get a correct solution. Thanks a lot, I will try this.

Steven Doggart · Accepted Answer · 2015-09-09 18:05:23Z

1

For a simple solution that uses just regex, you can search for the following pattern and replace all occurrences of it with an empty string:

\s*<[^>]+>\s*

For instance:

import re
p = re.compile(r'\s*<[^>]+>\s*')  # raw string, so \s reaches the regex engine unchanged
p.sub('', '<span class=H>宜家</span><span class=H>同款</span> 世纪宝贝儿童餐椅婴儿餐椅宝宝餐椅婴儿吃饭椅')
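As a runnable sketch of the substitution above, the following script compares the pattern with and without the surrounding `\s*`. The names `TAG_AND_SPACE` and `TAG_ONLY` are illustrative and not part of the original answer.

```python
import re

data = '<span class=H>宜家</span><span class=H>同款</span> 世纪宝贝儿童餐椅婴儿餐椅宝宝餐椅婴儿吃饭椅'

# Tag plus any whitespace around it: also removes the space after the last </span>.
TAG_AND_SPACE = re.compile(r'\s*<[^>]+>\s*')
# Tag only: whitespace between text runs is left untouched.
TAG_ONLY = re.compile(r'<[^>]+>')

without_spaces = TAG_AND_SPACE.sub('', data)
with_spaces = TAG_ONLY.sub('', data)

print(without_spaces)  # 宜家同款世纪宝贝儿童餐椅婴儿餐椅宝宝餐椅婴儿吃饭椅
print(with_spaces)     # 宜家同款 世纪宝贝儿童餐椅婴儿餐椅宝宝餐椅婴儿吃饭椅
```

The first form matches the output requested in the question; the second is the one to use if whitespace next to tags should be preserved.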

Disclaimer: This will by no means handle every possible variation of legal HTML, but as long as all of the input data is as simple as the data in your example, it will work. You could change the pattern as necessary to handle slightly more complex inputs. However, if your intent is to handle any well-formed HTML document as input, then you should consider an actual HTML parser rather than regex.
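If installing a third-party parser is not an option, one possible route is the standard library's `html.parser` module. The sketch below is a minimal illustration under that assumption, not a hardened sanitizer; `TextExtractor` and `strip_tags` are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of the markup fed to it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of markup with all tags dropped."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


data = '<span class=H>宜家</span><span class=H>同款</span> 世纪宝贝儿童餐椅婴儿餐椅宝宝餐椅婴儿吃饭椅'
text = strip_tags(data)
# Optional: drop all whitespace to match the output asked for in the question.
compact = "".join(text.split())
print(text)
print(compact)
```

Like the BeautifulSoup answer, this keeps the space that follows the second `</span>`; the `"".join(text.split())` step removes it, at the cost of removing every other whitespace character as well.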

edited Sep 9, 2015 at 18:05

answered Sep 9, 2015 at 17:41

2 Comments

Pedro Pinheiro Over a year ago

Including \s like this, /\s*<[^>]+>\s*/g, will eliminate all the spaces in the result.

Steven Doggart Over a year ago

@PedroPinheiro Good point. I didn't notice that the desired output in the OP did have the spaces removed. I'll update my answer accordingly. However, the bookend slashes are not necessary in Python. Also, re.sub replaces every match by default, so the g flag is unnecessary as well.
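To illustrate the point about the missing `g` flag, here is a small sketch: `re.sub` replaces every match unless its `count` argument limits it. The variable names are illustrative only.

```python
import re

data = '<span class=H>宜家</span><span class=H>同款</span> 世纪宝贝儿童餐椅婴儿餐椅宝宝餐椅婴儿吃饭椅'

# Default behaviour: every tag is replaced, no /g-style flag needed.
all_removed = re.sub(r'<[^>]+>', '', data)

# count=1 limits the substitution to the first match only.
first_only = re.sub(r'<[^>]+>', '', data, count=1)

print(all_removed)
print(first_only)
```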
