Using Regex to remove script tags

Question

I'm trying to use a regular expression I found on this website, and it doesn't seem to work. Any ideas?

Input string:

sFetch = "123<script type=\"text/javascript\">\n\t\tfunction utmx_section(){}function utmx(){}\n\t\t(function()})();\n\t</script>456";

Regex:

sFetch = Regex.Replace(sFetch, "<script.*?>.*?</script>", "", RegexOptions.IgnoreCase);

You should not use regex to try to parse HTML: HTML is not quite regular. Instead, you should use an HTML parser, such as one based on the DOM.
– Pascal MARTIN, Commented Mar 24, 2010 at 7:27
It looks like you haven't read this article explaining how to use regex to parse HTML: stackoverflow.com/questions/1732348/…
– Darin Dimitrov, Commented Mar 24, 2010 at 7:28
See S.Mark's answer. But all in all, it's not a good regex, and anyway regexes aren't really suited for this.
– Tim Pietzcker, Commented Mar 24, 2010 at 7:28
@Pascal MARTIN: He doesn't want to parse, just remove some text. Don't you see the difference?
– Kamarey, Commented Mar 24, 2010 at 7:31
Tim and Pascal are correct. Whenever I write code to look for dangerous HTML constructs, I always use a DOM, never a regex, if for no other reason than that there are so many ways to escape HTML that it's next to impossible to handle with a regex.
– Michael Howard-MSFT, Commented Mar 24, 2010 at 7:32

YOU · Accepted Answer · 2010-03-24 15:50:44Z

9

Add RegexOptions.Singleline

RegexOptions.IgnoreCase | RegexOptions.Singleline

And that will never work on input like the following:

<script
>
alert(1)
</script
/**/
>

So find an HTML parser, such as HTML Agility Pack.
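As a minimal sketch of both points in this answer, the following applies the question's pattern with RegexOptions.Singleline to the question's input and then to the oddly formatted markup shown above. The class, method, and field names are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class SinglelineFixDemo
{
    // Pattern from the question, unchanged.
    private const string Pattern = "<script.*?>.*?</script>";

    // Input string from the question.
    public static readonly string QuestionInput =
        "123<script type=\"text/javascript\">\n\t\tfunction utmx_section(){}function utmx(){}\n\t\t(function()})();\n\t</script>456";

    // The oddly formatted markup from this answer.
    public static readonly string OddInput =
        "<script\n>\nalert(1)\n</script\n/**/\n>";

    // Removes <script>...</script> blocks; Singleline lets '.' match '\n'.
    public static string StripScripts(string html)
    {
        return Regex.Replace(html, Pattern, "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

With these definitions, StripScripts(QuestionInput) should return "123456", while StripScripts(OddInput) should return its input unchanged, because that text never contains the literal sequence "</script>" that the pattern requires.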

edited Mar 24, 2010 at 15:50

answered Mar 24, 2010 at 7:27

YOU


2 Comments

amitre Over a year ago

Thanks. Any other recommendations for C# packages like HTML Agility Pack for parsing HTML?

Alan Moore Over a year ago

Singleline is the option you want; it allows . to match linefeeds. Multiline causes $ and ^ to match before and after (respectively) linefeeds; it's irrelevant here.
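To make the difference between the two options concrete, here is a small sketch on a two-line sample string. The sample string and the names OptionsDemo, DotCrossesNewline, and AnchoredLineCount are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the two-line sample string is an assumption.
public static class OptionsDemo
{
    public const string TwoLines = "a\nb";

    // Does "a.b" match across the line break? Only if '.' may match '\n'.
    public static bool DotCrossesNewline(RegexOptions options)
    {
        return Regex.IsMatch(TwoLines, "a.b", options);
    }

    // How many one-character lines does "^.$" find?
    // This depends on whether '^' and '$' may match at line breaks.
    public static int AnchoredLineCount(RegexOptions options)
    {
        return Regex.Matches(TwoLines, "^.$", options).Count;
    }
}
```

DotCrossesNewline should be true only with RegexOptions.Singleline, and AnchoredLineCount should be 2 with RegexOptions.Multiline and 0 with RegexOptions.None. Only the first behavior matters for the pattern in the question, which uses no anchors.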

codaddict · Accepted Answer · 2010-03-24 15:53:41Z

8

The reason the regex fails is that your input contains newlines, and the metacharacter . does not match a newline by default.

To solve this you can use the RegexOptions.Singleline option as S.Mark says, or you can change the regex to:

"<script[\d\D]*?>[\d\D]*?</script>"

which uses [\d\D] instead of the dot.

\d is any digit and \D is any non-digit, so [\d\D] matches a digit or a non-digit, which is effectively any character, including a newline.
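A short sketch of this alternative, assuming a hypothetical helper class named DigitClassDemo: the pattern is the one given above, applied with RegexOptions.IgnoreCase only, so no Singleline flag is involved.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class DigitClassDemo
{
    // [\d\D] matches any character, newlines included,
    // so RegexOptions.Singleline is not needed here.
    private const string Pattern = @"<script[\d\D]*?>[\d\D]*?</script>";

    public static string StripScripts(string html)
    {
        return Regex.Replace(html, Pattern, "", RegexOptions.IgnoreCase);
    }
}
```

Called on the input string from the question, StripScripts should return "123456". The same limits apply as for the Singleline version: a closing tag that is not written literally as "</script>" will not be matched.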

edited Mar 24, 2010 at 15:53

YOU


answered Mar 24, 2010 at 7:30

codaddict


1 Comment

amitre Over a year ago

Thanks. Is this a solution also for nested script tags?

Nigel · Accepted Answer · 2011-10-28 08:45:09Z

5

If you actually want to sanitize an HTML string (and you're using .NET), then take a look at the Microsoft Web Protection Library:

Sanitizer.GetSafeHtmlFragment(untrustedHtml);

There's a description here.

edited Oct 28, 2011 at 8:45

answered Oct 28, 2011 at 8:33

Nigel


instcode · Accepted Answer · 2010-03-24 08:17:33Z

2

This is a bit shorter:

 "<script[^<]*</script>"

or

"<[^>]*>[^>]*>"

edited Mar 24, 2010 at 8:17

answered Mar 24, 2010 at 7:55

instcode
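The two shorter patterns in this answer can be checked with a small sketch like the one below. The class name ShortPatternDemo and the sample inputs in the note that follows are assumptions added for illustration; only the two pattern strings come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ShortPatternDemo
{
    // First pattern from the answer: the body may not contain '<'.
    public const string ScriptOnly = "<script[^<]*</script>";

    // Second pattern from the answer: matches any tag pair, not only <script>.
    public const string AnyTagPair = "<[^>]*>[^>]*>";

    public static string Strip(string html, string pattern)
    {
        return Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
    }
}
```

Both patterns should reduce the question's input to "123456", since negated character classes match newlines. Two caveats seem worth checking before relying on them: ScriptOnly leaves a script untouched when its body contains a '<' character, for example "<script>for (var i = 0; i < 3; i++) {}</script>", and AnyTagPair also removes unrelated markup, turning "a<b>bold</b>c" into "ac".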


4 Comments

amitre Over a year ago

Thanks. Is this a solution also for nested script tags?

instcode Over a year ago

Yes, absolutely, because scripts are never nested.

DitherSky Over a year ago

They can be nested in a way, actually. For example if someone assigns variable like var a = "<script>somenestedscript</script>"; inside of it.

YvesR Over a year ago

var a = "<script>somenestedscript</script>"; can never happen, because it would break the JS itself. It would have to be written like var a = "<script>somenestedscript</scr" + "ipt>";
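The two cases discussed in these comments can be tried against the question's pattern with RegexOptions.Singleline added. This is only a sketch: the class name NestedLiteralDemo and its members are hypothetical, and the two inputs are the commenters' examples wrapped in an outer script element.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class NestedLiteralDemo
{
    // A literal closing tag inside a JS string (DitherSky's example).
    public const string LiteralCloseTag =
        "<script>var a = \"<script>somenestedscript</script>\";</script>";

    // The closing tag split across two string pieces (YvesR's example).
    public const string SplitCloseTag =
        "<script>var a = \"<script>somenestedscript</scr\" + \"ipt>\";</script>";

    // Question's pattern, with Singleline so '.' also matches newlines.
    public static string StripScripts(string html)
    {
        return Regex.Replace(html, "<script.*?>.*?</script>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

For LiteralCloseTag, the lazy match stops at the first literal "</script>", so the leftover text is the tail of the string literal followed by the outer closing tag. For SplitCloseTag, the first literal "</script>" is the real closing tag, so the whole element is removed and the result is an empty string.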
