I'm trying to use a regular expression I found on this website, but it doesn't seem to work. Any ideas?

Input string:

sFetch = "123<script type=\"text/javascript\">\n\t\tfunction utmx_section(){}function utmx(){}\n\t\t(function()})();\n\t</script>456";

Regex:

sFetch = Regex.Replace(sFetch, "<script.*?>.*?</script>", "", RegexOptions.IgnoreCase);

  • You should not use regex to try to parse HTML: HTML is not quite regular. Instead, you should use an HTML parser, for example one based on the DOM. Commented Mar 24, 2010 at 7:27
  • It looks like you haven't read this article explaining how to use regex to parse HTML: stackoverflow.com/questions/1732348/… Commented Mar 24, 2010 at 7:28
  • See S.Mark's answer. But all in all, it's not a good regex, and anyway regexes aren't really suited for this. Commented Mar 24, 2010 at 7:28
  • @Pascal MARTIN: He doesn't want to parse, just remove some text. Don't you see the difference? Commented Mar 24, 2010 at 7:31
  • Tim and Pascal are correct. Whenever I write code to look for dangerous HTML constructs, I always use a DOM, never a regex. If for no other reason than that there are so many ways to escape HTML that it's next to impossible to handle with a regex. Commented Mar 24, 2010 at 7:32

4 Answers

Add RegexOptions.Singleline

RegexOptions.IgnoreCase | RegexOptions.Singleline

Even so, that regex will never work on input like the following:

<script
>
alert(1)
</script
/**/
>

So, find an HTML parser such as the HTML Agility Pack.
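
As a rough illustration of the parser route, here is a minimal sketch that removes script elements with the HTML Agility Pack. It assumes the third-party HtmlAgilityPack NuGet package is installed; the names ScriptStripper and RemoveScripts are made up for this example and are not part of the library. How the parser treats malformed markup like the sample above is something to check against your own inputs.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

static class ScriptStripper // hypothetical helper name
{
    public static string RemoveScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches the XPath.
        var scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts != null)
        {
            // Copy to a list so that removing nodes does not disturb the enumeration.
            foreach (var node in scripts.ToList())
            {
                node.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}

class Program
{
    static void Main()
    {
        // Small sample input; the real input would be the fetched page.
        Console.WriteLine(ScriptStripper.RemoveScripts("123<script>alert(1)</script>456"));
    }
}
```

Removing script elements this way is not a complete sanitizer: event-handler attributes and javascript: URLs, for example, are untouched.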

2 Comments

Thanks. Are there any other C# packages like the Agility Pack for parsing HTML?
Singleline is the option you want; it allows . to match linefeeds. Multiline causes $ and ^ to match before and after (respectively) linefeeds; it's irrelevant here.

The regex fails because your input contains newlines, and by default the metacharacter . does not match a newline.

To solve this you can use the RegexOptions.Singleline option as S.Mark says, or you can change the regex to:

"<script[\d\D]*?>[\d\D]*?</script>"

which uses [\d\D] instead of the dot (.).

\d is any digit and \D is any non-digit, so [\d\D] is a digit or a non-digit which is effectively any char.
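
To check both fixes against the input from the question, a small sketch like the following could be used. The class name Demo is made up for illustration. Note that the [\d\D] pattern is written as a verbatim string (@"...") here, because \d is not a valid escape sequence in a regular C# string literal.

```csharp
using System;
using System.Text.RegularExpressions;

class Demo
{
    static void Main()
    {
        string sFetch = "123<script type=\"text/javascript\">\n\t\tfunction utmx_section(){}function utmx(){}\n\t\t(function()})();\n\t</script>456";

        // Original call: '.' stops at '\n', so nothing matches and the input comes back unchanged.
        string unchanged = Regex.Replace(sFetch, "<script.*?>.*?</script>", "", RegexOptions.IgnoreCase);

        // Singleline lets '.' match '\n' as well.
        string viaOption = Regex.Replace(sFetch, "<script.*?>.*?</script>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // [\d\D] matches any character, whatever options are set.
        string viaClass = Regex.Replace(sFetch, @"<script[\d\D]*?>[\d\D]*?</script>", "",
            RegexOptions.IgnoreCase);

        Console.WriteLine(unchanged == sFetch); // True
        Console.WriteLine(viaOption);           // 123456
        Console.WriteLine(viaClass);            // 123456
    }
}
```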

1 Comment

Thanks. Is this a solution also for nested script tags?

If you actually want to sanitize an HTML string (and you're using .NET), then take a look at the Microsoft Web Protection Library:

Sanitizer.GetSafeHtmlFragment(untrustedHtml);

There's a description here.

This is a bit shorter:

 "<script[^<]*</script>"

or

"<[^>]*>[^>]*>"

4 Comments

Thanks. Is this a solution also for nested script tags?
Yes, absolutely, because scripts are never nested.
They can be nested in a way, actually. For example if someone assigns variable like var a = "<script>somenestedscript</script>"; inside of it.
var a = "<script>somenestedscript</script>"; can never happen, because it would break the JS itself. It would have to be written like var a = "<script>somenestedscript</scr" + "ipt>";
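
To make this thread concrete, the sketch below runs the two short patterns from this answer, plus the lazy pattern from the question, over a few small inputs. The inputs and the class name are invented for illustration. It suggests that the short patterns work only while the script body contains no < or > characters, and that the second pattern is not specific to script tags at all.

```csharp
using System;
using System.Text.RegularExpressions;

class ShortPatternDemo
{
    static void Main()
    {
        string p1 = "<script[^<]*</script>";
        string p2 = "<[^>]*>[^>]*>";

        // Script body without '<' or '>': both patterns remove the whole block.
        string simple = "123<script type=\"text/javascript\">\nfoo();\n</script>456";
        Console.WriteLine(Regex.Replace(simple, p1, "")); // 123456
        Console.WriteLine(Regex.Replace(simple, p2, "")); // 123456

        // A '<' inside the script body stops [^<]*, so p1 finds no match.
        string withLessThan = "a<script>if (1 < 2) {}</script>b";
        Console.WriteLine(Regex.Replace(withLessThan, p1, "") == withLessThan); // True

        // p2 is not tied to script tags: it also removes ordinary elements.
        string paragraph = "x<p>keep me</p>y";
        Console.WriteLine(Regex.Replace(paragraph, p2, "")); // xy

        // The "nested" case from the comments: the lazy pattern stops at the
        // first closing tag and leaves the tail of the outer script behind.
        string nested = "1<script>var a = \"<script>x</script>\";</script>2";
        string lazy = Regex.Replace(nested, "<script.*?>.*?</script>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        Console.WriteLine(lazy); // 1";</script>2
    }
}
```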
