1

I'm looking for a regex-based String replacement in Java for the following use case. I'm doing some Groovy-based XML processing, and because of some custom processing (I won't go into detail on this), the resulting XML has some invalid tags, e.g.:

<?xml version='1.0' encoding='UTF-8'?>
<Customer id="xyz" xmlns='http://abc.com'>
<order orderGroup="mock">
    <entry>
        <key>test</key>
    </entry>
</order orderGroup="mock">
</Customer id="xyz">

As you can see, the end tags of elements that have attributes are malformed: they repeat the attributes of the start tag. The XML is treated as a plain string, so I just want to replace such end tags using a regex-based string replacement. For example, replace

</order orderGroup="mock"> with </order>, 
</Customer id="xyz"> with </Customer>

Is there a quick Java String-based regex I can use for such replacements?

Thanks.

1
  • What happens when you load the javadoc, hit Ctrl-F, and type "regex"? Why don't you fix the faulty "custom processing" which generates garbage instead of trying to workaround the problem? Commented Mar 22, 2013 at 23:51

2 Answers

5

Try:

    xml = xml.replaceAll("</([^ >]+).*?>", "</$1>");
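To see the one-liner in context, here is a minimal, self-contained sketch that applies it to the sample from the question. The class name `EndTagCleaner` and the method name `stripEndTagAttributes` are made up for illustration; only `String.replaceAll` and the pattern come from the answer. It assumes attribute values never contain a `>` character.

```java
class EndTagCleaner {

    // Hypothetical helper: keeps the tag name of every end tag and drops
    // whatever follows it up to the next '>'.
    static String stripEndTagAttributes(String xml) {
        return xml.replaceAll("</([^ >]+).*?>", "</$1>");
    }

    public static void main(String[] args) {
        // The malformed sample from the question, as a plain string.
        String xml = "<Customer id=\"xyz\" xmlns='http://abc.com'>\n"
                   + "<order orderGroup=\"mock\">\n"
                   + "    <entry>\n"
                   + "        <key>test</key>\n"
                   + "    </entry>\n"
                   + "</order orderGroup=\"mock\">\n"
                   + "</Customer id=\"xyz\">";

        // Start tags are untouched because the pattern only matches "</".
        System.out.println(stripEndTagAttributes(xml));
    }
}
```

Well-formed end tags such as `</key>` also match the pattern, but they are replaced with themselves, so the result is unchanged for them.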

2 Comments

+1, but I would have used </([^\s>]+)[^>]+>. .*? is a fickle friend; why put yourself at its mercy when you can so easily say exactly what you want?
I agree about \\s, but that regex seems to convert "<e1><e2></e2></e1>" -> "<e1><e2></e></e>"
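The behaviour described in the last comment comes from backtracking: `[^>]+` needs at least one character, so on a valid end tag like `</e2>` the capturing group gives back its last character to let the match succeed. A possible workaround, sketched here as an assumption rather than something proposed in the thread, is to require a whitespace character right after the tag name so that valid end tags do not match at all. The class and method names are hypothetical.

```java
class BacktrackingDemo {

    // Pattern from the comment: the group can shrink so that [^>]+ still
    // finds one character, which truncates valid tag names.
    static String commentPattern(String xml) {
        return xml.replaceAll("</([^\\s>]+)[^>]+>", "</$1>");
    }

    // Assumed fix: demand whitespace after the name, so only end tags
    // that actually carry trailing content are rewritten.
    static String requireWhitespace(String xml) {
        return xml.replaceAll("</([^\\s>]+)\\s[^>]*>", "</$1>");
    }

    public static void main(String[] args) {
        String valid = "<e1><e2></e2></e1>";
        System.out.println(commentPattern(valid));    // <e1><e2></e></e>
        System.out.println(requireWhitespace(valid)); // <e1><e2></e2></e1>
        System.out.println(requireWhitespace("</order orderGroup=\"mock\">")); // </order>
    }
}
```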
2

The easiest solution is to fix your custom XML processing and have it generate valid XML.

The easy solution is to use something like JTidy to clean up your XML.

If you must use regex, you could try something like this:

Pattern pattern = Pattern.compile("</([A-Za-z]+) [^>]+>");
Matcher matcher = pattern.matcher(xml);

if(matcher.find()) {
   xml = matcher.replaceAll(matcher.group(1));
}

I haven't tested this out, so keep that in mind. There might be a few issues.

Explanation of the regex:

<         -> The opening angle bracket of the tag
/         -> The / that marks a closing tag
(         -> Start of a capturing group. We want to capture the actual ending tag.
[A-Za-z]+ -> One or more alphabetic characters (upper and lowercase)
)         -> End of the capturing group.
          -> A space.
[^>]+     -> One or more of anything that is not a closing angle-bracket.
>         -> The closing angle bracket of the tag.
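As the comments below point out, the snippet above replaces each match with the bare tag name, without the angle brackets. A minimal corrected sketch, assuming the same pattern, puts the brackets back in the replacement string and refers to the captured name with `$1`; no prior `find()` call is needed. The class name `PatternEndTagFixer` and the method name `fix` are hypothetical.

```java
import java.util.regex.Pattern;

class PatternEndTagFixer {

    // Same regex as above: an end tag whose alphabetic name is followed
    // by a space and some non-'>' characters.
    private static final Pattern END_TAG_WITH_ATTRS =
            Pattern.compile("</([A-Za-z]+) [^>]+>");

    // "$1" is the captured tag name; "</" and ">" are written literally
    // so the end tag keeps its brackets.
    static String fix(String xml) {
        return END_TAG_WITH_ATTRS.matcher(xml).replaceAll("</$1>");
    }

    public static void main(String[] args) {
        System.out.println(fix("</Customer id=\"xyz\">")); // </Customer>
        System.out.println(fix("<key>test</key>"));        // <key>test</key>
    }
}
```

Because the pattern requires a literal space after the name, valid end tags are left alone. Tag names containing digits, hyphens, or namespace prefixes would not match `[A-Za-z]+` and would need a wider character class.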

3 Comments

Thanks Vivin! That works to an extent. Only issue is, it replaced even the start and ending angle brackets. In other words, it results in Customer instead of <Customer>
As Evgeniy's answer shows, this solution is much more verbose than it needs to be. In particular, it's never necessary to call find() before doing the substitution. replaceAll() does that itself, and if there are no matches it returns the original string unchanged. You don't need to call methods like group(n) for the replacement string, either. If there happen to be any dollar signs or backslashes in the string you'll get a runtime exception; that's not a problem if you use "$1".
Yes, his answer is much better.
