`ReClassicParser`'s regular expression syntax

The regular expression syntax employed by ReClassicParser is similar but not the same as that used in perl, python, tcl or the C library. The two reasons for this are that it sacrifices some features for speed and that it contains operators which are not known in standard implementations: invert(), not() and shortest().

Regular expressions are built up from atoms and operators which combine atoms into larger regular expressions.

Atoms

c

A single character is a regular expression matching exactly this character. Characters with a special meaning in the regular expression syntax must be escaped as described below. Since Java is UNICODE based, a character can be any UNICODE character.

\xhh \uhhhh

denotes the unicode character with the hexadecimal value according to the 2 or 4 hexadecimal digits, which can be lower or upper case.

\c

Except for the case noted above, a character escaped by a preceding backslash matches that character, even if that character otherwise has a special meaning in the regular expression syntax. Note that two consecutive backslash characters are needed in constant strings within Java programs. To construct an Nfa which matches the open bracket, write

Nfa nfa = new Nfa("\\[")

[...]

A set of characters enclosed in brackets matches any single character mentioned in the set. Within the brackets, a range of characters is denoted by the first and the last character of the range separated by the dash, e.g. [a-z]. If the caret "^" is the first character within the brackets, the character set is inverted, i.e. it will match any character not in the set. To include the right bracket, dash or caret in the set, they must be preceeded by a backslash. (Two backslashes in constant strings in a Java program, because one backslash is eaten up by the compiler.)

(re)

A regular expression re enclosed in parentheses matches exactly whatever the re matches.

(!re)

A regular expression re enclosed by (! and ) matches exactly whatever the re matches. In addition, the re is made into a reporting subexpression.

Operators

re?: A regular expression re followed by the question mark matches re or the empty string.
re+: A regular expression re followed by the plus sign matches one or more occurences of re.
re*: A regular expression re followed by the asterisk matches zero or more occurences of re.
re{n} re{n,} re{n,m}: A regular expression followed by one of the shown range specifications matches exactly n, at least n or between n and m (inclusive) occurences of re. The range may not specify the empty string, e.g. {0} or {0,0}. Further n≤m is required.
Note: A Dfa cannot count. Therefore the Dfa for re is internally replicated up to m times. For a large Dfa or large m or even both, this will use a lot of memory. The feature was unit tested with m<200 and a 4 character Dfa.
re!: A regular expression re followed by the exclamation mark matches the shortest match satisfying re. This operator is particularly useful to jump to the first occurence of some string. For example the expression "(.*</b>)!" matches everything up to and including the first "</b>" found. A comparison with the non-greedy "*" operator available in other regular expression engines can be found below.
re~: A regular expression re followed by the tilde matches any string which does not match re (See also invert().)
re^: A regular expression re followed by the hat (caret) matches any string which does not contain a match of re. The tilde operator is a convenience shortcut for "((.*re.*)?)~". If in doubt whether to use re~ or re^, read the explanation about not() and use re^.
re@: A regular expression refollowed by the at sign matches all strings that re matches as well as all non-empty prefixes of these.
re₁re₂: matches all strings which match re₁ immediately followed by a match of re₂.
re₁|re₂: matches all strings which match either re₁ or re₂.

Reporting Subexpressions

In his reference book about regular expressions, Friedl notes on page 150 that

The way a DFA engine works completely precludes these concepts.

while referring to reporting subexpressions. Since this package relies on deterministic finite automata (DFA), there should be no reporting subexpressions. But there are. And in general — they do not work.

But for many practically relevant cases they do work quite well. It requires, however, some overhead to keep track of subexpressions in the DFA. This is why, in contrast to other packages, reporting subexpressions are not the default. They have to be specifically requested by enclosing a regular expression in (! and ), like in "(![ab]+)".

As mentioned above, subexpression tracking is not perfect. It breaks down in particular when the reporting subexpression competes with another subexpression in matching. As an example consider

(!(ab)+)(!abc)

which matches a sequence of 'a's and 'b's terminated by a 'c'. Obviously the two subexpressions overlap on the string "ab" and the implementation will report this: within the string

"---ababc---"

the match "ababc" will be found with the two submatches "abab" and "abc" assigned to the two subexpressions respectively.

As another example consider the regular expression

(!ab*)(!abc)

when matching in the string "---aaaabc---aaababc---". Then the reported match and its first and second subgroup are as follows:

%0	%1	%2
aabc	a	abc
ababc	ab	abc

See monq.jfa.PrintFormatter for an explanation of how to use %0, etc. in format strings. Submatches and formatting are used together in a monq.jfa.actions.Printf object.

The difference to the previous example is that when the second 'a' of the matching string is seen, there is no doubt that the second subgroup has to be entered. In the previous example, when encountering the second 'a', it may be the start of another "ab" pair of the first subgroup or it may be the start of the second subgroup.

The bottom line is: the more clear cut the separation between subexpressions is, the better it will work. Nested subexpressions may work, but are not at all tested.

What if reporting subexpressions do not work

If the results of using submatches turn out to be unusable, Jfa offers also another approach to get hold of parts of a long match. The matched string is simply analysed a second time. But since the matching text is known to stem from a certain regular expression, its structure is known and verified. Consequently the second analysis can usually be quick, simple and dirty. Examples are splits at fixed character positions or fixed characters. Callback objects which split a matched text must be of type TextSplitter. The interface describes functionality to split a string of known structure into parts. The course of action then goes like this:

The DFA finds a match which has interesting parts.
The whole match is split into parts by a TextSplitter.
Finally, a Formatter is called to rearrange the resulting parts into the output.

When it comes to using a TextSplitter there are two scenarios:

When you have full control over the regular expression and the associated TextSplitter, there is actually no reason to use a Formatter. Just write a normal FaAction which calls the TextSplitter and then use the result to assemble some new text to replace the matched text.
If, however, you want to supply only the regular expression and a corresponding TextSplitter as a module for other programmers to use, then write a method which creates an Nfa bound to an FaAction which first calls your TextSplitter and passes result to the Formatter supplied by the user of your module.

An example of the latter is EmptyElemTagNfa which makes sure that the regular expression used and the TextSplitter applied correspond to each other.

Non Greedy Matching vs. Shortest Match

The difference is best explained by an example. When trying to match a full XML element followed by a certain context, one may be tempted to write

    <tag>.*?</tag><otherTag>

employing the non-greedy operator "*?" available in java.util.regex. However, non-greedy operators sacrifice the shortest possible match for an overall match of the regular expression if necessary. Consequently the above expression would match the text

    <tag>bla</tag><somestuff>...</somestuff><tag>xxx</tag><otherTag>

just because the longer match satisfies the regular expression, while stopping at the first </tag> would not match.

In contrast, the shortest match operator as implemented by jfa does not give up the shortest possible match of a subexpression to allow the whole expression to match. Consequently,

    <tag>(.*</tag>)!<otherTag>