ReClassicParser
's regular expression syntaxThe regular expression syntax employed by ReClassicParser is similar but
not the same as that used in perl, python, tcl
or the
C library. The two reasons for this are that it sacrifices some
features for speed and that it contains operators which are not
known in standard implementations: invert(), not() and shortest().
Regular expressions are built up from atoms and operators which combine atoms into larger regular expressions.
Nfa
which matches the
open bracket, write
Nfa nfa = new Nfa("\\[")
[a-z]
. If the caret "^"
is the
first character within the brackets, the character set is
inverted, i.e. it will match any character not in the set. To
include the right bracket, dash or caret in the set, they must
be preceeded by a backslash. (Two backslashes in constant
strings in a Java program, because one backslash is eaten up
by the compiler.)
(!
and )
matches exactly whatever the re matches. In addition,
the re is made into a reporting
subexpression.
"(.*</b>)!"
matches everything up to and
including the first "</b>"
found. A
comparison with the non-greedy "*"
operator
available in other regular expression engines can
be found below.
"((.*re.*)?)~"
. If in doubt whether to use
re~
or re^
, read the explanation
about not() and use re^
.
In his reference book about regular expressions, Friedl notes on page 150 that
The way a DFA engine works completely precludes these concepts.while referring to
reporting subexpressions. Since this package relies on deterministic finite automata (DFA), there should be no reporting subexpressions. But there are. And in general — they do not work.
But for many practically relevant cases they do work quite
well. It requires, however, some overhead to keep track of
subexpressions in the DFA. This is why, in contrast to other
packages, reporting subexpressions are not the default. They have
to be specifically requested by enclosing a regular expression in
and (!
, like in
)
"(![ab]+)"
.
As mentioned above, subexpression tracking is not perfect. It breaks down in particular when the reporting subexpression competes with another subexpression in matching. As an example consider
(!(ab)+)(!abc)
which matches a sequence of 'a's and 'b's terminated by a
'c'. Obviously the two subexpressions overlap on the string
"ab"
and the implementation will report this: within
the string
"---ababc---"
the match "ababc"
will be found with the two
submatches "abab"
and "abc"
assigned to
the two subexpressions respectively.
As another example consider the regular expression
(!ab*)(!abc)
when matching in the string
"---aaaabc---aaababc---"
. Then the reported match and
its first and second subgroup are as follows:
%0 | %1 | %2 |
aabc | a | abc |
ababc | ab | abc |
See
monq.jfa.PrintFormatter
for an explanation of
how to use %0
, etc. in format strings. Submatches
and formatting are used together in a
monq.jfa.actions.Printf
object.
The difference to the previous example is that when the second
'a'
of the matching string is seen, there is no doubt
that the second subgroup has to be entered. In the previous
example, when encountering the second 'a'
, it may be
the start of another "ab"
pair of the first
subgroup or it may be the start of the second subgroup.
The bottom line is: the more clear cut the separation between subexpressions is, the better it will work. Nested subexpressions may work, but are not at all tested.
If the results of using submatches turn out to be unusable,
Jfa
offers also another approach to
get hold of parts of a long match. The matched string is simply
analysed a second time. But since the matching text is known to
stem from a certain regular expression, its structure is known
and verified. Consequently the second analysis can usually be
quick, simple and dirty. Examples are splits at fixed character
positions or fixed characters. Callback objects which split a
matched text must be of type
TextSplitter. The interface
describes functionality to split a string of known structure
into parts. The course of action then goes like this:
When it comes to using a TextSplitter
there are two
scenarios:
TextSplitter
, there is actually no
reason to use a Formatter
. Just write a normal
FaAction
which calls the
TextSplitter
and then use the result to assemble
some new text to replace the matched text.TextSplitter
as a module for
other programmers to use, then write a method which creates an
Nfa
bound to an FaAction
which first
calls your TextSplitter
and passes result to the
Formatter
supplied by the user of your
module.
An example of the latter is
EmptyElemTagNfa which
makes sure that the regular expression used and the
TextSplitter
applied correspond to each other.
The difference is best explained by an example. When trying to match a full XML element followed by a certain context, one may be tempted to write
<tag>.*?</tag><otherTag>employing the non-greedy operator
"*?"
available in
java.util.regex
. However, non-greedy
operators sacrifice the shortest possible match for an overall
match of the regular expression if necessary. Consequently the
above expression would match the text<tag>bla</tag><somestuff>...</somestuff><tag>xxx</tag><otherTag>just because the longer match satisfies the regular expression, while stopping at the first
</tag>
would
not match. In contrast, the shortest match operator as
implemented by jfa
does not give up the shortest
possible match of a subexpression to allow the whole expression to
match. Consequently,
<tag>(.*</tag>)!<otherTag>would not match at the beginning of the above string.