public class Grep extends java.lang.Object implements ServiceFactory
is a class and a command line program to copy input to output depending on matching regular expressions. The following paragraphs describe how Grep works. You need to read them in order to understand the documentation of the constructors as well as the output generated when passing the option "-h" to the command line program.
Grep accepts an arbitrary number of pairs (re,f) on the command line where re is a regular expression and f is a format as understood by PrintfFormatter.
In normal operation, whenever one of the regular expressions matches, the respective format is used to print the match. Non-matching text is deleted, except if option "-cr" (parameter copy) is given.
Example:
java monq.programs.Grep 'insulin *[^ ]+' "%0(8,0)\n"
is a rough way to find all the words which appear just after insulin in a text.
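The idea behind this example can be sketched in plain Java. The sketch below uses java.util.regex rather than the classes documented on this page, and it picks out the word with a capturing group instead of a format string; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: java.util.regex stands in for the package used by Grep.
class WordAfterInsulin {
    // Same expression as on the command line above, with a capturing group
    // added around the word so that it can be picked out of the match.
    private static final Pattern AFTER_INSULIN = Pattern.compile("insulin *([^ ]+)");

    // Returns every run of non-space characters that follows "insulin".
    static List<String> wordsAfterInsulin(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = AFTER_INSULIN.matcher(text);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints "receptor" and "tightly", one per line.
        for (String w : wordsAfterInsulin("the insulin receptor binds insulin tightly")) {
            System.out.println(w);
        }
    }
}
```

As in the command line version, a "word" here is simply anything up to the next space, which is why the result is only a rough approximation.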
In more interesting applications, option "-r" (or the many-argument constructor) is used with two regular expressions, roiOn and roiOff, to specify a Region Of Interest (ROI). Grep will then apply its regular expressions only in those ranges of input text which lie between pairs of matches of roiOn and roiOff. In particular, Grep starts pattern matching just after a match of roiOn and stops as soon as a match of roiOff is
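found. The generated output is as follows:
1. Text outside of a ROI is deleted (change with "-co", or parameter copyEnv).
2. A ROI without any match is deleted (change with "-cr" or parameter select).
3. The text matching roiOn and roiOff is copied (change with "-rf" or parameters rfOn and rfOff).
4. Non-matching text within a ROI is copied (change with "-d" or parameter fmRoi).
Examples:
java monq.programs.Grep \
  -r '<MedlineCitation' '</MedlineCitation>' \
  -rf "%0" "%0\n\n" \
  insulin "[[[%0]]]"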
will fetch from a Medline file all those entries which contain the word insulin somewhere. The output of each ROI will be followed by a pair of newlines. In addition, the word insulin will be triple bracketed in the output.
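To make the ROI behaviour of this example concrete, here is a minimal sketch in plain Java. It uses java.util.regex, not the automaton package behind Grep, and the class and method names are hypothetical. It keeps only regions that contain a match, rewrites each match, and appends two newlines after the end of each region.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: mimics the ROI example with java.util.regex.
class RoiSketch {
    // Copies every region delimited by roiOn and roiOff that contains a
    // match of re, rewriting each match with the given replacement.
    static String grepRoi(String input, String roiOn, String roiOff,
                          String re, String replacement) {
        // A region runs from a match of roiOn to the first following match of roiOff.
        Pattern roi = Pattern.compile(
            "(?s)(?<on>" + roiOn + ")(?<body>.*?)(?<off>" + roiOff + ")");
        Pattern inner = Pattern.compile(re);
        StringBuilder out = new StringBuilder();
        Matcher m = roi.matcher(input);
        while (m.find()) {
            // Matching starts just after roiOn and ends before roiOff.
            Matcher im = inner.matcher(m.group("body"));
            if (!im.find()) {
                continue; // a region without a match is dropped
            }
            out.append(m.group("on"))
               .append(im.replaceAll(replacement))
               .append(m.group("off"))
               .append("\n\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "junk<MedlineCitation>uses insulin daily</MedlineCitation>"
            + "more<MedlineCitation>no hormone</MedlineCitation>";
        // Only the first citation is printed, with insulin triple bracketed.
        System.out.print(grepRoi(input, "<MedlineCitation", "</MedlineCitation>",
                                 "insulin", "[[[$0]]]"));
    }
}
```

Text outside the regions ("junk", "more") and the second citation do not appear in the output, mirroring the defaults described above.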
java monq.programs.Grep \
  -cr \
  -r '<protein>' '</protein>'
extracts all <protein> elements from an XML file without testing them for any regular expression.
java monq.programs.Grep \
  -cr \
  -r '<protein>' '</protein>' \
  -rf '%0' '%0\n' \
  '[\r\n\t ]+' ' '
works almost as above. In addition, whitespace, which includes newlines, is normalized to one space character. As a result, each whole <protein> element will be on a line of its own. You may then process the output with classic grep again if you feel more comfortable with it.
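The effect of this last command can be approximated in a few lines of plain Java. This is a sketch with java.util.regex under the assumption that elements are not nested; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: one <protein> element per output line.
class ProteinLines {
    private static final Pattern ELEMENT =
        Pattern.compile("(?s)<protein>.*?</protein>");

    static String extract(String xml) {
        StringBuilder out = new StringBuilder();
        Matcher m = ELEMENT.matcher(xml);
        while (m.find()) {
            // Collapse every run of whitespace, newlines included, to one space.
            String oneLine = m.group().replaceAll("[\\r\\n\\t ]+", " ");
            out.append(oneLine).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<db>\n<protein>\n  <name>INS</name>\n</protein>\n"
            + "<protein>\tX </protein></db>";
        System.out.print(extract(xml));
    }
}
```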
Regular expressions (REs) are not trivial. In particular when working with several REs competing in parallel for matches, things easily start to be confusing. The notes below might help to sort some things out.
Grep is not line oriented like the classic grep. It does not care about line separators, unless you specify them explicitly in your regular expressions.
Avoid '.*' or '.+'.
Grep lumps together all given REs plus the regular expression denoting the end of the ROI into one big RE and matches everything in parallel. The longest match wins. Guess which RE wins if '<tag>' competes with '.+'? The latter always matches all the way to the very end of input and wins. Consequently the '<tag>' will never win and therefore the ROI will never end.
Instead of '.+', always try to use a restricted character set. For XML elements which cannot have child elements, '[^<]+' is your best bet.
There is, however, one admissible use of '.*'. When you use it together with a shortest match operator like in '(.*</endtag>)!', you are guaranteed that the match will extend exactly to the first match of '</endtag>'.
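How far each kind of pattern reaches can be seen with a small experiment. The sketch below uses java.util.regex, which is a backtracking engine and does not perform the parallel longest-match competition described above, so it only illustrates the reach of a single pattern. The reluctant quantifier '.*?' is java.util.regex syntax and serves here as a rough analogue of the shortest match operator; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: compares how far three patterns extend.
class GreedyDemo {
    // Returns the first match of regex in input, or null if there is none.
    // DOTALL lets '.' match line separators as well.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a>one</a><a>two</a>";
        // Unrestricted '.+' runs to the very end of the input.
        System.out.println(firstMatch("<a>.+", input));
        // A restricted character set stops in front of the next '<'.
        System.out.println(firstMatch("<a>[^<]+", input));
        // A shortest match ends at the first closing tag.
        System.out.println(firstMatch("<a>.*?</a>", input));
    }
}
```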
'%1', '%2', etc. in formats
Apart from '%0', position parameters '%1', etc. can be used in a format string under certain circumstances. In contrast to many other regular expression packages, the package employed in Grep does not work with so-called capturing parentheses (see note 1 below) by default. Instead, a pair of parentheses is made into reporting parentheses by using an exclamation mark as the first character after the opening parenthesis, like in "(![a-z]+)".
Example:
Grep '<entry +[a-zA-Z]+="(![^"]+)"' "(%1)\n"
will fetch the first attribute value from an XML element called '<entry>'.
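For readers more familiar with java.util.regex, the same extraction can be sketched there. Note the difference in syntax: java.util.regex captures with plain parentheses, whereas the package used by Grep needs the "(!" form shown above. The class and method names below are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: java.util.regex counterpart of the example above.
class FirstAttribute {
    // Plain parentheses capture the attribute value in java.util.regex.
    private static final Pattern ENTRY =
        Pattern.compile("<entry +[a-zA-Z]+=\"([^\"]+)\"");

    // Prints the first attribute value of every <entry> as "(value)".
    static String firstAttributeValues(String xml) {
        StringBuilder out = new StringBuilder();
        Matcher m = ENTRY.matcher(xml);
        while (m.find()) {
            out.append('(').append(m.group(1)).append(")\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(firstAttributeValues(
            "<entry id=\"42\" kind=\"x\"/><entry name=\"abc\">"));
    }
}
```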
1) This has to do with the fact that Grep works with deterministic finite automata, which normally completely preclude the use of capturing parentheses. The details of why this is so are too complicated to explain here. Details can be found here.
Constructor and Description |
---|
Grep(boolean copy, boolean autoPrio, java.lang.String[] args): create a Grep without ROI. |
Grep(java.lang.String roiOn, java.lang.String roiOff, java.lang.String rfOn, java.lang.String rfOff, DfaRun.FailedMatchBehaviour fmRoi, boolean select, boolean copyEnv, boolean autoPrio, java.lang.String[] args): create a Grep with ROI. |
Modifier and Type | Method and Description |
---|---|
DfaRun | createRun(): is the way to apply the machinery created with one of the constructors. |
Service | createService(java.io.InputStream in, java.io.OutputStream out, java.lang.Object param): creates a Service that reads from the given input stream, processes it and writes the result to the given output stream. |
static void | main(java.lang.String[] argv): call with command line option "-h" for a short summary of operation. |
public Grep(boolean copy, boolean autoPrio, java.lang.String[] args) throws ReSyntaxException, CompileDfaException
create a Grep without ROI.
Parameters:
copy - if true, non-matching text is copied.
autoPrio - if true, all regular expressions in args are auto-prioritized to suppress any CompileDfaException due to competing regular expressions.
args - pairs of regular expression and Printf formats.
Throws:
ReSyntaxException - if either a regular expression or a format string has a syntax error
CompileDfaException - if any one of the regular expressions matches the empty string or if autoPrio==false and there are competing regular expressions.
public Grep(java.lang.String roiOn, java.lang.String roiOff, java.lang.String rfOn, java.lang.String rfOff, DfaRun.FailedMatchBehaviour fmRoi, boolean select, boolean copyEnv, boolean autoPrio, java.lang.String[] args) throws ReSyntaxException, CompileDfaException
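create a Grep with ROI. With a ROI, the input text is separated into 4 different parts:
1. text outside of a ROI,
2. text matching roiOn or roiOff,
3. text within a ROI that matches one of the regular expressions in args,
4. text within a ROI that matches none of them.
In addition, a distinction is made between ROIs containing a match and those which don't. The parameters allow detailed control over how each of these is handled.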
Parameters:
roiOn - regular expression starting ROI
roiOff - regular expression ending ROI
rfOn - format with which to print start of ROI, null means use Copy.COPY
rfOff - format with which to print end of ROI, null means use Copy.COPY
fmRoi - how to handle non-matching input within ROI
select - true will copy only ROIs with a match
copyEnv - true will copy anything not in ROI
autoPrio - if true, all regular expressions in args are auto-prioritized to suppress any CompileDfaException due to competing regular expressions.
args - pairs of regular expression and Printf formats.
Throws:
ReSyntaxException - if either a regular expression or a format string has a syntax error
CompileDfaException - if any one of the regular expressions matches the empty string or if autoPrio==false and there are competing regular expressions.
public DfaRun createRun()
is the way to apply the machinery created with one of the constructors. Simply use one of the filter() methods supplied by a DfaRun.
public Service createService(java.io.InputStream in, java.io.OutputStream out, java.lang.Object param)
Description copied from interface: ServiceFactory
creates a Service that reads from the given input stream, processes it and writes the result to the given output stream.
This method should return as fast as possible, because it is run in the TcpServer's main thread, so no other, parallel connections can be initiated while this method is at work. In particular, nothing that can block the thread, like reading input, should be done in this method. Either move setup code into the constructor of the ServiceFactory or, if it is connection related, move it into the run() method of the Service itself.
A similar note obviously applies to the constructor of the Service, which is most probably called by this method.
Notice: The Service created should not close the streams. While it would be logical to close the streams after reaching EOF, this does not work well with streams originating from a socket. Consequently the method which calls a ServiceFactory has to take care to eventually close the streams. It is, however, a good idea to flush the output stream.
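The contract described above can be sketched in plain Java. The sketch deliberately does not use the Service and ServiceFactory interfaces documented here; a java.lang.Runnable serves as a hypothetical stand-in, and the class and method names are made up for illustration. The factory method touches neither stream, all reading happens in run(), and the output is flushed but not closed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustration only: a Runnable stands in for a service object.
class UpperCaseServiceSketch {
    // Returns immediately: nothing here reads from 'in' or writes to 'out'.
    static Runnable createService(InputStream in, OutputStream out) {
        return () -> {
            try {
                int b;
                while ((b = in.read()) != -1) {
                    // Upper-case ASCII letters, copy everything else unchanged.
                    out.write(b >= 'a' && b <= 'z' ? b - 32 : b);
                }
                out.flush(); // flush, but leave closing to the caller
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    // Runs the sketch on in-memory streams and returns what was written.
    static String run(String text) {
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        createService(in, out).run();
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(run("grep over a socket"));
    }
}
```

In a server setting the caller would typically hand the returned object to a worker thread and close the socket streams itself once the work is done.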
Specified by:
createService in interface ServiceFactory
Parameters:
param - is an arbitrary parameter object that may be used to tweak the service created beyond its input and output stream. TcpServer sets it to null, but when stacking up service factories (like with FilterServiceFactory), this is useful.
public static void main(java.lang.String[] argv) throws java.lang.Exception
call with command line option "-h" for a short summary of operation.
Throws:
java.lang.Exception