public class DictFilter extends java.lang.Object implements ServiceFactory
(Command Line Program) A class that creates a text markup filter
from a long list of term/ID or regex/ID mappings. A canonical use,
implemented in main(java.lang.String[]), looks like:
    InputStream mwtFile = new FileInputStream(mwtFileName);
    DictFilter dict = new DictFilter(mwtFile, inputType, elemName, verbose);
    mwtFile.close();
    DfaRun r = dict.createRun();
    r.setIn(System.in);
    r.filter(System.out);

An example input file looks like:
<?xml version='1.0'?>
<mwt>
<template><protein id="%1">%0</protein></template>
<t p1="ipi4355">casein</t>
<t p1="X33355">p53</t>
<template><disease id="%1">%0</disease></template>
<t p1="UMLS4711" p2="bla">alzheimer</t>
<template><special>%0</special></template>
<r>[ \r\n\t;.,:?!]</r>
</mwt>
The <template> element describes the output to be produced on a match.
The whole content of the element is taken as-is and interpreted as a
format for PrintfFormatter. It applies to all the <r> and <t> elements
that follow, up to the next <template> element or the end of the file.
The print format directive %0 refers to the whole match, while %1 etc.
refer to the values provided as attributes named p1 etc. in the <r> and
<t> elements.
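
The substitution rule described above can be illustrated with a small, self-contained sketch. It is not monq's PrintfFormatter; the class TemplateSketch and its expand() method are hypothetical names, and the sketch only assumes the behaviour stated here: %0 is replaced by the matched text and %1, %2, ... by the values of p1, p2, ... A directive with no corresponding value is copied unchanged, which is an assumption of the sketch rather than documented behaviour.

```java
import java.util.List;

/** Illustrative stand-in only; this is NOT monq's PrintfFormatter. */
class TemplateSketch {
    /**
     * Expands %0 to the matched text and %1..%9 to params.get(n - 1).
     * Directives without a corresponding value are copied unchanged
     * (an assumption made for this sketch).
     */
    static String expand(String template, String match, List<String> params) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < template.length(); i++) {
            char c = template.charAt(i);
            boolean directive = c == '%'
                    && i + 1 < template.length()
                    && template.charAt(i + 1) >= '0'
                    && template.charAt(i + 1) <= '9';
            if (!directive) {
                out.append(c);
                continue;
            }
            int n = template.charAt(i + 1) - '0';
            if (n == 0) {
                out.append(match);               // %0: the whole match
            } else if (n <= params.size()) {
                out.append(params.get(n - 1));   // %n: value of attribute pn
            } else {
                out.append(c).append(template.charAt(i + 1));
            }
            i++; // skip the digit that belongs to the directive
        }
        return out.toString();
    }
}
```

With the first template of the example file, expand() applied to the match casein and the single value ipi4355 yields <protein id="ipi4355">casein</protein>.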
The <t> element describes a dictionary term. It will be converted to a
regular expression with Term2Re.convert(). Note in particular that
terms are matched with a one-character trailing context.
The <r> element allows an arbitrary regular expression to be specified.
The regular expression is taken as-is. In addition to the attributes
p1, p2, etc. as described above, the following attributes are allowed:

tc - the length of the trailing context, that is, the number of
characters at the end of the match that are pushed back into the input
before the <template> is processed. For example
<r tc="1">[a-z]+ </r>
would match a string of lowercase characters only if followed by a blank,
but only the lowercase characters are considered part of the match. Should
a match be shorter than or equal to the length given, nothing at all is
pushed back into the input.
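
The push-back rule for a trailing context can be sketched in a few lines. This is not monq code; TrailingContextSketch and split() are hypothetical names, and the sketch only models the rule stated above: the last tc characters of the raw match go back into the input, unless the raw match is no longer than tc, in which case nothing is pushed back.

```java
/** Illustrative stand-in only; this is NOT part of monq. */
class TrailingContextSketch {
    /**
     * Splits a raw match into { reported match, pushed-back text } for a
     * trailing context of tc characters. If the raw match is shorter than
     * or equal to tc, nothing is pushed back.
     */
    static String[] split(String rawMatch, int tc) {
        if (tc <= 0 || rawMatch.length() <= tc) {
            return new String[] { rawMatch, "" };
        }
        int cut = rawMatch.length() - tc;
        return new String[] { rawMatch.substring(0, cut), rawMatch.substring(cut) };
    }
}
```

For the raw match "abc " and tc=1, the sketch reports "abc" as the match and pushes the blank back; for a raw match consisting of a single blank, nothing is pushed back.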
The input encoding is guessed from the input file with
EncodingDetector.detect(). This can be changed with setInputEncoding().
The output encoding defaults to the platform's default encoding but can
be changed with setOutputEncoding().
Constructor | Description |
---|---|
DictFilter(java.io.Reader mwtFile, java.lang.String inputType, java.lang.String elemName, boolean verbose) | |
DictFilter(java.io.Reader mwtFile, java.lang.String inputType, java.lang.String elemName, boolean verbose, boolean memDebug, boolean defaultWord) | creates a DictFilter from the given Reader, which must comply with the format described above. |
Modifier and Type | Method | Description |
---|---|---|
DfaRun | createRun() | creates a DfaRun object suitable for operating the dictionary DFA. |
Service | createService(java.io.InputStream in, java.io.OutputStream out, java.lang.Object p) | creates a Service that reads from the given input stream, processes it and writes the result to the given output stream. |
Dfa | getDfa() | returns the dictionary DFA. |
static void | main(java.lang.String[] argv) | run on the command line with -h to get a description. |
void | setInputEncoding(java.lang.String enc) | forces createService() to set up the filter with the given input encoding. |
void | setOutputEncoding(java.lang.String enc) | |
public DictFilter(java.io.Reader mwtFile, java.lang.String inputType, java.lang.String elemName, boolean verbose) throws java.io.IOException, ReSyntaxException, CompileDfaException
Throws:
java.io.IOException
ReSyntaxException
CompileDfaException
public DictFilter(java.io.Reader mwtFile, java.lang.String inputType, java.lang.String elemName, boolean verbose, boolean memDebug, boolean defaultWord) throws java.io.IOException, ReSyntaxException, CompileDfaException
creates a DictFilter from the given Reader, which must comply with the
format described above. The inputType is one of the strings "raw", "xml"
or "elem"; for "elem", the XML element to work on is given by elemName.
Parameters:
mwtFile - dictionary file as described above in the class description
inputType - one of the strings "raw", "xml" or "elem"
elemName - the XML element to work on in case inputType=="elem"
verbose - when true, will dump all generated regular expressions to
System.err
memDebug - if true, estimated object sizes of Nfa and Dfa will be dumped
to System.err after compilation of the Nfa. Normally set this to false.
defaultWord - when true, add a catch-all word to the generated automaton
to prevent matching in the middle of a word. This is here for historical
reasons. Normally the catch-all should be in the mwt file.
Throws:
java.io.IOException
ReSyntaxException
CompileDfaException
public void setInputEncoding(java.lang.String enc)
forces createService() to set up the filter with the given input
encoding.
public void setOutputEncoding(java.lang.String enc)
public Dfa getDfa()
public Service createService(java.io.InputStream in, java.io.OutputStream out, java.lang.Object p) throws ServiceCreateException
Description copied from interface: ServiceFactory
creates a Service that reads from the given input stream, processes it
and writes the result to the given output stream.

This method should return as fast as possible, because it is run in the
TcpServer's main thread and thereby no other, parallel connections can
be initiated while this method is at work. In particular, nothing that
can block the thread, like reading input, should be done in this method.
Either move setup code into the constructor of the ServiceFactory or, if
it is connection related, move it into the run() method of the Service
itself.

A similar note obviously applies to the constructor of the Service,
which is most probably called by this method.

Notice: The Service created should not close the streams. While it would
be logical to close the streams after reaching EOF, this does not work
well with streams originating from a socket. Consequently, the method
which calls a ServiceFactory has to take care to eventually close the
streams. It is, however, a good idea to flush the output stream.
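
The advice above (a constructor that does no blocking work, per-connection work in run(), flushing but not closing the streams) can be sketched as follows. CopyServiceSketch is a hypothetical name, and it implements plain java.lang.Runnable as a stand-in: monq's actual Service interface is not reproduced here, so treat this as an illustration of the pattern rather than a drop-in implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical stand-in for a Service; uses plain Runnable, not monq's interface. */
class CopyServiceSketch implements Runnable {
    private final InputStream in;
    private final OutputStream out;
    private IOException failure;

    CopyServiceSketch(InputStream in, OutputStream out) {
        // Only store the references: no reading, nothing that can block.
        this.in = in;
        this.out = out;
    }

    @Override
    public void run() {
        byte[] buf = new byte[4096];
        try {
            int n;
            // All connection-related work happens here, not in the constructor.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            out.flush(); // flush, but leave closing the streams to the caller
        } catch (IOException e) {
            failure = e; // remember the problem for whoever started the service
        }
    }

    /** Returns the exception that ended run(), or null if it finished normally. */
    IOException getFailure() {
        return failure;
    }
}
```

A factory method following this pattern would do little more than return new CopyServiceSketch(in, out), leaving the caller responsible for eventually closing both streams.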
Specified by:
createService in interface ServiceFactory
Parameters:
p - an arbitrary parameter object that may be used to tweak the service
created beyond its input and output stream. TcpServer sets it to null,
but when stacking up service factories (like with FilterServiceFactory),
this is useful.
Throws:
ServiceCreateException - if the service is permanently unavailable. To
indicate that the service may be created the next time this method is
called, use ServiceUnavailException.

public static void main(java.lang.String[] argv) throws java.io.IOException, CompileDfaException, ReSyntaxException, CommandlineException
run on the command line with -h to get a description.
Throws:
java.io.IOException
CompileDfaException
ReSyntaxException
CommandlineException