public class EncodingDetector
extends java.lang.Object
provides static methods to guess the character encoding used in an InputStream supposedly containing XML or HTML.
Note: Only parts of the recommendation mentioned below are implemented.
Modifier and Type | Field and Description |
---|---|
static java.lang.String | defaultEnc: the platform's default encoding, determined by opening an InputStreamReader on System.in and asking for its encoding. |
Modifier and Type | Method and Description |
---|---|
static java.lang.String | detect(java.io.InputStream in): tries to detect the character encoding used in the given InputStream within the first 1000 bytes. |
static java.lang.String | detect(java.io.InputStream in, int limit, java.lang.String deflt): reads up to limit bytes from the given input stream to find out the character encoding used. |
public static final java.lang.String defaultEnc
the platform's default encoding, determined by opening an InputStreamReader on System.in and asking for its encoding.
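The description above can be illustrated with a minimal sketch. The class and method names (`DefaultEncodingSketch`, `platformDefault`) are hypothetical and not part of this API; only `InputStreamReader.getEncoding()` is a standard JDK call. Note that `getEncoding()` may return a historical name such as `UTF8` rather than the canonical charset name.

```java
import java.io.InputStreamReader;

class DefaultEncodingSketch {
    // Hypothetical helper: wraps System.in in an InputStreamReader and asks
    // the reader which encoding it uses. The reader is deliberately not
    // closed, because closing it would also close System.in.
    static String platformDefault() {
        return new InputStreamReader(System.in).getEncoding();
    }

    public static void main(String[] args) {
        System.out.println(platformDefault());
    }
}
```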
public static java.lang.String detect(java.io.InputStream in, int limit, java.lang.String deflt) throws java.io.IOException
reads up to limit bytes from the given input stream to find out the character encoding used. If no encoding can be derived, the given default value is returned.
This method implements a partial and slightly modified version of the recommendations described in the XML specification.
With a Byte Order Mark: The Byte Order Mark rules of the recommendation seem to apply to HTML as well. If any Byte Order Mark is recognized, this method returns immediately with the appropriate encoding name. However, for UCS-4 with unusual octet order, deflt is returned for lack of a useful Java encoding name. The possible return values in this case are UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE and UTF-8.
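As a rough illustration of the Byte Order Mark step, the following sketch checks the byte patterns listed in the XML specification's appendix on encoding autodetection. It is not the code of this class: `BomSniffer` and `fromBom` are hypothetical names, and returning `null` for "no Byte Order Mark" is an assumption made so that the caller can continue with the next step.

```java
class BomSniffer {
    /**
     * Returns the encoding implied by a leading Byte Order Mark, deflt for
     * UCS-4 in an unusual octet order, or null if no mark is present.
     */
    static String fromBom(byte[] head, String deflt) {
        int b0 = at(head, 0), b1 = at(head, 1), b2 = at(head, 2), b3 = at(head, 3);
        // Four-byte marks are tested first because FF FE 00 00 and
        // FE FF 00 00 would otherwise be mistaken for UTF-16 marks.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UTF-32BE";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UTF-32LE";
        // UCS-4 with unusual octet orders (2143 and 3412): no useful Java name.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFF && b3 == 0xFE) return deflt;
        if (b0 == 0xFE && b1 == 0xFF && b2 == 0x00 && b3 == 0x00) return deflt;
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        return null;
    }

    // Unsigned byte at index i, or -1 when the array is too short.
    private static int at(byte[] b, int i) {
        return i < b.length ? (b[i] & 0xFF) : -1;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<'};
        System.out.println(fromBom(utf8, "ISO-8859-1"));
    }
}
```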
Without a Byte Order Mark: To cover HTML too, the position of the '<' byte (0x3C) is first located within the first four bytes. This is taken as an indication of how many bytes have to be read per character and which of them contains the ASCII equivalent of the character, at least until the encoding name has been found. If no 0x3C is found, deflt is returned immediately. In particular, this means that EBCDIC is not handled by this implementation.
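How the position of 0x3C maps to a byte layout is not spelled out here, so the sketch below is only one plausible reading. It assumes that the zero bytes around '<' reveal the character width: three zero bytes suggest four bytes per character, two suggest two, and anything else is treated as a single-byte, ASCII-compatible encoding. `ByteLayoutSketch` and `guessLayout` are hypothetical names.

```java
class ByteLayoutSketch {
    /**
     * Returns {bytesPerChar, asciiOffset} guessed from the first four bytes,
     * or null if none of them is '<' (0x3C).
     */
    static int[] guessLayout(byte[] head) {
        int n = Math.min(head.length, 4);
        int pos = -1;
        int zeros = 0;
        for (int i = 0; i < n; i++) {
            if (head[i] == 0x3C && pos < 0) pos = i;
            if (head[i] == 0) zeros++;
        }
        if (pos < 0) return null;                        // caller falls back to deflt
        if (zeros == 3) return new int[]{4, pos};        // e.g. 00 00 00 3C
        if (zeros == 2) return new int[]{2, pos % 2};    // e.g. 00 3C 00 3F
        return new int[]{1, 0};                          // e.g. 3C 3F 78 6D
    }

    public static void main(String[] args) {
        int[] layout = guessLayout(new byte[]{0, 0x3C, 0, 0x3F});
        System.out.println(layout[0] + " byte(s) per character, ASCII at offset " + layout[1]);
    }
}
```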
After the byte setup has been guessed, the input stream is scanned for up to limit bytes to find either an XML declaration or an HTML meta tag which describes the content type and character set used. The possible return values are whatever was found as either encoding (XML) or charset (HTML) in the file.
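The scan for a declared encoding might look roughly like the sketch below. It is an illustration under assumptions, not the implementation of this class: the names `DeclaredEncodingSketch`, `asciiView` and `declaredEncoding` are hypothetical, the regular expressions are deliberately simple, and `asciiView` assumes a byte layout given as bytes per character plus the offset of the ASCII byte.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DeclaredEncodingSketch {
    // encoding="..." inside an XML declaration.
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");
    // charset=... anywhere inside an HTML meta tag.
    private static final Pattern HTML_META = Pattern.compile(
            "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([A-Za-z][A-Za-z0-9._:-]*)",
            Pattern.CASE_INSENSITIVE);

    // Picks one byte per character so that multi-byte layouts can be
    // searched as if they were plain ASCII.
    static String asciiView(byte[] data, int bytesPerChar, int asciiOffset) {
        StringBuilder sb = new StringBuilder();
        for (int i = asciiOffset; i < data.length; i += bytesPerChar) {
            sb.append((char) (data[i] & 0xFF));
        }
        return sb.toString();
    }

    // Returns the declared name exactly as written in the file, or null.
    static String declaredEncoding(String head) {
        Matcher m = XML_DECL.matcher(head);
        if (m.find()) return m.group(1);
        m = HTML_META.matcher(head);
        if (m.find()) return m.group(1);
        return null;
    }

    public static void main(String[] args) {
        System.out.println(declaredEncoding("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"));
    }
}
```

Because the name is returned as written in the file, a caller may still want to check it with `java.nio.charset.Charset.isSupported` before using it.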
Under all circumstances, the InputStream is reset to its start before this method returns. This requires that the InputStream supports the mark() method.
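The mark and reset contract described above can be sketched with standard `java.io` calls. `PeekSketch` and `peek` are hypothetical names; the sketch only shows how a bounded look-ahead can leave the stream positioned at its start, and it assumes that callers wrap streams without mark support in a `BufferedInputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class PeekSketch {
    // Reads up to limit bytes and rewinds the stream, even if reading fails.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark()");
        }
        in.mark(limit);
        try {
            byte[] buf = new byte[limit];
            int n = 0;
            while (n < limit) {
                int r = in.read(buf, n, limit - n);
                if (r < 0) break;          // end of stream before limit
                n += r;
            }
            return Arrays.copyOf(buf, n);
        } finally {
            in.reset();                    // back to the marked position
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("<html></html>".getBytes("US-ASCII"));
        System.out.println(peek(in, 5).length + " bytes peeked, next byte is " + (char) in.read());
    }
}
```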
Parameters:
in - the input stream to read
limit - maximum number of bytes to read for guessing
deflt - a default value to return when nothing can be guessed; consider passing in defaultEnc
Returns:
the encoding name found, or deflt if no encoding can be derived
Throws:
java.lang.IllegalArgumentException - if in does not support the mark() method
java.io.IOException
public static java.lang.String detect(java.io.InputStream in) throws java.io.IOException
tries to detect the character encoding used in the given InputStream within the first 1000 bytes. It returns defaultEnc if no encoding can be guessed.
Throws:
java.io.IOException
See Also:
detect(InputStream,int,String)