org.apache.tika.sax
Class SafeContentHandler
java.lang.Object
org.xml.sax.helpers.DefaultHandler
org.apache.tika.sax.ContentHandlerDecorator
org.apache.tika.sax.SafeContentHandler
- All Implemented Interfaces:
- ContentHandler, DTDHandler, EntityResolver, ErrorHandler
- Direct Known Subclasses:
- XHTMLContentHandler, XMPContentHandler
public class SafeContentHandler
- extends ContentHandlerDecorator
Content handler decorator that makes sure that the character events
(characters(char[], int, int) or
ignorableWhitespace(char[], int, int)) passed to the decorated
content handler contain only valid XML characters. All invalid characters
are replaced with spaces.
The XML standard defines the following Unicode character ranges as
valid XML characters:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Note that currently this class only detects those invalid characters whose
UTF-16 representation fits a single char. Also, this class does not ensure
that the UTF-16 encoding of incoming characters is correct.
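The character ranges above can be checked with a few comparisons. The following is a minimal standalone sketch, not the Tika implementation: the class name XmlCharFilterSketch and its sanitize helper are hypothetical, and it works on a String rather than on SAX events. It walks the text by code point, so a well-formed surrogate pair is treated as one valid character, while a lone surrogate falls into the invalid range and is replaced.

```java
// Standalone sketch of the filtering rule; hypothetical names, not Tika code.
final class XmlCharFilterSketch {

    private XmlCharFilterSketch() {
    }

    /** Returns true if the code point is outside the XML 1.0 Char production. */
    static boolean isInvalid(int ch) {
        if (ch < 0x20) {
            // Only tab, line feed and carriage return are allowed below #x20.
            return ch != 0x9 && ch != 0xA && ch != 0xD;
        } else if (ch < 0xE000) {
            // #xD800-#xDFFF is the surrogate block.
            return ch > 0xD7FF;
        } else if (ch < 0x10000) {
            // #xFFFE and #xFFFF are excluded.
            return ch > 0xFFFD;
        }
        return ch > 0x10FFFF;
    }

    /** Returns a copy of the text with every invalid character replaced by a space. */
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isInvalid(cp)) {
                out.append(' ');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Every code point this sketch rejects fits in a single char, which matches the limitation noted above.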
Nested Class Summary
protected static interface SafeContentHandler.Output
- Internal interface that allows both character and
ignorable whitespace content to be filtered the same way.
Methods inherited from class org.xml.sax.helpers.DefaultHandler
- error, fatalError, notationDecl, resolveEntity, unparsedEntityDecl, warning
SafeContentHandler
public SafeContentHandler(ContentHandler handler)
isInvalid
protected boolean isInvalid(int ch)
- Checks whether the given Unicode character is an invalid XML character
and should be replaced for output. Subclasses can override this method
to use an alternative definition of which characters should be replaced
in the XML output. The default definition from the XML specification is:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
- Parameters:
ch - character
- Returns:
true if the character should be replaced,
false otherwise
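As an illustration of overriding this hook, the sketch below tightens the default definition so that it also rejects the control ranges #x7F-#x84 and #x86-#x9F, which the XML 1.0 specification discourages. To stay self-contained it does not extend the real SafeContentHandler: FilteringHandlerSketch is a hypothetical stand-in that models only the isInvalid hook and the characters event, and StrictHandlerSketch and its filter helper are likewise made up for this example.

```java
import java.util.Arrays;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the documented class, reduced to one hook.
class FilteringHandlerSketch extends DefaultHandler {
    private final ContentHandler delegate;

    FilteringHandlerSketch(ContentHandler delegate) {
        this.delegate = delegate;
    }

    /** Default definition: anything outside the XML 1.0 Char production. */
    protected boolean isInvalid(int ch) {
        if (ch < 0x20) {
            return ch != 0x9 && ch != 0xA && ch != 0xD;
        } else if (ch < 0xE000) {
            return ch > 0xD7FF;
        } else if (ch < 0x10000) {
            return ch > 0xFFFD;
        }
        return ch > 0x10FFFF;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Work on a copy so the caller's buffer is left untouched.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        int i = 0;
        while (i < copy.length) {
            int cp = Character.codePointAt(copy, i);
            int n = Character.charCount(cp);
            if (isInvalid(cp)) {
                // One space per char of the rejected code point.
                Arrays.fill(copy, i, i + n, ' ');
            }
            i += n;
        }
        delegate.characters(copy, 0, copy.length);
    }
}

// Subclass with a stricter definition of "invalid".
class StrictHandlerSketch extends FilteringHandlerSketch {
    StrictHandlerSketch(ContentHandler delegate) {
        super(delegate);
    }

    @Override
    protected boolean isInvalid(int ch) {
        return super.isInvalid(ch)
                || (ch >= 0x7F && ch <= 0x84)
                || (ch >= 0x86 && ch <= 0x9F);
    }

    /** Convenience helper for trying the filter on a plain string. */
    static String filter(String text) {
        StringBuilder sink = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sink.append(ch, start, length);
            }
        };
        try {
            new StrictHandlerSketch(collector).characters(text.toCharArray(), 0, text.length());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }
}
```

With the real class, the same override would presumably sit in a subclass of SafeContentHandler and call super.isInvalid(ch) in the same way.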
writeReplacement
protected void writeReplacement(SafeContentHandler.Output output)
throws SAXException
- Outputs the replacement for an invalid character. Subclasses can
override this method to use a custom replacement.
- Parameters:
output - where the replacement is written to
- Throws:
SAXException - if the replacement could not be written
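A custom replacement might look like the sketch below, which emits U+FFFD (the Unicode replacement character, itself a valid XML character) instead of a space. The method signature of SafeContentHandler.Output is not shown on this page, so the Output interface here is a hypothetical stand-in with an assumed write(char[], int, int) method, and ReplacementSketch is a stand-in base class rather than the real SafeContentHandler.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in base class modelling the writeReplacement hook.
class ReplacementSketch extends DefaultHandler {

    /** Assumed shape of the output abstraction; not taken from the real API. */
    protected interface Output {
        void write(char[] ch, int start, int length) throws SAXException;
    }

    private final ContentHandler delegate;

    ReplacementSketch(ContentHandler delegate) {
        this.delegate = delegate;
    }

    private static boolean isInvalid(int ch) {
        if (ch < 0x20) {
            return ch != 0x9 && ch != 0xA && ch != 0xD;
        } else if (ch < 0xE000) {
            return ch > 0xD7FF;
        } else if (ch < 0x10000) {
            return ch > 0xFFFD;
        }
        return ch > 0x10FFFF;
    }

    /** Default replacement: a single space. */
    protected void writeReplacement(Output output) throws SAXException {
        output.write(new char[] {' '}, 0, 1);
    }

    // Writes runs of valid characters unchanged and a replacement per invalid one.
    private void filter(char[] ch, int start, int length, Output output) throws SAXException {
        int end = start + length;
        int runStart = start;
        int i = start;
        while (i < end) {
            int cp = Character.codePointAt(ch, i, end);
            int next = i + Character.charCount(cp);
            if (isInvalid(cp)) {
                if (i > runStart) {
                    output.write(ch, runStart, i - runStart);
                }
                writeReplacement(output);
                runStart = next;
            }
            i = next;
        }
        if (end > runStart) {
            output.write(ch, runStart, end - runStart);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        filter(ch, start, length, delegate::characters);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        filter(ch, start, length, delegate::ignorableWhitespace);
    }
}

// Subclass that writes U+FFFD instead of a space.
class ReplacementCharHandlerSketch extends ReplacementSketch {
    ReplacementCharHandlerSketch(ContentHandler delegate) {
        super(delegate);
    }

    @Override
    protected void writeReplacement(Output output) throws SAXException {
        output.write(new char[] {'\uFFFD'}, 0, 1);
    }

    /** Convenience helper for trying the handler on a plain string. */
    static String run(String text) {
        StringBuilder sink = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sink.append(ch, start, length);
            }
        };
        try {
            new ReplacementCharHandlerSketch(collector).characters(text.toCharArray(), 0, text.length());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }
}
```

Routing both event types through one Output abstraction means a single override of writeReplacement covers character data and ignorable whitespace alike.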
startElement
public void startElement(String uri,
String localName,
String name,
Attributes atts)
throws SAXException
- Specified by:
startElement in interface ContentHandler
- Overrides:
startElement in class ContentHandlerDecorator
- Throws:
SAXException
endElement
public void endElement(String uri,
String localName,
String name)
throws SAXException
- Specified by:
endElement in interface ContentHandler
- Overrides:
endElement in class ContentHandlerDecorator
- Throws:
SAXException
endDocument
public void endDocument()
throws SAXException
- Specified by:
endDocument in interface ContentHandler
- Overrides:
endDocument in class ContentHandlerDecorator
- Throws:
SAXException
characters
public void characters(char[] ch,
int start,
int length)
throws SAXException
- Specified by:
characters in interface ContentHandler
- Overrides:
characters in class ContentHandlerDecorator
- Throws:
SAXException
ignorableWhitespace
public void ignorableWhitespace(char[] ch,
int start,
int length)
throws SAXException
- Specified by:
ignorableWhitespace in interface ContentHandler
- Overrides:
ignorableWhitespace in class ContentHandlerDecorator
- Throws:
SAXException
Copyright © 2007-2012 The Apache Software Foundation. All Rights Reserved.