org.apache.tika.parser.html
Class BoilerpipeContentHandler
java.lang.Object
de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
org.apache.tika.parser.html.BoilerpipeContentHandler
- All Implemented Interfaces:
- ContentHandler
public class BoilerpipeContentHandler
- extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Uses the boilerpipe library to automatically extract the main content
from a web page. To use it, pass an instance as the ContentHandler argument of
HtmlParser.parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext)
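The usage described above might look roughly like the following sketch. It assumes Tika and boilerpipe are on the classpath and uses Tika's BodyContentHandler as the delegate that collects the extracted text. The class name MainContentExample and the method extractMainContent are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class MainContentExample {

    /** Hypothetical helper: returns whatever text boilerpipe classifies as main content. */
    public static String extractMainContent(String html) throws Exception {
        // The delegate collects the text that boilerpipe lets through.
        BodyContentHandler text = new BodyContentHandler();
        // Single-argument constructor: uses the DefaultExtractor rules.
        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(text);

        try (InputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Some article text for the extractor to inspect.</p></body></html>";
        System.out.println(extractMainContent(html));
    }
}
```

Because extraction is heuristic, very short or unusual pages may yield little or no text; the sketch makes no claim about what a given page produces.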
| Methods inherited from class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler |
addTextBlock, addWhitespaceIfNecessary, endPrefixMapping, getTitle, ignorableWhitespace, processingInstruction, recycle, setDocumentLocator, setTitle, skippedEntity, toTextDocument |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate)
- Creates a new boilerpipe-based content extractor that uses the
DefaultExtractor extraction rules and passes the extracted main content to "delegate".
- Parameters:
delegate - the ContentHandler that receives the extracted main content
BoilerpipeContentHandler
public BoilerpipeContentHandler(Writer writer)
- Creates a content handler that writes XHTML body character events to
the given writer.
- Parameters:
writer - the Writer that receives the XHTML body character events
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate,
de.l3s.boilerpipe.BoilerpipeExtractor extractor)
- Creates a new boilerpipe-based content extractor, using the given
extraction rules. The extracted main content will be passed to the
content handler.
- Parameters:
delegate - The ContentHandler object
extractor - Extraction rules to use, e.g. ArticleExtractor
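As a hedged illustration of the two-argument constructor, the sketch below supplies boilerpipe's ArticleExtractor in place of the default rules. It assumes Tika and boilerpipe are on the classpath; the class name ArticleExtractionExample and the method extractArticle are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import de.l3s.boilerpipe.extractors.ArticleExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class ArticleExtractionExample {

    /** Hypothetical helper: extracts main content using the ArticleExtractor rules. */
    public static String extractArticle(String html) throws Exception {
        BodyContentHandler text = new BodyContentHandler();
        // Second argument selects the extraction rules explicitly.
        BoilerpipeContentHandler handler =
                new BoilerpipeContentHandler(text, ArticleExtractor.INSTANCE);

        try (InputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return text.toString();
    }
}
```

Any other BoilerpipeExtractor implementation can be passed the same way; which rules suit a given site is something to determine by experiment.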
setIncludeMarkup
public void setIncludeMarkup(boolean includeMarkup)
isIncludeMarkup
public boolean isIncludeMarkup()
startDocument
public void startDocument()
throws SAXException
- Specified by:
startDocument in interface ContentHandler
- Overrides:
startDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
startPrefixMapping
public void startPrefixMapping(String prefix,
String uri)
throws SAXException
- Specified by:
startPrefixMapping in interface ContentHandler
- Overrides:
startPrefixMapping in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
startElement
public void startElement(String uri,
String localName,
String qName,
Attributes atts)
throws SAXException
- Specified by:
startElement in interface ContentHandler
- Overrides:
startElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
characters
public void characters(char[] chars,
int offset,
int length)
throws SAXException
- Specified by:
characters in interface ContentHandler
- Overrides:
characters in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
endElement
public void endElement(String uri,
String localName,
String qName)
throws SAXException
- Specified by:
endElement in interface ContentHandler
- Overrides:
endElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
endDocument
public void endDocument()
throws SAXException
- Specified by:
endDocument in interface ContentHandler
- Overrides:
endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
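The methods overridden above are the standard SAX ContentHandler callbacks. The following sketch uses only JDK classes (not Tika or boilerpipe) to show the order in which a SAX parser typically invokes them on a small document. EventRecorder and SaxEventOrder are hypothetical names used for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventOrder {

    /** Hypothetical handler that records the name of each callback it receives. */
    static class EventRecorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startDocument() {
            events.add("startDocument");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("startElement:" + qName);
        }

        @Override
        public void characters(char[] chars, int offset, int length) {
            // A parser may deliver one text run in several chunks.
            events.add("characters:" + new String(chars, offset, length));
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement:" + qName);
        }

        @Override
        public void endDocument() {
            events.add("endDocument");
        }
    }

    /** Parses the given XML string and returns the recorded callback sequence. */
    public static List<String> record(String xml) throws Exception {
        EventRecorder recorder = new EventRecorder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        return recorder.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : record("<p>Hello</p>")) {
            System.out.println(event);
        }
    }
}
```

A handler such as BoilerpipeContentHandler receives the same kind of event stream from HtmlParser and decides which character runs to forward to its delegate.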
Copyright © 2007-2011 The Apache Software Foundation. All Rights Reserved.