java.lang.Object
  de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
    org.apache.tika.parser.html.BoilerpipeContentHandler

public class BoilerpipeContentHandler
extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Uses the boilerpipe library to automatically extract the main content from a web page.
Use this as a ContentHandler object passed to HtmlParser.parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
| Constructor Summary | |
|---|---|
| BoilerpipeContentHandler(ContentHandler delegate) | Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler. |
| BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor) | Creates a new boilerpipe-based content extractor, using the given extraction rules. |
| BoilerpipeContentHandler(Writer writer) | Creates a content handler that writes XHTML body character events to the given writer. |
| Method Summary | |
|---|---|
| void | characters(char[] chars, int offset, int length) |
| void | endDocument() |
| void | endElement(String uri, String localName, String qName) |
| de.l3s.boilerpipe.document.TextDocument | getTextDocument() - Retrieves the built TextDocument. |
| boolean | isIncludeMarkup() |
| void | setIncludeMarkup(boolean includeMarkup) |
| void | startDocument() |
| void | startElement(String uri, String localName, String qName, Attributes atts) |
| void | startPrefixMapping(String prefix, String uri) |
| Methods inherited from class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler |
|---|
| addTextBlock, addWhitespaceIfNecessary, endPrefixMapping, getTitle, ignorableWhitespace, processingInstruction, recycle, setDocumentLocator, setTitle, skippedEntity, toTextDocument |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
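public BoilerpipeContentHandler(ContentHandler delegate)
Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
Parameters:
delegate - The ContentHandler object

public BoilerpipeContentHandler(Writer writer)
Creates a content handler that writes XHTML body character events to the given writer.
Parameters:
writer - writer

public BoilerpipeContentHandler(ContentHandler delegate,
de.l3s.boilerpipe.BoilerpipeExtractor extractor)
Creates a new boilerpipe-based content extractor, using the given extraction rules.
Parameters:
delegate - The ContentHandler object
extractor - Extraction rules to use, e.g. ArticleExtractor

| Method Detail |
|---|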
public void setIncludeMarkup(boolean includeMarkup)
public boolean isIncludeMarkup()
public de.l3s.boilerpipe.document.TextDocument getTextDocument()
public void startDocument()
throws SAXException
Specified by: startDocument in interface ContentHandler
Overrides: startDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws: SAXException
public void startPrefixMapping(String prefix,
String uri)
throws SAXException
Specified by: startPrefixMapping in interface ContentHandler
Overrides: startPrefixMapping in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws: SAXException
public void startElement(String uri,
String localName,
String qName,
Attributes atts)
throws SAXException
Specified by: startElement in interface ContentHandler
Overrides: startElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws: SAXException
public void characters(char[] chars,
int offset,
int length)
throws SAXException
Specified by: characters in interface ContentHandler
Overrides: characters in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws: SAXException
public void endElement(String uri,
String localName,
String qName)
throws SAXException
Specified by: endElement in interface ContentHandler
Overrides: endElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws: SAXException
public void endDocument()
throws SAXException
Specified by: endDocument in interface ContentHandler
Overrides: endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws: SAXException
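
The callbacks listed above all come from the SAX ContentHandler contract. As a rough illustration of the order in which a parser delivers those events, here is a minimal sketch that uses only the JDK's built-in SAX parser. It is not Tika's or boilerpipe's implementation: SaxEventOrderDemo and RecordingHandler are hypothetical names introduced for this example, and the JDK XML parser stands in for HtmlParser, so the input must be well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Tika or boilerpipe.
class SaxEventOrderDemo {

    /** Records each structural callback and accumulates character data. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void startDocument() {
            events.add("startDocument");
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            events.add("startElement:" + qName);
        }

        @Override
        public void characters(char[] chars, int offset, int length) {
            // A parser may split text across several calls, so append.
            text.append(chars, offset, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement:" + qName);
        }

        @Override
        public void endDocument() {
            events.add("endDocument");
        }
    }

    /** Parses the given well-formed markup and returns the populated handler. */
    static RecordingHandler parse(String xml) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        RecordingHandler h = parse("<html><body><p>Main content</p></body></html>");
        System.out.println(h.events);
        System.out.println(h.text);
    }
}
```

With Tika and boilerpipe on the classpath, a BoilerpipeContentHandler wrapping a delegate would presumably take the place of RecordingHandler and be passed to HtmlParser.parse, as the class description indicates.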