Package org.apache.tika.parser.microsoft
Class EMFParser
java.lang.Object
org.apache.tika.parser.microsoft.EMFParser
All Implemented Interfaces:
Serializable, org.apache.tika.parser.Parser
Extracts files embedded in EMF and offers a
very rough capability to extract text if there
is text stored in the EMF.
To improve text extraction, we'd have to implement
quite a bit more at the POI level. We'd want to track changes
in font and use that information for identifying character sets,
inserting spaces and new lines.
We're also relying on storage order for text order, which isn't reliable.
We'd have to do something like what PDFBox or XPS do: sort the
runs by position and then reassemble the text from the pieces.
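As an illustration of the run-sorting idea above, here is a minimal sketch in plain Java. It is not how Tika, POI, or PDFBox implement this: TextRunSorter, TextRun, toText, and the lineTolerance parameter are all hypothetical names, and the sketch assumes each run carries the x/y position where it was drawn, with y increasing down the page. It groups runs into lines by y, orders each line by x, and joins runs with a single space, which is a simplification (a real implementation would likely look at gaps and font metrics).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these types exist in Tika or POI.
final class TextRunSorter {

    /** A piece of text plus the (assumed) position where it was drawn. */
    static final class TextRun {
        final String text;
        final double x;
        final double y;

        TextRun(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Orders runs top-to-bottom, then left-to-right, and joins them.
     * A run starts a new line when its y differs from the y of the
     * current line's first run by more than lineTolerance.
     */
    static String toText(List<TextRun> runs, double lineTolerance) {
        // Copy so the caller's list (storage order) is left untouched.
        List<TextRun> sorted = new ArrayList<>(runs);
        sorted.sort(Comparator.comparingDouble((TextRun r) -> r.y));

        StringBuilder out = new StringBuilder();
        List<TextRun> line = new ArrayList<>();
        double lineY = 0;
        for (TextRun run : sorted) {
            if (!line.isEmpty() && Math.abs(run.y - lineY) > lineTolerance) {
                appendLine(out, line);
                line.clear();
            }
            if (line.isEmpty()) {
                lineY = run.y;
            }
            line.add(run);
        }
        if (!line.isEmpty()) {
            appendLine(out, line);
        }
        return out.toString();
    }

    /** Sorts one line by x and appends it, separating lines with '\n'. */
    private static void appendLine(StringBuilder out, List<TextRun> line) {
        line.sort(Comparator.comparingDouble((TextRun r) -> r.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (int i = 0; i < line.size(); i++) {
            if (i > 0) {
                out.append(' '); // simplification: always one space between runs
            }
            out.append(line.get(i).text);
        }
    }

    public static void main(String[] args) {
        // Runs deliberately listed out of reading order, as storage order might be.
        List<TextRun> runs = Arrays.asList(
                new TextRun("world", 50, 10),
                new TextRun("second", 0, 30),
                new TextRun("Hello", 0, 10.5),
                new TextRun("line", 40, 30));
        System.out.println(toText(runs, 2.0));
    }
}
```

Grouping into lines before sorting by x avoids a single comparator that mixes a y tolerance with x ordering; such a comparator is not transitive and can make the sort misbehave.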
Field Summary
Fields
Modifier and Type | Field | Description
static org.apache.tika.metadata.Property | EMF_ICON_ONLY |
static org.apache.tika.metadata.Property | EMF_ICON_STRING |
Constructor Summary
Constructors
EMFParser()
Method Summary
Modifier and Type | Method | Description
Set<org.apache.tika.mime.MediaType> | getSupportedTypes(org.apache.tika.parser.ParseContext context) |
void | parse(InputStream stream, ContentHandler handler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext context) |
Field Details
EMF_ICON_ONLY
public static org.apache.tika.metadata.Property EMF_ICON_ONLY
EMF_ICON_STRING
public static org.apache.tika.metadata.Property EMF_ICON_STRING
Constructor Details
EMFParser
public EMFParser()
Method Details
getSupportedTypes
public Set<org.apache.tika.mime.MediaType> getSupportedTypes(org.apache.tika.parser.ParseContext context)
Specified by:
getSupportedTypes in interface org.apache.tika.parser.Parser
parse
public void parse(InputStream stream, ContentHandler handler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext context) throws IOException, SAXException, org.apache.tika.exception.TikaException
Specified by:
parse in interface org.apache.tika.parser.Parser
Throws:
IOException
SAXException
org.apache.tika.exception.TikaException