Class EMFParser

java.lang.Object
org.apache.tika.parser.microsoft.EMFParser
All Implemented Interfaces:
Serializable, org.apache.tika.parser.Parser

public class EMFParser extends Object implements org.apache.tika.parser.Parser
Extracts files embedded in an EMF (Enhanced Metafile) and offers very rough text extraction when the EMF stores text.

To improve text extraction, we'd have to implement quite a bit more at the POI level. We'd want to track font changes and use that information to identify character sets and to insert spaces and new lines.
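
As a rough illustration of the "inserting spaces and new lines" idea, the sketch below picks a separator between two consecutive text runs from their geometry and font height. Everything in it is an assumption made for illustration: the `Run` class, its fields (including a known run width), and the 0.5 and 0.25 thresholds are hypothetical and are not part of Tika or POI. Character set identification is not covered.

```java
import java.util.Arrays;
import java.util.List;

public class RunJoiner {

    /** Hypothetical text run; these fields are NOT taken from the Tika or POI API. */
    public static final class Run {
        final String text;
        final double x;          // left edge of the run
        final double y;          // vertical position of the run
        final double width;      // assumed to be known for the run
        final double fontHeight; // height of the font in effect for the run

        public Run(String text, double x, double y, double width, double fontHeight) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.fontHeight = fontHeight;
        }
    }

    /** Chooses "", " " or "\n" to place between two consecutive runs. */
    public static String separator(Run prev, Run next) {
        // A vertical jump larger than half the font height is treated as a new line.
        if (Math.abs(next.y - prev.y) > 0.5 * prev.fontHeight) {
            return "\n";
        }
        // A horizontal gap wider than a quarter of the font height is treated as a space.
        double gap = next.x - (prev.x + prev.width);
        return gap > 0.25 * prev.fontHeight ? " " : "";
    }

    /** Joins runs in the given order, inserting the separators chosen above. */
    public static String join(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        Run prev = null;
        for (Run run : runs) {
            if (prev != null) {
                sb.append(separator(prev, run));
            }
            sb.append(run.text);
            prev = run;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("Hel", 0, 0, 30, 12),
                new Run("lo", 30, 0, 20, 12),
                new Run("world", 56, 0, 50, 12),
                new Run("Next", 0, 14, 40, 12));
        System.out.println(join(runs));
    }
}
```

The thresholds are scaled by font height so that the same rule applies to small and large text; real EMF content would likely need tuning per font.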

We're also relying on storage order for text order, which isn't reliable. We'd have to do something like what PDFBox or the XPS parser do: sort the runs and then put the cow back together from the hamburger.
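
A minimal sketch of what sorting the runs could look like, assuming each run carries a position: group runs into lines by vertical position, then order each line left to right. The `TextRun` class, the `lineTolerance` parameter, and the single-space join are hypothetical choices for illustration; this is not how EMFParser, PDFBox, or the XPS parser are implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RunSorter {

    /** Hypothetical text run: a string plus the position where it was drawn. */
    public static final class TextRun {
        final String text;
        final double x;
        final double y;

        public TextRun(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Rebuilds text in reading order (top to bottom, left to right).
     * Runs within lineTolerance of the first run of a line share that line.
     */
    public static String toReadingOrderText(List<TextRun> runs, double lineTolerance) {
        // Sort a copy by vertical position so the caller's storage order is untouched.
        List<TextRun> byY = new ArrayList<>(runs);
        byY.sort(Comparator.comparingDouble((TextRun r) -> r.y));

        StringBuilder out = new StringBuilder();
        List<TextRun> line = new ArrayList<>();
        double lineY = 0;
        for (TextRun run : byY) {
            // Start a new line once a run sits too far below the current line.
            if (!line.isEmpty() && run.y - lineY > lineTolerance) {
                appendLine(out, line);
                line.clear();
            }
            if (line.isEmpty()) {
                lineY = run.y;
            }
            line.add(run);
        }
        appendLine(out, line);
        return out.toString();
    }

    /** Sorts one line left to right and appends it, separating lines with '\n'. */
    private static void appendLine(StringBuilder out, List<TextRun> line) {
        if (line.isEmpty()) {
            return;
        }
        line.sort(Comparator.comparingDouble((TextRun r) -> r.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (int i = 0; i < line.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(line.get(i).text);
        }
    }

    public static void main(String[] args) {
        // Runs listed in an arbitrary storage order.
        List<TextRun> runs = Arrays.asList(
                new TextRun("world", 50, 10),
                new TextRun("second", 0, 30),
                new TextRun("Hello", 0, 10.5),
                new TextRun("line", 60, 30));
        System.out.println(toReadingOrderText(runs, 2.0));
    }
}
```

Grouping into lines before sorting by x avoids a single comparator with a tolerance, which would not be a consistent ordering.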

  • Field Details

    • EMF_ICON_ONLY

      public static org.apache.tika.metadata.Property EMF_ICON_ONLY
    • EMF_ICON_STRING

      public static org.apache.tika.metadata.Property EMF_ICON_STRING
  • Constructor Details

    • EMFParser

      public EMFParser()
  • Method Details

    • getSupportedTypes

      public Set<org.apache.tika.mime.MediaType> getSupportedTypes(org.apache.tika.parser.ParseContext context)
      Specified by:
      getSupportedTypes in interface org.apache.tika.parser.Parser
    • parse

      public void parse(InputStream stream, ContentHandler handler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext context) throws IOException, SAXException, org.apache.tika.exception.TikaException
      Specified by:
      parse in interface org.apache.tika.parser.Parser
      Throws:
      IOException
      SAXException
      org.apache.tika.exception.TikaException