Rokkon Parser Module

Status: Module is running

Overview

The Parser Module is a document processing service that extracts text and metadata from documents in a variety of formats. It uses Apache Tika to parse each document and makes the extracted content available to later steps in the Rokkon pipeline.

Supported Document Types

The Parser Module supports a wide range of document formats through Apache Tika.

API

The Parser Module implements the standard Rokkon PipeStepProcessor gRPC service interface. It receives documents through gRPC and returns the parsed content and metadata.

Configuration

The Parser Module can be configured with various options to control the parsing process:

{
  "maxContentLength": 10000000,
  "extractMetadata": true,
  "enableTitleExtraction": true,
  "disableEmfParser": false,
  "enableGeoTopicParser": false,
  "logParsingErrors": true
}
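
The options above can be checked before they are handed to the module. The following is a minimal Python sketch of such a check, not the module's own loader: the ParserConfig class and the load_parser_config function are hypothetical names introduced for illustration, and the defaults simply copy the sample values shown above, which may or may not match the module's real defaults. Only the six option names come from the configuration itself.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ParserConfig:
    """Hypothetical holder for the six options listed in the sample config."""

    # Defaults copy the sample values; the module's actual defaults are an assumption.
    maxContentLength: int = 10_000_000
    extractMetadata: bool = True
    enableTitleExtraction: bool = True
    disableEmfParser: bool = False
    enableGeoTopicParser: bool = False
    logParsingErrors: bool = True


def load_parser_config(text: str) -> ParserConfig:
    """Parse a JSON string into a ParserConfig.

    Raises ValueError for non-object input, unknown keys, or wrong value types.
    Keys that are absent fall back to the dataclass defaults.
    """
    raw = json.loads(text)
    if not isinstance(raw, dict):
        raise ValueError("config must be a JSON object")

    # Map each option name to the type of its default value (int or bool).
    expected = {f.name: type(f.default) for f in fields(ParserConfig)}

    for key, value in raw.items():
        if key not in expected:
            raise ValueError(f"unknown option: {key}")
        # Exact type match, so that true/false is not accepted where an
        # integer is expected (bool is a subclass of int in Python).
        if type(value) is not expected[key]:
            raise ValueError(
                f"{key} must be {expected[key].__name__}, "
                f"got {type(value).__name__}"
            )

    return ParserConfig(**raw)
```

For example, load_parser_config('{"maxContentLength": 500}') returns a ParserConfig whose maxContentLength is 500 and whose other fields keep the sample values, while a misspelled key such as "maxContentLenght" raises ValueError instead of being silently ignored.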