Tika On DotNet

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Usage

The recommended way to use this project is to take a dependency on the NuGet packages we produce.

What is this?

This project contains all the .NET assemblies necessary to use the wonderful Tika library in your .NET applications.

Tika is an Apache Software Foundation open source project written in Java. It may sound scary, but with IKVM it is possible to use Java libraries from .NET applications without any TCP sockets or web services getting caught in the crossfire. I’ve done the hard work for you: I built the .NET version of Tika and bundled the supporting IKVM runtime libraries.
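To give a feel for what calling Tika from .NET looks like, here is a minimal sketch. It assumes the TikaOnDotNet.TextExtractor NuGet package is installed and that it exposes a TextExtractor class whose Extract method returns an object with Text, ContentType and Metadata properties. Those names may differ between package versions, and the file name sample.docx is only a placeholder.

```csharp
using System;
using TikaOnDotNet.TextExtractor; // assumed namespace of the TextExtractor package

class Program
{
    static void Main(string[] args)
    {
        // Placeholder path; pass a real document as the first argument.
        var path = args.Length > 0 ? args[0] : "sample.docx";

        // Assumed API: TextExtractor.Extract(string) runs Tika in-process via IKVM.
        var extractor = new TextExtractor();
        var result = extractor.Extract(path);

        // Detected MIME type, followed by the extracted plain text.
        Console.WriteLine(result.ContentType);
        Console.WriteLine(result.Text);

        // Metadata is assumed to be a string-to-string dictionary.
        foreach (var entry in result.Metadata)
        {
            Console.WriteLine($"{entry.Key}: {entry.Value}");
        }
    }
}
```

No separate Java installation or service is involved here: the Tika code runs inside the .NET process through the bundled IKVM runtime.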

Tests

A basic set of unit tests is present in this project to verify that Tika is working. These tests extract text from test documents covering several rich document types.

For more details on how this is accomplished, check out this blog post from @KevM.

Cloning

You can use your favorite git client to clone this repository. Please do!

$ git clone git@github.com:KevM/tikaondotnet.git
$ cd tikaondotnet

Authors and Contributors

This project was created by @KevM to support a project created by @DovetailSoftware.

Support or Contact

If you have any problems, create an issue and we can talk about it.