Tika On DotNet
The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
Usage
It is best to take a dependency on the Nugets we produce:
- TikaOnDotNet.TextExtractor <- start here
- TikaOnDotNet
What is this?
This project contains all the .Net assemblies necessary to use the wonderful Tika library in your .Net applications.
Tika is a Apache Foundation open source project written in Java. It may sound scary but it is possible to leverage Java libraries from .Net applications without any TCP sockets or web services getting caught in the crossfire using IKVM. I’ve done the hard work for you and built the .Net version of Tika for you and bundled the supporting IKVM runtime libraies.
Tests
A basic set of unit tests are present in this project to verify that Tika is working. These tests extract text from test documents. The following rich document types are tested:
- Adobe PDF - .pdf
- Microsoft Word - .doc and .docx
- Microsoft Excel - .xls and .xlsx
- Microsoft PowerPoint - .ppt and .pptx
- Rich Text Format - .rtf
- Zip files - .zip (only a listing of the filenames in the .zip file are extracted)
- JPEG - .jpg (image metadata)
For more details on how this is accomplished checkout this blog post from @KevM
Cloning
You can use your favorite git client to clone this repository. Please do!
$ git clone git@github.com:KevM/tikaondotnet.git
$ cd tikadotnet
Authors and Contributors
This project was created by @KevM to support a project created by @DovetailSoftware.
Support or Contact
If you have any problems. Create an issue and we can talk about it.