How to handle binary files

The previous section explains how to submit pure text files/documents in a structured format. But what happens if your documents are in a binary format? This means an Excel sheet, a Word document, a PowerPoint presentation, a PDF, or even a TXT file whose text you do not want to read and insert into the request body yourself.

If this is your case, we can extract the text content from your files with OCR (Optical Character Recognition) features. There are many such tools available on the Internet that you could use on your end; nonetheless, the CERN Search service offers this feature via Apache Tika.

In the following paragraphs we will explain how to enable and use this feature. Alternatively, you can set up Tika and extract content in your own project.

Elasticsearch content extraction capabilities

Previously there was an option to send binary content as a base64-encoded string and then rely on the ingest pipelines of Elasticsearch to extract the content. This option is no longer available; it has been deprecated since February 2020.

List of supported formats

Currently we support every parser Tika offers except org.apache.tika.parser.ocr.TesseractOCRParser.

Content extraction

Disclaimer

CERN Search is not responsible for providing or storing the files after uploading. The user must provide a URL to access the file, external to the search domain. If no URL is provided, the file's content will be searchable but the result will not be linkable to any file.

Uploading a file is done in two requests: first the record metadata and then the associated file.

Below is the simplest mapping for files, with only one custom field, title.
Observe the required field url. This field must be sent by the user.

JSON schema

{
  "title": "Custom record schema v0.0.1",
  "id": "http://localhost:5000/schemas/test/file_v0.0.1.json",
  "$schema": "http://localhost:5000/schemas/test/file_v0.0.1.json",
  "type": "object",
  "properties": {
    "_data": {
      "type": "object",
      "title": {
        "type": "string",
        "description": "Record title."
      }
    },
    "url": {
      "type": "string"
    },
    ...
  }
}
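
As an illustration, the following sketch validates a candidate record against a schema of this shape using the jsonschema Python package. The simplified schema copy and the sample record are assumptions for local testing only; they are not part of the CERN Search API.

```python
import jsonschema

# Assumed, simplified local copy of the file schema above (for illustration only).
file_schema = {
    "type": "object",
    "properties": {
        "_data": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Record title."}
            },
        },
        "url": {"type": "string"},
    },
    "required": ["url"],
}

# Hypothetical record body: the title lives under _data, the url points to an
# externally hosted copy of the file (CERN Search does not store or serve it).
record = {
    "_data": {"title": "Demo metadata file"},
    "url": "https://cdn.my-external-domain.org/my-demo-file.pdf",
}

# Raises jsonschema.ValidationError if the record does not match the schema.
jsonschema.validate(instance=record, schema=file_schema)
print("record is valid")
```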

Elasticsearch mapping

The file-related fields that must be set in the mapping are _bucket, _bucket_content, and file. They are managed by the search service and will be ignored if sent by the user.
The extracted content will be stored in the field content, which is not configurable.

{
  "settings": { ... },
  "mappings": {
    "file_v0.0.1": {
      "properties": {
        "_data": {
          "type": "object",
          "properties": {
            "title": {
              "type": "keyword"
            },
            "content": {
              "type": "text"
            }
          }
        },
        "_bucket": {
          "type": "keyword"
        },
        "_bucket_content": {
          "type": "keyword"
        },
        "file": {
          "type": "keyword"
        },
        "url": {
          "type": "keyword"
        },
        ...
      }
    }
  }
}

Metadata extraction

If requested, Tika's extracted metadata can additionally be added to the document. Currently the metadata includes:

  • Authors: Field _data.authors
  • Content-based classification: Field collection (see the sketch after this list)

    Collection | Mime type extensions
    Document   | doc, docx, odt, pages, rtf, tex, wpd, txt
    PDF        | pdf
    Sheet      | ods, xlsx, xlsm, xls, numbers
    Slides     | ppt, pptx, pps, odp, key
    Other      | all other formats

  • Name/Title: Field _data.name
  • Keywords: Field _data.keywords
  • Creation date: Field creation_date
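
As a rough illustration of the classification table above, the mapping from file extension to collection could be expressed as in the sketch below; the helper name and the fallback value are assumptions for illustration, not part of the service.

```python
# Hypothetical helper mirroring the extension-to-collection table above.
COLLECTION_BY_EXTENSION = {
    "Document": {"doc", "docx", "odt", "pages", "rtf", "tex", "wpd", "txt"},
    "PDF": {"pdf"},
    "Sheet": {"ods", "xlsx", "xlsm", "xls", "numbers"},
    "Slides": {"ppt", "pptx", "pps", "odp", "key"},
}


def classify(file_name: str) -> str:
    """Return the collection name for a file, 'Other' for unknown extensions."""
    extension = file_name.rsplit(".", 1)[-1].lower()
    for collection, extensions in COLLECTION_BY_EXTENSION.items():
        if extension in extensions:
            return collection
    return "Other"


print(classify("my-demo-file.pdf"))   # PDF
print(classify("slides.key"))         # Slides
print(classify("archive.zip"))        # Other
```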

Note additional fields in the following Elasticsearch mapping:

{
  "settings": { ... },
  "mappings": {
    "file_with_metadata_v0.0.1": {
      "properties": {
        "_data": {
          "type": "object",
          "properties": {
            "name": {
              "type": "text"
            },
            "keywords": {
              "type": "text"
            },
            "authors": {
              "type": "text"
            }
          }
        },
        "collection": {
          "type": "keyword"
        },
        "creation_date": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        },
        ...
      }
    }
  }
}

REST API

Create the metadata of the file

Required field

Note the required field url. This field must be provided by the user.

$ curl -X POST -H 'Content-Type: application/json' https://<host:port>/api/records/ --data '
{
    "_access": { ... },
    "_data": {
        "title": "Demo metadata file"
    },
    "url": "https://cdn.my-external-domain.org/my-demo-file.pdf"
    "$schema": "http://<host:port>/schemas/test/file_v0.0.1.json"
}
'

The response will contain the CERN Search ID of the metadata record: the control_number field, which is used in the following requests.
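
For reference, the same request could be issued from Python with the requests library. The host, the _access payload, and the exact location of the control_number in the response body are placeholders/assumptions based on the description above.

```python
import requests

# Placeholder values: adapt the host, schema URL and _access rules to your instance.
HOST = "https://<host:port>"
record = {
    "_access": {},  # fill in your access rules
    "_data": {"title": "Demo metadata file"},
    "url": "https://cdn.my-external-domain.org/my-demo-file.pdf",
    "$schema": f"{HOST}/schemas/test/file_v0.0.1.json",
}

response = requests.post(f"{HOST}/api/records/", json=record)
response.raise_for_status()

# Assumption: the control_number is available in the response body; inspect the
# JSON to locate it in your instance.
print(response.json())
```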

Uploading the file

One-to-one

Currently, only one file can be associated with the metadata. If multiple files are sent, the most recent one will overwrite the previous.

Assuming your file is named my-demo-file.pdf and is located in the current directory, this is the request you should execute:

$ curl -X PUT https://<host:port>/api/record/<control_number>/files/<file_name> -H "Content-Type: application/octet-stream" --data-binary @my-demo-file.pdf

The file will be stored in a bucket and its content will be extracted and indexed.

Give it time, do not panic

Content extraction is a heavy operation and therefore it might take a few seconds until the content of your file is indexed to be searched.
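
The equivalent upload from Python could look like the following sketch; the host, control number, and file name are the hypothetical values from the previous steps.

```python
import requests

HOST = "https://<host:port>"          # placeholder
CONTROL_NUMBER = "<control_number>"   # returned when the metadata was created

# Stream the binary file as the request body, exactly like the curl example above.
with open("my-demo-file.pdf", "rb") as file_handle:
    response = requests.put(
        f"{HOST}/api/record/{CONTROL_NUMBER}/files/my-demo-file.pdf",
        headers={"Content-Type": "application/octet-stream"},
        data=file_handle,
    )
response.raise_for_status()
```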

Obtaining the file's extracted content

You can verify the extracted content with the following request. Until the content has been successfully extracted, the request will result in a 404.

$ curl -X GET -H "Content-type: application/json" https://<host:port>/api/record/<control_number>/files/<file_name>

Give it time, do not panic

Content extraction is a heavy operation and therefore it might take a few seconds until the content of your file is available.
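
Since extraction is asynchronous, a client can simply retry until this endpoint stops returning 404. The sketch below shows one way to poll; the retry count and delay are arbitrary assumptions.

```python
import time

import requests

HOST = "https://<host:port>"          # placeholder
CONTROL_NUMBER = "<control_number>"
FILE_NAME = "my-demo-file.pdf"

# Poll until the extracted content is available (404 means "not ready yet").
for _ in range(10):
    response = requests.get(f"{HOST}/api/record/{CONTROL_NUMBER}/files/{FILE_NAME}")
    if response.status_code != 404:
        response.raise_for_status()
        print(response.text)
        break
    time.sleep(5)  # arbitrary delay between retries
else:
    print("Content not available yet, try again later.")
```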

Updating a file or metadata

A file can be resubmitted and will be processed again. The <file_name> is permitted to change.

$ curl -X PUT https://<host:port>/api/record/<control_number>/files/<file_name> -H "Content-Type: application/octet-stream" --data-binary @the-correct-file.pdf

Metadata can be updated and previously extracted file content will be kept indexed and searchable.

$ curl -X PUT -H "Content-type: application/json" https://<host:port>/api/record/<control_number> -d '
{
    "_access": { ... },
    "_data": {
        "title": "Demo metadata file changed"
    },
    "url": "https://cdn.my-external-domain-changed.org/my-demo-file.pdf",
    "$schema": "http://<host:port>/schemas/test/file_v0.0.1.json"
}
'

Deleting a file or metadata

To update a file it is not necessary to delete the previous one first, but if for any reason this is needed, it can be done with the following request. The record will be reindexed without content.

$ curl -X DELETE -H "Content-type: application/json" https://<host:port>/api/record/<control_number>/files/<file_name>

If the metadata is deleted, the associated file will no longer be accessible.

$ curl -X DELETE -H "Content-type: application/json" https://<host:port>/api/record/<control_number>


$ curl -X GET -H "Content-type: application/json" https://<host:port>/api/record/<control_number>/files/<file_name>
# will result in a 410 - Gone
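
A scripted clean-up could look like the sketch below: it deletes the metadata record and then confirms that the file endpoint reports 410 - Gone, as described above. Host, control number, and file name are placeholders.

```python
import requests

HOST = "https://<host:port>"          # placeholder
CONTROL_NUMBER = "<control_number>"
FILE_NAME = "my-demo-file.pdf"

# Delete the metadata record; the associated file becomes inaccessible.
response = requests.delete(f"{HOST}/api/record/{CONTROL_NUMBER}")
response.raise_for_status()

# The file endpoint should now answer 410 - Gone.
check = requests.get(f"{HOST}/api/record/{CONTROL_NUMBER}/files/{FILE_NAME}")
print(check.status_code)  # expected: 410
```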

Alternatives

Include the Tika library

Having Tika locally in your project (it is a library) is the optimal solution, since you will be able to customize the options it offers for your specific use case (timeouts, amount of extracted text, etc.).

To use it, first download Apache Tika, then decompress the file and place the library in the corresponding path so your application can find it.

tika-python

For Python, Tika is made available through pip as tika-python, and more documentation can be found in that project's own documentation. Note that this library is not maintained by us.

As you can see in the following snippets, getting the content involves a simple call to the Tika parser:

```Python tab=
import json

import tika
from tika import parser

# Initialize Tika
tika.initVM()

# Query Tika
parsed = parser.from_file('test_blob.pdf')
print(json.dumps(parsed, indent=2, sort_keys=True))
```

```Java tab=
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.File;
import java.io.IOException;

public static void queryTika() throws IOException {
    try {
        // Initialize Tika
        Tika tika = new Tika();

        // Query Tika
        String parsed = tika.parseToString(new File("test_blob.pdf"));
        System.out.println(parsed);
    } catch (TikaException e) {
        e.printStackTrace();
    }
}
```

More examples

More documentation and usage examples, in Java, are available on the official site.