How to handle binary files

The previous section explains how to submit pure text files/documents in a structured format. But what happens if your documents are in a binary format? This means an Excel sheet, a Word document, a PowerPoint presentation, a PDF, or even a TXT file whose text you do not want to read and insert into the request body yourself.

If this is your case, we can extract the text content from your files with OCR (Optical Character Recognition) features. There are many such tools available on the Internet that you could use on your end; nonetheless, the CERN Search service offers this feature via Apache Tika.

In the following paragraphs we will explain how to enable and use this feature. Alternatively, you can set up Tika and extract content in your own project.

Elasticsearch content extraction capabilities

Previously there was an option to send binary content as a base64-encoded string and then rely on the ingest pipelines of Elasticsearch to extract the content. This option is no longer available; it has been deprecated since February 2020.

List of supported formats

Currently we support every parser Tika offers except org.apache.tika.parser.ocr.TesseractOCRParser.

Content extraction

Disclaimer

CERN Search is not responsible for providing or storing the files after uploading. The user must provide a URL to access the file, external to the search domain. If no URL is provided, the file's content will be searchable but the result will not be linkable to any file.

Uploading a file is done in two requests: first the record metadata and then the associated file.

Below is the simplest mapping for files, with only one custom field, title.
Observe the required field url. This field must be sent by the user.

JSON schema

{
  "title": "Custom record schema v0.0.1",
  "id": "http://localhost:5000/schemas/test/file_v0.0.1.json",
  "$schema": "http://localhost:5000/schemas/test/file_v0.0.1.json",
  "type": "object",
  "properties": {
    "_data": {
      "type": "object",
      "title": {
        "type": "string",
        "description": "Record title."
      }
    },
    "url": {
      "type": "string"
    },
    ...
  }
}
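
As an illustration, the following sketch validates a candidate record against a schema of this shape using the jsonschema Python package. The simplified schema copy and the sample record are assumptions for local testing only; they are not part of the CERN Search API.

```python
import jsonschema

# Assumed, simplified local copy of the file schema above (for illustration only).
file_schema = {
    "type": "object",
    "properties": {
        "_data": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Record title."}
            },
        },
        "url": {"type": "string"},
    },
    "required": ["url"],
}

# Hypothetical record body: the title lives under _data, the url points to an
# externally hosted copy of the file (CERN Search does not store or serve it).
record = {
    "_data": {"title": "Demo metadata file"},
    "url": "https://cdn.my-external-domain.org/my-demo-file.pdf",
}

# Raises jsonschema.ValidationError if the record does not match the schema.
jsonschema.validate(instance=record, schema=file_schema)
print("record is valid")
```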

Elasticsearch mapping

The file-related fields that must be set in the mapping are _bucket, _bucket_content, and file. They are managed by the search service and will be ignored if sent by the user.
The extracted content will be stored in the field content, which is not configurable.

{
  "settings": { ... },
  "mappings": {
    "file_v0.0.1": {
      "properties": {
        "_data": {
          "type": "object",
          "properties": {
            "title": {
              "type": "keyword"
            },
            "content": {
              "type": "text"
            }
          }
        },
        "_bucket": {
          "type": "keyword"
        },
        "_bucket_content": {
          "type": "keyword"
        },
        "file": {
          "type": "keyword"
        },
        "url": {
          "type": "keyword"
        },
        ...
      }
    }
  }
}

Metadata extraction

If requested, Tika's extracted metadata can additionally be added to the document. Currently the metadata includes:

  • Authors: Field _data.authors
  • Content-based classification: Field collection (see the sketch after this list)

    Collection | Mime type extensions
    Document   | doc, docx, odt, pages, rtf, tex, wpd, txt
    PDF        | pdf
    Sheet      | ods, xlsx, xlsm, xls, numbers
    Slides     | ppt, pptx, pps, odp, key
    Other      | all other formats

  • Name/Title: Field _data.name
  • Keywords: Field _data.keywords
  • Creation date: Field creation_date
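
As a rough illustration of the classification table above, the mapping from file extension to collection could be expressed as in the sketch below; the helper name and the fallback value are assumptions for illustration, not part of the service.

```python
# Hypothetical helper mirroring the extension-to-collection table above.
COLLECTION_BY_EXTENSION = {
    "Document": {"doc", "docx", "odt", "pages", "rtf", "tex", "wpd", "txt"},
    "PDF": {"pdf"},
    "Sheet": {"ods", "xlsx", "xlsm", "xls", "numbers"},
    "Slides": {"ppt", "pptx", "pps", "odp", "key"},
}


def classify(file_name: str) -> str:
    """Return the collection name for a file, 'Other' for unknown extensions."""
    extension = file_name.rsplit(".", 1)[-1].lower()
    for collection, extensions in COLLECTION_BY_EXTENSION.items():
        if extension in extensions:
            return collection
    return "Other"


print(classify("my-demo-file.pdf"))   # PDF
print(classify("slides.key"))         # Slides
print(classify("archive.zip"))        # Other
```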

Note additional fields in the following Elasticsearch mapping:

{
  "settings": { ... },
  "mappings": {
    "file_with_metadata_v0.0.1": {
      "properties": {
        "_data": {
          "type": "object",
          "properties": {
            "name": {
              "type": "text"
            },
            "keywords": {
              "type": "text"
            },
            "authors": {
              "type": "text"
            }
          }
        },
        "collection": {
          "type": "keyword"
        },
        "creation_date": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        },
        ...
      }
    }
  }
}

REST API

Create the metadata of the file

Required field

Note the required field url. This field must be provided by the user.

$ curl -X POST -H 'Content-Type: application/json' https://<host:port>/api/records/ --data '
{
    "_access": { ... },
    "_data": {
        "title": "Demo metadata file"
    },
    "url": "https://cdn.my-external-domain.org/my-demo-file.pdf"
    "$schema": "http://<host:port>/schemas/test/file_v0.0.1.json"
}
'

The response will contain the CERN Search ID of the metadata record: the control_number field, which is used in the following requests.
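
For reference, the same request could be issued from Python with the requests library. The host, the _access payload, and the exact location of the control_number in the response body are placeholders/assumptions based on the description above.

```python
import requests

# Placeholder values: adapt the host, schema URL and _access rules to your instance.
HOST = "https://<host:port>"
record = {
    "_access": {},  # fill in your access rules
    "_data": {"title": "Demo metadata file"},
    "url": "https://cdn.my-external-domain.org/my-demo-file.pdf",
    "$schema": f"{HOST}/schemas/test/file_v0.0.1.json",
}

response = requests.post(f"{HOST}/api/records/", json=record)
response.raise_for_status()

# Assumption: the control_number is available in the response body; inspect the
# JSON to locate it in your instance.
print(response.json())
```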

Uploading the file

One-to-one

Currently, only one file can be associated with the metadata. If multiple files are sent, the most recent one will overwrite the previous.

Assuming your file is named my-demo-file.pdf and is located in the current directory, this is the request you should execute:

$ curl -X PUT https://<host:port>/api/record/<control_number>/files/<file_name> -H "Content-Type: application/octet-stream" --data-binary @my-demo-file.pdf

The file will be stored in a bucket and its content will be extracted and indexed.

Give it time, do not panic

Content extraction is a heavy operation and therefore it might take a few seconds until the content of your file is indexed to be searched.
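
The equivalent upload from Python could look like the following sketch; the host, control number, and file name are the hypothetical values from the previous steps.

```python
import requests

HOST = "https://<host:port>"          # placeholder
CONTROL_NUMBER = "<control_number>"   # returned when the metadata was created

# Stream the binary file as the request body, exactly like the curl example above.
with open("my-demo-file.pdf", "rb") as file_handle:
    response = requests.put(
        f"{HOST}/api/record/{CONTROL_NUMBER}/files/my-demo-file.pdf",
        headers={"Content-Type": "application/octet-stream"},
        data=file_handle,
    )
response.raise_for_status()
```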

Obtaining the file's extracted content

You can verify the extracted content with the following request. Until the content has been successfully extracted, the request will result in a 404.

$ curl -X GET -H "Content-type: application/json" https://<host:port>/api/record/<control_number>/files/<file_name>

Give it time, do not panic

Content extraction is a heavy operation and therefore it might take a few seconds until the content of your file is available.
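
Since extraction is asynchronous, a client can simply retry until this endpoint stops returning 404. The sketch below shows one way to poll; the retry count and delay are arbitrary assumptions.

```python
import time

import requests

HOST = "https://<host:port>"          # placeholder
CONTROL_NUMBER = "<control_number>"
FILE_NAME = "my-demo-file.pdf"

# Poll until the extracted content is available (404 means "not ready yet").
for _ in range(10):
    response = requests.get(f"{HOST}/api/record/{CONTROL_NUMBER}/files/{FILE_NAME}")
    if response.status_code != 404:
        response.raise_for_status()
        print(response.text)
        break
    time.sleep(5)  # arbitrary delay between retries
else:
    print("Content not available yet, try again later.")
```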

Updating a file or metadata

A file can be resubmitted and will be processed again. The <file_name> is permitted to change.

$ curl -X PUT https://<host:port>/api/record/<control_number>/files/<file_name> -H "Content-Type: application/octet-stream" --data-binary @the-correct-file.pdf

Metadata can be updated and previously extracted file content will be kept indexed and searchable.

$ curl -X PUT -H "Content-type: application/json" https://<host:port>/api/record/<control_number> -d '
{
    "_access": { ... },
    "_data": {
        "title": "Demo metadata file changed"
    },
    "url": "https://cdn.my-external-domain-changed.org/my-demo-file.pdf",
    "$schema": "http://<host:port>/schemas/test/file_v0.0.1.json"
}
'

Deleting a file or metadata

To update a file it is not necessary to delete the previous one first, but if for any reason this is needed, it can be done with the following request. The record will be reindexed without content.

$ curl -X DELETE -H "Content-type: application/json" https://<host:port>/api/record/<control_number>/files/<file_name>

If the metadata is deleted, the associated file will no longer be accessible.

$ curl -X DELETE -H "Content-type: application/json" https://<host:port>/api/record/<control_number>


$ curl -X GET -H "Content-type: application/json" https://<host:port>/api/record/<control_number>/files/<file_name>
# will result in a 410 - Gone
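
A scripted clean-up could look like the sketch below: it deletes the metadata record and then confirms that the file endpoint reports 410 - Gone, as described above. Host, control number, and file name are placeholders.

```python
import requests

HOST = "https://<host:port>"          # placeholder
CONTROL_NUMBER = "<control_number>"
FILE_NAME = "my-demo-file.pdf"

# Delete the metadata record; the associated file becomes inaccessible.
response = requests.delete(f"{HOST}/api/record/{CONTROL_NUMBER}")
response.raise_for_status()

# The file endpoint should now answer 410 - Gone.
check = requests.get(f"{HOST}/api/record/{CONTROL_NUMBER}/files/{FILE_NAME}")
print(check.status_code)  # expected: 410
```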

Alternatives

Include the Tika library

Having Tika locally in your project (it is a library) is the optimal solution, since you will be able to customize the options it offers for your specific use case (timeouts, amount of extracted text, etc.).

To use it, first download Apache Tika, then decompress the file and place the library in the corresponding path so your application can find it.

tika-python

For Python, Tika is made available through pip as tika-python, and more documentation can be found in that project's own documentation. Note that this library is not maintained by us.

As you can see in the following snippets, getting the content involves a simple call to the Tika parser:

```Python tab=
import json

import tika
from tika import parser

# Initialize Tika
tika.initVM()

# Query Tika
parsed = parser.from_file('test_blob.pdf')
print(json.dumps(parsed, indent=2, sort_keys=True))
```

```Java tab=
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.File;
import java.io.IOException;

public static void queryTika() throws IOException {
    try {
        // Initialize Tika
        Tika tika = new Tika();

        // Query Tika
        String parsed = tika.parseToString(new File("test_blob.pdf"));
        System.out.println(parsed);
    } catch (TikaException e) {
        e.printStackTrace();
    }
}
```

More examples

More documentation and usage examples, in Java, are available on the official site.