How to handle binary files¶
The previous section explains how to submit plain-text files and documents in a structured format. But what if your documents are in a binary format? This covers an Excel sheet, a Word document, a PowerPoint presentation, a PDF, or even a TXT file whose text you do not want to read and insert into the request body yourself.
If this is your case, we can extract the text content from your files using OCR (Optical Character Recognition) features. Many tools available on the Internet let you do this on your end; nonetheless, the CERN Search service offers this feature via Apache Tika.
In the following paragraphs we explain how to enable and use this feature. Alternatively, you can set up Tika and extract the content in your own project.
Elasticsearch content extraction capabilities
Previously, it was possible to send binary content as a base64-encoded string and rely on the Elasticsearch ingest pipelines to extract the content. This option is no longer available: it has been deprecated since February 2020.
Currently we support every parser Tika offers except `org.apache.tika.parser.ocr.TesseractOCRParser`.
Content extraction¶
Disclaimer
CERN Search is not responsible for providing or storing the files after upload. The user must provide a URL, external to the search domain, where the file can be accessed. If no URL is provided, the file's content will be searchable but the result will not link to any file.
Uploading a file takes two requests: first the record metadata, then the associated file.
Note the simplest possible mapping for files, with a single custom field, `title`.
Observe the required field `url`. This field must be sent by the user.
JSON schema¶
{
  "title": "Custom record schema v0.0.1",
  "id": "http://localhost:5000/schemas/test/file_v0.0.1.json",
  "$schema": "http://localhost:5000/schemas/test/file_v0.0.1.json",
  "type": "object",
  "properties": {
    "_data": {
      "type": "object",
      "properties": {
        "title": {
          "type": "string",
          "description": "Record title."
        }
      }
    },
    "url": {
      "type": "string"
    },
    ...
  }
}
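Before submitting a record, a client can check the `url` requirement locally. The following is a minimal sketch of such a check, not part of the CERN Search API: the helper name `check_file_record` and the `search_host` parameter are hypothetical, and the rules it applies are only the ones stated on this page (a `url` must be present and must point outside the search domain).

```python
from urllib.parse import urlparse


def check_file_record(record, search_host):
    """Return a list of problems found in a file record (empty if none).

    Hypothetical client-side helper: it only mirrors the rules stated in
    this page and does not replace the validation done by the service.
    """
    problems = []
    url = record.get("url")
    if not isinstance(url, str) or not url:
        problems.append("missing required field 'url'")
        return problems

    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        problems.append("'url' is not an absolute http(s) URL")
    elif parsed.hostname == search_host:
        problems.append("'url' must point outside the search domain")
    return problems
```

An empty list means the record passes these local checks; the service may still reject it for other reasons.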
Elasticsearch mapping¶
The file-related fields that must be set in the mapping are `_bucket`, `_bucket_content`, and `file`. They are managed by the search service and are ignored if sent by the user.
The extracted content is stored in the field `content` (nested under `_data` in the mapping below), which is not configurable.
{
  "settings": { ... },
  "mappings": {
    "file_v0.0.1": {
      "properties": {
        "_data": {
          "type": "object",
          "properties": {
            "title": {
              "type": "keyword"
            },
            "content": {
              "type": "text"
            }
          }
        },
        "_bucket": {
          "type": "keyword"
        },
        "_bucket_content": {
          "type": "keyword"
        },
        "file": {
          "type": "keyword"
        },
        "url": {
          "type": "keyword"
        },
        ...
      }
    }
  }
}
Metadata extraction¶
If requested, the metadata extracted by Tika can also be added to the document. Currently this metadata includes:
- Authors: field `_data.authors`
- Content-based classification: field `collection`

  | Collection | MIME type extension |
  |------------|---------------------|
  | Document   | doc, docx, odt, pages, rtf, tex, wpd, txt |
  | PDF        | pdf |
  | Sheet      | ods, xlsx, xlsm, xls, numbers |
  | Slides     | ppt, pptx, pps, odp, key |
  | Other      | all other |

- Name/Title: field `_data.name`
- Keywords: field `_data.keywords`
- Creation date: field `creation_date`
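To anticipate which `collection` value a file is likely to receive, the extension lists above can be turned into a lookup. This is only a sketch based on the extensions listed on this page: the service classifies files based on their content, so its result may differ, and `guess_collection` is a hypothetical helper name.

```python
import os

# Extension lists copied from the collection list above; anything else is "Other".
COLLECTIONS = {
    "Document": {"doc", "docx", "odt", "pages", "rtf", "tex", "wpd", "txt"},
    "PDF": {"pdf"},
    "Sheet": {"ods", "xlsx", "xlsm", "xls", "numbers"},
    "Slides": {"ppt", "pptx", "pps", "odp", "key"},
}


def guess_collection(file_name):
    """Guess the `collection` value from a file name's extension (hypothetical helper)."""
    extension = os.path.splitext(file_name)[1].lstrip(".").lower()
    for collection, extensions in COLLECTIONS.items():
        if extension in extensions:
            return collection
    return "Other"
```

For instance, `guess_collection("my-demo-file.pdf")` returns `"PDF"`, while a file without a listed extension falls back to `"Other"`.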
Note the additional fields in the following Elasticsearch mapping:
{
  "settings": { ... },
  "mappings": {
    "file_with_metadata_v0.0.1": {
      "properties": {
        "_data": {
          "type": "object",
          "properties": {
            "name": {
              "type": "text"
            },
            "keywords": {
              "type": "text"
            },
            "authors": {
              "type": "text"
            }
          }
        },
        "collection": {
          "type": "keyword"
        },
        "creation_date": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        },
        ...
      }
    }
  }
}
REST API¶
Create the metadata of the file¶
Required field
Note the required field `url`. This field is managed by the user.
$ curl -X POST -H 'Content-Type: application/json' https://<host:port>/api/records/ --data '
{
  "_access": { ... },
  "_data": {
    "title": "Demo metadata file"
  },
  "url": "https://cdn.my-external-domain.org/my-demo-file.pdf",
  "$schema": "http://<host:port>/schemas/test/file_v0.0.1.json"
}
'
The response contains the CERN Search ID of the metadata: the `control_number` field, which is used in the following requests.
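The same request can be prepared from Python with the standard library. The sketch below only builds the request object so it can be inspected before sending; `build_create_request` and the `base_url` parameter are hypothetical names, and the record passed in is assumed to carry the same fields as the `curl` example (including `_access`, whose content is not shown on this page).

```python
import json
import urllib.request


def build_create_request(base_url, record):
    """Build (but do not send) the POST request that creates the record metadata."""
    return urllib.request.Request(
        url=f"{base_url}/api/records/",
        data=json.dumps(record).encode("utf-8"),  # JSON body, as with --data
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it; the `control_number` must then be read from the JSON response, whose exact layout is not described here.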
Uploading the file¶
One-to-one
Currently only one file can be associated with the metadata. If several are sent, the most recent one overwrites the previous one.
Suppose your file is named `my-demo-file.pdf` and is in the current directory. This is the request to execute:
$ curl -X PUT https://<host:port>/api/record/<control_number>/files/<file_name> -H "Content-Type: application/octet-stream" --data-binary @my-demo-file.pdf
The file will be stored in a bucket and its content will be extracted and indexed.
Give it time, do not panic
Content extraction is a heavy operation, so it might take a few seconds until the content of your file is indexed and searchable.
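The upload can also be prepared from Python. This is a sketch under the same assumptions as the `curl` command above: `build_upload_request` is a hypothetical helper, and the file name is percent-encoded so that names containing spaces still form a valid URL path.

```python
import urllib.request
from urllib.parse import quote


def build_upload_request(base_url, control_number, file_name, content):
    """Build (but do not send) the PUT request that uploads a file's raw bytes."""
    path = (
        f"/api/record/{quote(str(control_number), safe='')}"
        f"/files/{quote(file_name, safe='')}"
    )
    return urllib.request.Request(
        url=base_url + path,
        data=content,  # raw bytes, the equivalent of curl's --data-binary
        headers={"Content-Type": "application/octet-stream"},
        method="PUT",
    )
```

The `content` argument would typically come from `open("my-demo-file.pdf", "rb").read()`, and the request is sent with `urllib.request.urlopen`.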
Obtaining the file's extracted content¶
You can verify the extracted content with the following request. Until the content has been successfully extracted, the request returns a 404.
$ curl -X GET -H "Content-type: application/json" https://<host:port>/api/record/<control_number>/files/<file_name>
Give it time, do not panic
Content extraction is a heavy operation, so it might take a few seconds until the content of your file is available.
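Since the request returns a 404 until extraction has finished, a client may want to poll rather than fail on the first attempt. The following is a minimal sketch of such a loop; the function name, the number of attempts, and the delay are arbitrary assumptions, and the `opener` parameter exists only so the loop can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def wait_for_content(url, attempts=10, delay=2.0, opener=urllib.request.urlopen):
    """Poll `url` until it stops returning 404, then return the response body.

    Returns None if the content is still missing after `attempts` tries.
    """
    for attempt in range(attempts):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as error:
            if error.code != 404:
                raise  # anything other than "not extracted yet" is a real error
        if attempt < attempts - 1:
            time.sleep(delay)  # give the extraction some time before retrying
    return None
```

Here `url` is the file endpoint from the request above, `https://<host:port>/api/record/<control_number>/files/<file_name>`.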
Updating a file or metadata¶
A file can be resubmitted and will be processed again. `<file_name>` is allowed to change.
$ curl -X PUT https://<host:port>/api/record/<control_number>/files/<file_name> -H "Content-Type: application/octet-stream" --data-binary @the-correct-file.pdf
Metadata can be updated; the previously extracted file content remains indexed and searchable.
$ curl -X PUT -H "Content-type: application/json" https://<host:port>/api/record/<control_number> -d '
{
  "_access": { ... },
  "_data": {
    "title": "Demo metadata file changed"
  },
  "url": "https://cdn.my-external-domain-changed.org/my-demo-file.pdf",
  "$schema": "http://<host:port>/schemas/test/file_v0.0.1.json"
}
'
Deleting a file or metadata¶
To update a file, it is not necessary to delete the previous one first, but if deletion is needed for any reason, it can be done with the following request. The record is then reindexed without content.
$ curl -X DELETE -H "Content-type: application/json" https://<host:port>/api/record/<control_number>/files/<file_name>
If the metadata is deleted, the associated file will no longer be accessible.
$ curl -X DELETE -H "Content-type: application/json" https://<host:port>/api/record/<control_number>
$ curl -X GET -H "Content-type: application/json" https://<host:port>/api/record/<control_number>/files/<file_name>
# will result in a 410 - Gone
Alternatives¶
Include the Tika library¶
Having Tika locally in your project (it is a library) is the optimal solution, since you can customize the options it offers for your specific use case (timeouts, amount of extracted text, etc.).
To use it, first download Apache Tika, then decompress the file and place the library in the corresponding path so your application can find it.
tika-python
For Python, Tika is available through `pip`, and more documentation can be found in the tika-python project. Note that this library is not maintained by us.
As you can see in the following snippets, getting the content involves a simple call to the Tika parser:
```Python tab=
import json

import tika
from tika import parser

# Initialize Tika
tika.initVM()

# Query Tika: the result is a dict holding the extracted content and metadata
parsed = parser.from_file('test_blob.pdf')
print(json.dumps(parsed, indent=2, sort_keys=True))
```

```Java tab=
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.File;
import java.io.IOException;

// The class name is only a wrapper so that the example compiles
public class TikaExample {
    public static void queryTika() throws IOException {
        try {
            // Initialize Tika
            Tika tika = new Tika();
            // Query Tika: extract the file's text content as a string
            String parsed = tika.parseToString(new File("test_blob.pdf"));
            System.out.println(parsed);
        } catch (TikaException e) {
            e.printStackTrace();
        }
    }
}
```
More examples
More documentation and usage examples, in Java, are available on the official site.