Create, delete, update and read documents
This section explains how to perform CRUD operations against your instance, using bash scripts as examples. Python and Java code snippets are also available in the examples section.
Read the permissions section before proceeding
Read the permissions section carefully before inserting or querying documents, since the _access field has to be properly defined in order to achieve the desired behaviour.
Insert documents
To upload a document you need to perform a POST operation. For example, using curl:
Quote your variables properly
When using curl or similar tools, single and double quotes behave differently. For example, -H 'Authorization: $TOKEN' in your terminal sends that literal string ($TOKEN) as the token, whereas with double quotes the shell expands $TOKEN and sends 'Authorization: value_of_the_env_var_token'.
curl -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' \
-H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/' --data '
{
"_access": {
"delete": ["test-egroup@cern.ch"],
"owner": ["test-egroup@cern.ch"],
"read": ["test-egroup@cern.ch", "test-egroup-two@cern.ch"],
"update": ["test-egroup@cern.ch"]
},
"class": "A",
"description": "This is an awesome description for our first uploaded document",
"title": "Demo document",
"$schema": "http://<host:port>/schemas/test/doc_v0.0.1.json"
}
'
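The same request can be sketched in Python with only the standard library. This is a minimal illustration, not an official client: the function name build_insert_request, the host http://localhost:5000 and the token my-token are placeholders invented for the example. The request is built but not sent, so the sketch can be read and run without a live instance.

```python
import json
import urllib.request


def build_insert_request(base_url, token, document):
    """Build (but do not send) the POST request that inserts a document."""
    return urllib.request.Request(
        url=f"{base_url}/api/records/",
        data=json.dumps(document).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Same payload as the curl example; host and schema URL are placeholders.
document = {
    "_access": {
        "delete": ["test-egroup@cern.ch"],
        "owner": ["test-egroup@cern.ch"],
        "read": ["test-egroup@cern.ch", "test-egroup-two@cern.ch"],
        "update": ["test-egroup@cern.ch"],
    },
    "class": "A",
    "description": "This is an awesome description for our first uploaded document",
    "title": "Demo document",
    "$schema": "http://localhost:5000/schemas/test/doc_v0.0.1.json",
}

request = build_insert_request("http://localhost:5000", "my-token", document)
# To actually send it against a real instance: urllib.request.urlopen(request)
```

Serialising the payload with json.dumps avoids hand-written JSON mistakes such as a missing comma between fields.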
Do not forget the $schema field
The $schema field is not mandatory. However, if it is not set, the document will be inserted in the default schema (defined upon instance creation).
We will provide you with a list of values for your types of documents.
The value should follow the pattern http://<host>/schemas/<search_instance>/<index_doc_type>.json, for example: http://my-search-domain.web.cern.ch/schemas/my-search/doc_v1.0.0.json
The response should be a code 200 (ok) with a self link to the newly inserted document, similar to the URL in the next query. With it you can retrieve the document:
curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
-H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'https://<host:port>/api/record/1'
Query documents
To query documents you need to perform a GET operation. An example query for the words awesome and document looks like this:
curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
-H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=awesome+document'
The answer would be similar to:
{
"aggregations": {},
"hits": {
"hits": [
{
"created": "2018-03-19T08:16:53.218017+00:00",
"explanation": {},
"highlight": {},
"id": 5,
"links": {
"self": "http://<host:port>/api/record/5"
},
"metadata": {
"_access": {<access details>},
"control_number": "5",
"class": "B",
"description": "This is an awesome description for our first uploaded document",
"title": "Demo document"
},
"updated": "2018-03-19T08:16:53.218042+00:00"
}
],
"total": 2
},
"links": {
"prev": "http://<host:port>/api/records/?page=1&size=1",
"self": "http://<host:port>/api/records/?page=2&size=1"
}
}
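A response shaped like the one above can be reduced to the fields that usually matter. The following is a small sketch, assuming the response has already been decoded from JSON into a Python dict; the helper name summarize_hits and the trimmed sample data are illustrative only.

```python
def summarize_hits(response):
    """Return (total, rows), where each row is (id, title, self_link)."""
    hits = response["hits"]
    rows = [
        (hit["id"], hit["metadata"].get("title"), hit["links"]["self"])
        for hit in hits["hits"]
    ]
    return hits["total"], rows


# Trimmed version of the example response; the host is a placeholder.
sample = {
    "hits": {
        "hits": [
            {
                "id": 5,
                "links": {"self": "http://localhost:5000/api/record/5"},
                "metadata": {"control_number": "5", "title": "Demo document"},
            }
        ],
        "total": 2,
    }
}

total, rows = summarize_hits(sample)
for record_id, title, link in rows:
    print(record_id, title, link)
```

Note that total counts all matching documents, while the hits list only holds the documents of the current page.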
Links field
The links field is very useful for processing the results. It gives you the URLs of the current, next and previous pages (the first page has no prev link and the last page has no next link).
Useful parameters
Iterate with page and size
You can use pagination to restrict the number of results: the size parameter sets how many documents are returned per page, and the page parameter selects which page you want.
For example, to obtain the second page of a query that returns one element per page:
curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
-H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=awesome+document&page=2&size=1'
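Pagination combines naturally with the links field of the response. The sketch below builds the paginated URL and then follows links.next until it is absent. It assumes a caller-supplied fetch_json callable (hypothetical, for example a thin wrapper around your HTTP library that adds the Authorization header and decodes the JSON body); the function names and host are placeholders.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, page=1, size=10):
    """Build a search URL; urlencode turns spaces in the query into '+'."""
    params = urlencode({"q": query, "page": page, "size": size})
    return f"{base_url}/api/records/?{params}"


def iter_hits(fetch_json, first_url):
    """Yield every hit, following links.next until the last page.

    fetch_json is any callable that takes a URL and returns the decoded
    JSON body of the response as a dict.
    """
    url = first_url
    while url:
        body = fetch_json(url)
        yield from body["hits"]["hits"]
        # The last page carries no "next" link, which ends the loop.
        url = body.get("links", {}).get("next")


print(build_search_url("http://localhost:5000", "awesome document", page=2, size=1))
```

Passing the fetch function in as a parameter keeps the iteration logic independent of the HTTP client, so it can be exercised with canned pages before pointing it at a real instance.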
Understand order of results with explain
To better understand the Elasticsearch scoring algorithm, use the explain query parameter, which is intended for development.
It adds to each hit an explanation of how its score was computed.
curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
-H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=example&explain=true'
The explanation appears under the explanation field of each hit.
Example response:
{
"aggregations": {},
"hits": {
"hits": [
{
"created": "2018-03-19T08:16:53.218017+00:00",
"explanation": {
"value": 1.6943599,
"description": "weight(_data:field:example in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 1.6943599,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 1.3862944,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1.0,
"description": "docFreq",
"details": []
},
{
"value": 5.0,
"description": "docCount",
"details": []
}
]
},
{
"value": 1.2222223,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1.0,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 5.4,
"description": "avgFieldLength",
"details": []
},
{
"value": 3.0,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
...
}
],
"total": 2
},
"links": {
"prev": "http://<host:port>/api/records/?page=1&size=1",
"self": "http://<host:port>/api/records/?page=2&size=1"
}
}
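The numbers in this explanation can be reproduced by hand from the two formulas quoted in its description fields. The sketch below plugs in the values from the example response (docFreq, docCount, termFreq, k1, b, fieldLength, avgFieldLength); the function names are illustrative. Results agree with the response to about six decimal places, the small difference presumably coming from the lower floating-point precision used by the search engine.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))"""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))"""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


idf = bm25_idf(doc_freq=1.0, doc_count=5.0)
tf_norm = bm25_tf_norm(
    freq=1.0, k1=1.2, b=0.75, field_length=3.0, avg_field_length=5.4
)
# The hit score is the product of the two factors.
score = idf * tf_norm
print(idf, tf_norm, score)
```

Reading the tree this way shows why a short field scores higher: fieldLength (3.0) is below avgFieldLength (5.4), which pushes tfNorm above 1.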
Check matched terms with highlight
From the Elasticsearch docs on highlighting:
Highlighters enable you to get highlighted snippets from one or more fields in your search results so you can show users where the query matches are. When you request highlights, the response contains an additional highlight element for each search hit that includes the highlighted fields and the highlighted fragments.
Send the list of fields you want to highlight in the following format:
curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
-H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=Lorem+ipsum+dolor&highlight=_data.*&highlight=other_field'
You can use wildcards to match all inner fields of a field, e.g. _data.*.
The highlight field will follow this format:
...
"highlight": {
"_data.my_field": [
"<em>Lorem</em> <em>ipsum</em> <em>dolor</em> sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et",
"Duis aute irure <em>dolor</em> in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur"
]
}
...
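To check which terms actually matched, the em tags can be pulled out of the fragments. This is a minimal sketch assuming the highlight element has been decoded into a dict of field name to list of fragments, as in the format above; the helper name matched_terms is made up for the example.

```python
import re

EM_TAG = re.compile(r"<em>(.*?)</em>")


def matched_terms(highlight):
    """Map each highlighted field to the sorted, distinct terms inside <em> tags."""
    return {
        field: sorted(
            {term for fragment in fragments for term in EM_TAG.findall(fragment)}
        )
        for field, fragments in highlight.items()
    }


highlight = {
    "_data.my_field": [
        "<em>Lorem</em> <em>ipsum</em> <em>dolor</em> sit amet, consectetur adipiscing elit",
        "Duis aute irure <em>dolor</em> in reprehenderit in voluptate velit esse cillum",
    ]
}
print(matched_terms(highlight))
```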
Advanced queries
The query parameter q (or query) specified in the URL can contain boolean expressions.
For example, q=+value1 +value2 will match only when both value1 and value2 are matched. Be careful to URL-encode the + sign.
In addition, it accepts queries per field in the format field:value, where value can be a string or a boolean expression. In the latter case it has to be surrounded by parentheses.
For example, q=field_x.field_y:quick +(field_3:brown field_3:fox) will match when either brown or fox matches field_3, while a match of quick in field_x.field_y is optional.
It can also be written as q=field_x.field_y:quick +field_3:(brown || fox).
You can also manipulate the weights in the query, e.g. q=field1:value1 field2:(value1)^2.
From the ES documentation, some important notes regarding boolean operators:
By default, all terms are optional, as long as one term matches. A search for foo bar baz will find any document that contains one or more of foo or bar or baz. We have already discussed the default_operator above which allows you to force all terms to be required, but there are also boolean operators which can be used in the query string itself to provide more control.
The preferred operators are + (this term must be present) and - (this term must not be present). All other terms are optional. For example, this query:
quick brown +fox -news
states that:
- fox must be present
- news must not be present
- quick and brown are optional: their presence increases the relevance
The familiar operators AND, OR and NOT (also written &&, || and !) are also supported. However, the effects of these operators can be more complicated than is obvious at first glance. NOT takes precedence over AND, which takes precedence over OR. While the + and - only affect the term to the right of the operator, AND and OR can affect the terms to the left and right.
Refines or filters
Field-specific queries are very helpful as filters, especially when you know the field can only take a definite set of values, for example the document type (doc, ppt, pdf, etc.).
Escape special characters
Remember to escape reserved characters if they have to be searched for and not interpreted. The reserved characters are: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /. In addition, if you are running the query from the terminal, it might be necessary to URL-encode these characters. Lastly, replace spaces with + so the URL is properly formatted.
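Escaping and URL-encoding are two separate steps, which the following sketch keeps apart. It is an illustration under assumptions rather than the service's own logic: the helper name escape_query_value is invented, and dropping < and > follows the Elasticsearch query string documentation, which states that these two characters cannot be escaped at all.

```python
from urllib.parse import quote_plus

# Characters escaped with a backslash; && and || are covered by escaping
# each & and | individually.
ESCAPABLE = set('+-=&|!(){}[]^"~*?:\\/')


def escape_query_value(value):
    """Backslash-escape reserved characters; drop < and > (not escapable)."""
    escaped = []
    for char in value:
        if char in "<>":
            continue
        if char in ESCAPABLE:
            escaped.append("\\")
        escaped.append(char)
    return "".join(escaped)


literal = escape_query_value("C++ (draft)")
print(literal)              # step 1: query-string escaping
print(quote_plus(literal))  # step 2: URL encoding, spaces become '+'
```

Only values that must be matched literally should go through escape_query_value; operators you want interpreted (for example a leading + or a wildcard) are left unescaped and only URL-encoded.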
For more details and examples check the ES documentation on Query DSL » Full text queries » Query String Query. Currently we only support the query field (not fields, default_field and so on).
Scoring type
Use the query parameter type to change how the query matches and scores documents.
Values are those permitted by Elasticsearch, as documented in valid values for type.
Defaults to best_fields.
Example:
curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
-H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=author+book+year&type=cross_fields'
Wildcards
There are two characters that can be used as wildcards. The question mark, ?, matches any single character exactly once. For example, val?e would match value and val8e but not valuu88e. The latter would be matched by val*e, since * matches any character any number of times (including 0, so it would also match vale). If you have to query for ? or * as a value and not as a wildcard, you just have to escape them.
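The semantics of ? and * described above can be tried locally with Python's fnmatch module, which happens to give the two characters the same meaning. This is only an analogy for experimenting with patterns; it does not call Elasticsearch and says nothing about how a field is analyzed.

```python
from fnmatch import fnmatchcase

# (pattern, candidate) pairs taken from the paragraph above.
checks = [
    ("val?e", "value"),
    ("val?e", "val8e"),
    ("val?e", "valuu88e"),
    ("val*e", "valuu88e"),
    ("val*e", "vale"),
]

for pattern, candidate in checks:
    # fnmatchcase is case-sensitive and treats ? as exactly one character
    # and * as zero or more characters.
    print(pattern, candidate, fnmatchcase(candidate, pattern))
```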
Most wildcard queries have to be done over a keyword field, which is not analyzed and is therefore matched exactly. Assuming a field named "wildtest" defined as:
"wildtest": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
If a record with wildtest: ADB15*/test exists, we could perform the following queries:
- With a wildcard: q=wildtest:ADB*.
- Looking for the * character: q=wildtest:ADB15\*.
Due to the way text data is analyzed with the default analyzer, more complex queries need to be done over the keyword sub-field:
- Match the beginning of the word, followed by a wildcard, followed by an exact match of the * and / characters: q=wildtest.keyword:ADB*\*\/*. The first and last * are not escaped, therefore they work as wildcards. The / is escaped because it is a special character. The other * is escaped so it looks for the character * in the string.
- A similar variation of the above but using a wildcard at the beginning. In addition, the ? is present. It works as a wildcard but for only one character (therefore two are needed to account for 1 and 5 in the string): q=wildtest.keyword:*DB??\*\/*.
Encode characters in your terminal
If you execute these queries via curl or similar terminal tools, be aware that some characters need to be URL-encoded (e.g. \ is %5C, / is %2F).
Range and date queries
When querying for dates you can query for a range (apart from an exact date). The syntax follows the Elasticsearch DSL guidelines.
Assuming your document has a field called modification_date of type date with format YYYY-MM-DD'T'HH:mm:ss, a query for the documents modified at a precise time would be:
curl -k -X GET -H 'Content-Type: application/json' -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=modification_date:2019-08-05T17%5C:26%5C:15'
Encode and escape characters in your terminal
If you execute these queries via curl or similar terminal tools, be aware that some characters need to be URL-encoded (e.g. \ is %5C). In addition, : needs to be escaped so that it is not interpreted as a field:value comparison.
For range queries you can use {} and [], the first being an exclusive range and the second an inclusive range.
This means you can query modification_date:{* TO "2019-01-01T12:00:00"} and modification_date:[* TO "2019-01-01T12:00:00"].
The first one will return all documents whose modification date is before the first of January 2019 at midday (* leaves the lower bound open), while the second query will also include those modified exactly at midday.
Again, if you run these queries in a terminal you will need to URL-encode them first.
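Building the range expression and URL-encoding it can be done in two small steps. The sketch below is illustrative: date_range_query is a made-up helper, and it simply reproduces the two query forms shown above before passing them through quote_plus from the standard library.

```python
from urllib.parse import quote_plus


def date_range_query(field, start="*", end="*", inclusive=True):
    """Build field:[start TO end] (inclusive) or field:{start TO end} (exclusive)."""
    opening, closing = ("[", "]") if inclusive else ("{", "}")

    def bound(value):
        # An open bound stays as *, concrete dates are quoted.
        return value if value == "*" else f'"{value}"'

    return f"{field}:{opening}{bound(start)} TO {bound(end)}{closing}"


exclusive = date_range_query(
    "modification_date", end="2019-01-01T12:00:00", inclusive=False
)
inclusive = date_range_query("modification_date", end="2019-01-01T12:00:00")

print(exclusive)
print(quote_plus(inclusive))  # value to place after q= in the URL
```

Because the dates are wrapped in double quotes, the colons inside them do not need a backslash, unlike in the unquoted exact-date query.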
Update documents
To update a document you need to perform a PUT operation. The ID or control_number of the record is part of the URL.
curl -X PUT -H 'Content-Type: application/json' -H 'Accept: application/json' \
-H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/record/<ID>' --data '
{
"description": "This is an awesome updated description",
"title": "Update Test 1",
"class": "Z"
}
'
Documents can also be partially updated. In this case, you need to perform a PATCH. The endpoint accepts application/json-patch+json as Content-Type. An example to update the description would be:
curl -k -X PATCH -H 'Content-Type: application/json-patch+json' -H 'Accept: application/json' \
-H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'https://<host:port>/api/record/9' --data '
[
{
"op": "replace",
"path": "/description",
"value": "Description changed with patch partial update"}
]'
Path and trailing slashes
Be careful not to leave a trailing slash in the path attribute (e.g. /description/): the request will fail because that path does not exist in the document.
JSON Patch
More about the JSON Patch operations can be found in the JSON Patch specification (RFC 6902).
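A small helper can build the patch body and catch the trailing-slash mistake before the request is sent. This is a sketch under assumptions: replace_op is a hypothetical name, and the check only covers the two path problems mentioned here (missing leading slash, trailing slash), not full JSON Pointer validation.

```python
import json


def replace_op(path, value):
    """Return a JSON Patch 'replace' operation, rejecting malformed paths."""
    if not path.startswith("/"):
        raise ValueError(f"path must start with '/': {path!r}")
    if len(path) > 1 and path.endswith("/"):
        raise ValueError(f"path must not end with '/': {path!r}")
    return {"op": "replace", "path": path, "value": value}


# Body for the PATCH request; send it with
# Content-Type: application/json-patch+json
patch_body = json.dumps(
    [replace_op("/description", "Description changed with patch partial update")]
)
print(patch_body)
```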
Update By Query
There might be cases where nested documents need to be updated depending on a value or condition. For instance, imagine a document that contains the key "articles", whose value is an array of objects with "title" and "description":
{
"id": 1,
"articles": [
{
"title": "Article one",
"description": "It was a really good article"
},
{
"title": "Article one, part two",
"description": "It was a really bad article"
},
{
"title": "No imagination for titles",
"description": "Another random description"
}
]
}
As you can see, there are three options to update a value in the field "articles":
- Use the PUT operation and send the full new document.
- Use the PATCH operation and send a replace of the whole "articles" field.
- Use the update by query endpoint, 'https://<host:port>/api/ubq/<record_id>', where record_id is the record control_number, the ID that would be used if you wanted to do a GET of that document.
The difference between this endpoint and the other two is that you send how to modify the field/document, which gives you more flexibility in the kind of updates you can perform. You need to send the script that will be executed in the ubq field in the body of the request.
For example:
{
"ubq" : {
"script": {
"source": "for (item in ctx._source.articles){if (item.title == 'No imagination for titles'){item.content='Updated content via RESTful API script'}}",
"lang": "painless"
}
},
"$schema": "http://<host:port>/schemas/<search_instance>/<index_doc_type>.json"
}
You can take the JSON example above and change the content of the field "script" to make your own. In most cases, changing just the fields and the updated values should suffice.
This endpoint accepts all kinds of scripts that are accepted by the Update By Query operation in Elasticsearch.
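Before sending a script, it can help to reason about its effect on a local copy of the document. The sketch below does two things: it assembles the request body shown above, and it mirrors the example Painless loop in plain Python so the expected result can be inspected. Both helper names are invented for the illustration, the schema URL is a placeholder, and the Python mirror is only a dry run; it is not how the service executes the script.

```python
import json


def build_ubq_body(script_source, schema_url):
    """Assemble the update-by-query request body with a Painless script."""
    return {
        "ubq": {"script": {"source": script_source, "lang": "painless"}},
        "$schema": schema_url,
    }


def dry_run_article_update(document, title, content):
    """Python mirror of the example script: set 'content' on matching articles."""
    for item in document["articles"]:
        if item["title"] == title:
            item["content"] = content
    return document


source = (
    "for (item in ctx._source.articles)"
    "{if (item.title == 'No imagination for titles')"
    "{item.content='Updated content via RESTful API script'}}"
)
body = build_ubq_body(source, "http://localhost:5000/schemas/test/doc_v0.0.1.json")
print(json.dumps(body, indent=2))

local_copy = {
    "id": 1,
    "articles": [
        {"title": "Article one", "description": "It was a really good article"},
        {"title": "No imagination for titles", "description": "Another random description"},
    ],
}
dry_run_article_update(
    local_copy, "No imagination for titles", "Updated content via RESTful API script"
)
```

The dry run makes it visible that the example script adds a new content key to the matching article and leaves its title and description untouched.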
Disclaimer
This operation might produce inconsistencies between Elasticsearch and PostgreSQL storage (indexed data vs historic/backup data) if your script does not function as expected. The Search service is not responsible for any data corruption or inconsistency produced by these operations. Please test your scripts in a local ES instance before proceeding. Although the platform is able to detect these inconsistencies and will inform you when they occur, it cannot fix them, since there is no way to know which data is correct. The fastest way of solving an inconsistency is to delete the document and index it again.
Info
As of 01-03-2019 bulk updates by query, meaning for more than one document at a time, are not yet supported. We are working to enable this.
Delete documents
To delete a document you need to perform a DELETE operation. For this you simply need to specify the document ID or control_number in the URL:
curl -XDELETE -H 'Content-Type: application/json' -H 'Accept: application/json' \
-H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/record/<ID>'
If you afterwards send any request (GET, PUT, PATCH, DELETE) for that specific item, you will obtain a 410:
{
"status": 410,
"message": "PID has been deleted."
}