
Create, delete, update and read documents

In this section we explain how to perform CRUD operations against your instance. The examples below use bash scripts; Python and Java code snippets are also available in the examples section.

Read the permissions section before proceeding

Carefully read the permissions section before inserting or querying documents: the _access field has to be properly defined in order to achieve the desired behaviour.

Insert documents

In order to upload a document you need to perform a POST operation. For example, using curl:

Quote your variables properly

When using curl or similar tools, the choice between single and double quotes makes a difference. For example, -H 'Authorization: $TOKEN' in your terminal would send the literal string $TOKEN as the token. If double quotes are used instead, the value of $TOKEN is interpreted by the shell, sending 'Authorization: value_of_the_env_var_token'.
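As a minimal illustration (TOKEN is a hypothetical environment variable holding your token):

export TOKEN='abc123'   # hypothetical token value
# Single quotes: the shell does NOT expand the variable; the literal string $TOKEN is sent
curl -H 'Authorization: Bearer $TOKEN' -i 'http://<host:port>/api/records/'
# Double quotes: the shell expands the variable, sending Authorization: Bearer abc123
curl -H "Authorization: Bearer $TOKEN" -i 'http://<host:port>/api/records/'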

curl -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' \
    -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/' --data '
      {
        "_access": {
          "delete": ["test-egroup@cern.ch"],
          "owner": ["test-egroup@cern.ch"],
          "read": ["test-egroup@cern.ch", "test-egroup-two@cern.ch"],
          "update": ["test-egroup@cern.ch"]
        },
        "class": "A",
        "description": "This is an awesome description for our first uploaded document",
        "title": "Demo document"
        "$schema": "http://<host:port>/schemas/test/doc_v0.0.1.json"
      }
      '

Do not forget the $schema field

The $schema field is not mandatory. However, if it is not set, the document will be inserted into the default schema (defined upon instance creation). We will provide you with a list of values for your types of documents. The value should follow the pattern http://<host>/schemas/<search_instance>/<index_doc_type>.json, for example: http://my-search-domain.web.cern.ch/schemas/my-search/doc_v1.0.0.json

The response should have status code 200 (OK) and contain a self link to the newly inserted document, similar to the URL of the next query. With it you can retrieve the document:

curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'https://<host:port>/api/record/1'

Query documents

In order to query documents you need to perform a GET operation. An example query for the words awesome and document looks like this:

curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=awesome+document'

The response would be similar to:

{
  "aggregations": {},
  "hits": {
    "hits": [
      {
        "created": "2018-03-19T08:16:53.218017+00:00",
        "explanation": {},
        "highlight": {},
        "id": 5,
        "links": {
          "self": "http://<host:port>/api/record/5"
        },
        "metadata": {
          "_access": {<access details>},
          "control_number": "5",
          "class": "B",
          "description": "This is an awesome description for our first uploaded document",
          "title": "Demo document"
        },
        "updated": "2018-03-19T08:16:53.218042+00:00"
      }
    ],
    "total": 2
  },
  "links": {
    "prev": "http://<host:port>/api/records/?page=1&size=1",
    "self": "http://<host:port>/api/records/?page=2&size=1"
  }
}

Links field

The links field is very useful for processing the results: it gives you the current, next and previous pages (the first and last pages contain only the next and prev links, respectively).
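For example, here is a minimal sketch that walks over every page of a query by following links.next (it assumes the jq tool is available and that ACCESS_TOKEN is an environment variable holding your token):

URL='http://<host:port>/api/records/?q=awesome+document&size=10'
# Follow links.next until it is absent (jq then returns the string "null")
while [ -n "$URL" ] && [ "$URL" != "null" ]; do
  PAGE=$(curl -s -H 'Accept: application/json' -H "Authorization:Bearer $ACCESS_TOKEN" "$URL")
  echo "$PAGE" | jq '.hits.hits[].id'        # process the hits of the current page
  URL=$(echo "$PAGE" | jq -r '.links.next')  # URL of the next page, if any
done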

Useful parameters

Iterate with page and size

You can specify the number of documents to be returned per page and the page you want, among other options.

As mentioned in the previous paragraph, we can use pagination to restrict the number of results. For example, we can obtain the second page (parameter page) of a query that returns one element (parameter size) per page:

curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=awesome+document&page=2&size=1'

Understand order of results with explain

To better understand the Elasticsearch scoring algorithm, use the explain query parameter (intended for development use). It adds to each hit an explanation of how its score was computed.

curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=example&explain=true'

The explanation will appear under the explanation field of each hit.

Example response:

{
  "aggregations": {},
  "hits": {
    "hits": [
      {
        "created": "2018-03-19T08:16:53.218017+00:00",
        "explanation": {
          "value": 1.6943599,
          "description": "weight(_data:field:example in 0) [PerFieldSimilarity], result of:",
          "details": [
             {
                "value": 1.6943599,
                "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                "details": [
                   {
                      "value": 1.3862944,
                      "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details": [
                         {
                            "value": 1.0,
                            "description": "docFreq",
                            "details": []
                         },
                         {
                            "value": 5.0,
                            "description": "docCount",
                            "details": []
                          }
                       ]
                   },
                    {
                      "value": 1.2222223,
                      "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details": [
                         {
                            "value": 1.0,
                            "description": "termFreq=1.0",
                            "details": []
                         },
                         {
                            "value": 1.2,
                            "description": "parameter k1",
                            "details": []
                         },
                         {
                            "value": 0.75,
                            "description": "parameter b",
                            "details": []
                         },
                         {
                            "value": 5.4,
                            "description": "avgFieldLength",
                            "details": []
                         },
                         {
                            "value": 3.0,
                            "description": "fieldLength",
                            "details": []
                         }
                      ]
                   }
                ]
             }
          ]
       }
        ...
      }
    ],
    "total": 2
  },
  "links": {
    "prev": "http://<host:port>/api/records/?page=1&size=1",
    "self": "http://<host:port>/api/records/?page=2&size=1"
  }
}

Check matched terms with highlight

From Elasticsearch docs on highlighting:

Highlighters enable you to get highlighted snippets from one or more fields in your search results so you can show users where the query matches are. When you request highlights, the response contains an additional highlight element for each search hit that includes the highlighted fields and the highlighted fragments.

Send the list of fields you want to highlight in the following format:

curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=Lorem+ipsum+dolor&highlight=_data.*&highlight=other_field'

You can use wildcards to match all inner fields of a given field, e.g. _data.*.

The highlight field will follow this format:

    ...

    "highlight": {
        "_data.my_field": [
            "<em>Lorem</em> <em>ipsum</em> <em>dolor</em> sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et",
            "Duis aute irure <em>dolor</em> in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur"
        ]
    }

    ...

Advanced queries

The query parameter q (or query) specified in the URL can contain boolean expressions. For example, q=+value1 +value2 will match only when both value1 and value2 are matched. Be careful to URL-encode the + sign.

In addition, it accepts queries per field in the format field:value, where value can be a string or a boolean expression. In the latter case it has to be surrounded by parentheses. For example, q=field_x.field_y:quick +(field_3:brown field_3:fox) will match when either brown or fox matches field_3, while a match of quick in field_x.field_y is optional. It can also be written as q=field_x.field_y:quick +field_3:(brown || fox).

You can also manipulate the weights in the query, e.g. q=field1:value1 field2:(value1)^2.
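For instance, the boolean query above could be sent as follows; note how the + operator is encoded as %2B, while the space separating the two terms becomes +:

curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=%2Bvalue1+%2Bvalue2'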

From ES documentation, some important notes regarding boolean operators:

By default, all terms are optional, as long as one term matches. A search for foo bar baz will find any document that contains one or more of foo or bar or baz. We have already discussed the default_operator above which allows you to force all terms to be required, but there are also boolean operators which can be used in the query string itself to provide more control.

The preferred operators are + (this term must be present) and - (this term must not be present). All other terms are optional. For example, this query:

quick brown +fox -news

states that:

  • fox must be present
  • news must not be present
  • quick and brown are optional — their presence increases the relevance

The familiar operators AND, OR and NOT (also written &&, || and !) are also supported. However, the effects of these operators can be more complicated than is obvious at first glance. NOT takes precedence over AND, which takes precedence over OR. While the + and - only affect the term to the right of the operator, AND and OR can affect the terms to the left and right.

Refines or filters

Field-specific queries are very helpful for providing filters, especially when you know a field can only take a definite set of values, for example the document type (doc, ppt, pdf, etc.).
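As a sketch, assuming a hypothetical document_type field in your schema, a query for documents matching invoice that must be PDFs could look like this (+ encoded as %2B):

curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=invoice+%2Bdocument_type:pdf'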

Escape special characters

Remember to escape reserved characters if they have to be searched for rather than interpreted. The reserved characters are: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /. In addition, if you are running the query from the terminal, it might be necessary to URL-encode these characters. Lastly, substitute spaces with + so the URL is properly formatted.
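As a worked example, to search for the literal term C++ you would first escape each + (giving C\+\+) and then URL-encode the result for the terminal (\ is %5C, + is %2B):

curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=C%5C%2B%5C%2B'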

For more details and examples check the ES documentation on Query DSL » Full text queries » Query String Query. Currently we only support the field query (not fields nor default_field and so on).

Scoring type

Use the query parameter type to change how the query matches and scores documents.

Valid values are those permitted by Elasticsearch, as documented under valid values for type. The default is best_fields.

Example:

curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=author+book+year&type=cross_fields'

Wildcards

There are two characters that can be used as wildcards. The question mark, ?, matches any single character. For example, val?e would match value and val8e, but not valuu88e; the latter would be matched by val*e, where * matches any sequence of characters (including none, so it would also match vale). If you have to query for ? or * as a literal value and not as a wildcard, you just have to escape them.

Most wildcard queries have to be done over a keyword-analyzed field, meaning "exact match". Assuming a field named "wildtest" defined as:

"wildtest": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  }
}

Given a record where wildtest is ADB15*/test, we could perform the following queries:

  • With a wildcard: q=wildtest:ADB*.
  • Looking for the * character: q=wildtest:ADB15\*.

Due to the way text data is analyzed with the default analyzer, more complex queries need to be done over the keyword field:

  • Match the beginning of the word, followed by a wildcard, followed by an exact match of the * and / characters: q=wildtest.keyword:ADB*\*\/*. The first and last * are not escaped, therefore they work as wildcards. The / is escaped because it is a special character. The other * is escaped so it matches the literal character * in the string.
  • A similar variation of the above, but using a wildcard at the beginning. In addition, the ? is present. It works as a wildcard of exactly one character (therefore two are needed to account for the 1 and 5 in the string): q=wildtest.keyword:*DB??\*\/*.

Encode characters in your terminal

If you execute these queries via curl or similar terminal tools, be aware that some characters need to be URL-encoded (e.g. \ is %5C, / is %2F).
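Alternatively, curl can do the encoding for you: with -G, parameters passed via --data-urlencode are appended, already encoded, to the URL of a GET request. A sketch using the wildcard query above:

curl -G -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/' \
  --data-urlencode 'q=wildtest.keyword:ADB*\*\/*'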

Range and date queries

When querying for dates you can query for a range (apart from an exact date). The syntax follows the Elasticsearch DSL guidelines.

Assuming your document has a field called modification_date, of type date with format YYYY-MM-DD'T'HH:mm:ss, a query for the documents modified at a precise time would be:

curl -k -XGET -H 'Content-Type: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=modification_date:2019-08-05T17%5C:26%5C:15'

Encode and escape characters in your terminal

If you execute these queries via curl or similar terminal tools, be aware that some characters need to be URL-encoded (e.g. \ is %5C). In addition, the : characters need to be escaped so they are not interpreted as a field:value comparison.

For range queries you can use {} and [], the first being an exclusive range and the second an inclusive one. This means you can query modification_date:{* TO "2019-01-01T12:00:00"} and modification_date:[* TO "2019-01-01T12:00:00"]. The first will return all documents whose modification date is before (*) the first of January 2019 at midday, while the second will also include those modified exactly at midday.

Again, if you run these queries in a terminal, you will need to URL-encode them first.
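For instance, a sketch of the exclusive range query above with the URL encoding applied ({ is %7B, } is %7D, " is %22 and spaces become +):

curl -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/records/?q=modification_date:%7B*+TO+%222019-01-01T12:00:00%22%7D'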

Update documents

To update a document you need to perform a PUT operation. The ID or control_number of the record is part of the URL.

curl -X PUT -H 'Content-Type: application/json' -H 'Accept: application/json' \
    -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/record/<ID>' --data '
        {   
          "description": "This is an awesome updated description",
          "title": "Update Test 1",
          "class": "Z"
        }
        '

Documents can also be partially updated. In this case, you need to perform a PATCH. The endpoint accepts application/json-patch+json as Content-Type. An example to update the description would be:

curl -k -X PATCH -H 'Content-Type: application/json-patch+json' -H 'Accept: application/json' \
    -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'https://<host:port>/api/record/9' --data '
    [
      {
        "op": "replace",
        "path": "/description",
        "value": "Description changed with patch partial update"}
    ]'

Path and trailing slashes

Be careful not to leave a trailing slash at the end of the path attribute; the query will fail because such a path does not exist.

JSON Patch

More about the options of JSON Patch can be found here.
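For reference, a small illustration of other operations defined by the JSON Patch standard (RFC 6902); the paths used here are hypothetical:

[
  {"op": "add", "path": "/keywords", "value": ["search", "demo"]},
  {"op": "remove", "path": "/class"},
  {"op": "copy", "from": "/title", "path": "/original_title"}
]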

Update By Query

There might be cases where nested documents need to be updated depending on a value or condition. For instance, imagine a document that contains the key "articles", whose value is an array of objects with "title" and "description":

{
  "id": 1,
  "articles": [
    {
      "title": "Article one",
      "description": "It was a really good article"
    },
    {
      "title": "Article one, part two",
      "description": "It was a really bad article"
    },
    {
      "title": "No imagination for titles",
      "description": "Another random description"
    }
  ]
}

As you can see, in order to update some value in the "articles" field there are three options:

  • Use the PUT operation and send the full new document.
  • Use the PATCH operation and send the replace of the whole "articles" field.
  • Use the update by query endpoint, 'https://<host:port>/api/ubq/<record_id>', where record_id is the record control_number: the ID that would be used if you wanted to GET that document.

The difference between this endpoint and the other two options is that you send how to modify the field/document, which gives you more flexibility in the kind of updates you can perform. You need to send the script that will be executed in the ubq field of the request body.

For example:

{
    "ubq" : {
      "script": {
        "source": "for (item in ctx._source.articles){if (item.title == 'No imagination for titles'){item.content='Updated content via RESTful API script'}}",
        "lang": "painless"
      }
    },
    "$schema": "http://<host:port>/schemas/<search_instance>/<index_doc_type>.json"
}

You can take the JSON example above and change the content of the "script" field to make your own. However, in most cases changing just the fields and the updated values should suffice.
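As a hedged sketch of a full request (the POST verb is an assumption here; check the service reference for the exact contract), a script that changes the class of a record could be sent like this:

curl -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' \
    -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'https://<host:port>/api/ubq/<record_id>' --data '
    {
      "ubq": {
        "script": {
          "source": "ctx._source.class = \"C\"",
          "lang": "painless"
        }
      },
      "$schema": "http://<host:port>/schemas/<search_instance>/<index_doc_type>.json"
    }'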

This endpoint accepts all kinds of scripts that are accepted by the Update By Query operation in Elasticsearch.

Disclaimer

This operation might produce inconsistencies between the Elasticsearch and PostgreSQL storage (indexed data vs. historic/backup data) if your script does not function as expected. The Search service is not responsible for any data corruption or inconsistency produced by these operations, so please test your scripts in a local ES instance before proceeding. Although the platform is able to detect when these inconsistency errors occur and will inform you, it is not able to fix them, since there is no way to know which data is the correct one. The fastest way of solving inconsistency issues is to delete the document and index it again.

Info

As of 01-03-2019 bulk updates by query, meaning for more than one document at a time, are not yet supported. We are working to enable this.

Delete documents

To delete a document you need to perform a DELETE operation. You simply need to specify the document ID or control_number in the URL:

curl -XDELETE -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Authorization:Bearer <ACCESS_TOKEN>' -i 'http://<host:port>/api/record/<ID>'

If you afterwards perform any operation (GET, PUT, PATCH, DELETE) on that specific item, you will obtain a 410:

{
  "status": 410, 
  "message": "PID has been deleted."
}