Documents, collections and indexes

This section shows how to convert your document description into a schema. Let's assume your document has three fields: class, title and description, all of them of text type (a string, that is, an unbounded sequence of characters). Its schema and mapping would be as follows.

The default field for global search (a search that does not target specific fields) is _data. You must therefore define it and place inside it every field that should be queried by global searches.

JSON schema

{
  "title": "Custom record schema v0.0.1",
  "id": "http://<host:port>/schemas/instance/test-doc_v0.0.1.json",
  "$schema": "http://<host:port>/schemas/instance/test-doc_v0.0.1.json",
  "type": "object",
  "properties": {
    "_access": {
      "type": "object",
      "properties": {
        "owner": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "read": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "update": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "delete": {
          "type": "array",
          "items": {
            "type": "string"
          }
        }
      }
    },
    "_data": {
      "type": "object",
      "properties": {
        "title": {
          "type": "string",
          "description": "Record title."
        },
        "description": {
          "type": "string",
          "description": "Description for record."
        }
      }
    },
    "class": {
      "type": "string",
      "description": "Record class."
    },
    "control_number": {
      "type": "string"
    },
    "$schema": {
      "type": "string"
    }
  }
}
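
To make the schema more concrete, the following Python sketch builds a document laid out the way the schema describes. The function name build_record, the group names and the example values are hypothetical; only the field layout is taken from the schema above. It also assumes that control_number is assigned by the service and therefore does not need to be sent.

```python
import json


def build_record(title, description, record_class, owners, readers):
    """Return a document shaped like the JSON schema above (hypothetical helper)."""
    return {
        "_access": {
            # Each entry is a list of strings, as required by the schema.
            "owner": list(owners),
            "read": list(readers),
            "update": list(owners),
            "delete": list(owners),
        },
        # Fields meant for global search live under _data.
        "_data": {"title": title, "description": description},
        # class stays outside _data, so it is only matched when queried explicitly.
        "class": record_class,
        "$schema": "http://<host:port>/schemas/instance/test-doc_v0.0.1.json",
    }


record = build_record(
    title="CERN Search",
    description="An example record.",
    record_class="dummy",
    owners=["example-owners"],      # hypothetical group name
    readers=["example-readers"],    # hypothetical group name
)
print(json.dumps(record, indent=2))
```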

Elasticsearch mapping

{
  "settings": {
    "index.percolator.map_unmapped_fields_as_string": true,
    "index.mapping.total_fields.limit": 100
  },
  "mappings": {
    "test-doc_v0.0.1": {
      "numeric_detection": true,
      "properties": {
        "_access": {
          "type": "nested",
          "properties": {
            "owner": {
              "type": "keyword"
            },
            "read": {
              "type": "keyword"
            },
            "update": {
              "type": "keyword"
            },
            "delete": {
              "type": "keyword"
            }
          }
        },
        "_data": {
          "type": "object",
          "properties": {
            "title": {
              "type": "keyword",
              "fields": {
                "english": {
                  "type": "text",
                  "analyzer": "english"
                },
                "french": {
                  "type": "text",
                  "analyzer": "french"
                }
              }
            },
            "description": {
              "type": "text",
              "fields": {
                "english": {
                  "type": "text",
                  "analyzer": "english"
                },
                "french": {
                  "type": "text",
                  "analyzer": "french"
                }
              }
            }
          }
        },
        "class": {
          "type": "keyword"
        },
        "control_number": {
          "type": "keyword"
        },
        "$schema": {
          "enabled": false
        }
      }
    }
  }
}

As you can see, both have a similar structure, with several extra fields besides the three mentioned before (class, title and description). You do not have to worry about them, except for _access, which is used for access control. How it works is explained in more detail in the access control and permissions section.

The rest of the extra fields and attributes are explained in the following paragraphs. They are shown here for information only; you do not need to manage them. We will set them to give the best possible performance according to your use case definition (e.g. exact match on a field, full text in English or French).

The search type is specified in the type attribute of each field. When keyword is specified, searches on this field produce a match only if the query string is exactly the same as the field value. This is the case for title, where the value "CERN Search" will produce a match only for the query string "CERN Search" (unless wildcards are applied). Fields of type text, on the other hand, are preprocessed (analyzed) for the specified language, in this case French and English. Therefore the query string "fox" will produce a match for both "fox" and "foxes".
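
The difference between the two types can be illustrated with the Elasticsearch Query DSL. The sketch below only builds the query bodies as Python dictionaries. The helper names are hypothetical, and whether the search service accepts raw Query DSL is an assumption not covered in this section; the term and match queries themselves are standard Elasticsearch queries.

```python
import json


def exact_match_query(field, value):
    """Build a term query: the value is compared as-is, with no analysis."""
    return {"query": {"term": {field: value}}}


def full_text_query(field, text):
    """Build a match query: the text is analyzed with the analyzer of the field."""
    return {"query": {"match": {field: text}}}


# keyword field: only the exact value "CERN Search" matches.
title_query = exact_match_query("_data.title", "CERN Search")

# text sub-field with the english analyzer: "fox" is expected to match "foxes" too.
description_query = full_text_query("_data.description.english", "fox")

print(json.dumps(title_query))
print(json.dumps(description_query))
```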

You will also note that the field class is not under the _data attribute. This means that it is only searched when the query explicitly names it as the field to query. This is useful for filtering, for example to obtain only the records with record class "dummy". Note that full text searches are performed against the _data field, and therefore only fields under it will produce hits. This is explained further in the querying documents section.
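
As a rough illustration of this split, the following sketch combines a full text search over the fields under _data with a filter on class, again expressed as an Elasticsearch Query DSL body. The helper name filtered_search is hypothetical, and the query syntax actually exposed by the service is described in the querying documents section and may differ.

```python
import json


def filtered_search(text, record_class):
    """Full text search over _data fields, restricted to one record class."""
    return {
        "query": {
            "bool": {
                # Scored full text part: every field under _data is searched.
                "must": {"multi_match": {"query": text, "fields": ["_data.*"]}},
                # Non-scored filter: class is matched exactly (keyword field).
                "filter": {"term": {"class": record_class}},
            }
        }
    }


query = filtered_search("fox", "dummy")
print(json.dumps(query, indent=2))
```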

Lastly, note the attribute enabled in the field $schema. Since it is set to false, the value is stored but not indexed. This is always the case: the schema value only needs to be present for in-application tasks and is not relevant for searching.

Naming conventions

Fields can have any name as long as it does not collide with Elasticsearch reserved keywords.

Validating the deployed schema

You can use this endpoint to get the deployed schema:

https://<host:port>/schemas/instance/test-doc_v0.0.1.json
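
A minimal sketch of retrieving the schema from Python with the standard library is shown below. The helper names and the example host are hypothetical, and the sketch assumes the endpoint can be read without authentication, which this section does not state.

```python
import json
from urllib.request import urlopen


def schema_url(host, name, version):
    """Build the schema endpoint URL following the pattern shown above."""
    return f"https://{host}/schemas/instance/{name}_v{version}.json"


def fetch_schema(url, timeout=10):
    """Download and parse the deployed schema (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


url = schema_url("search.example.org", "test-doc", "0.0.1")  # hypothetical host
print(url)
# schema = fetch_schema(url)  # uncomment to query a real instance
```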

Standards / Recommendations

For results to be shown and searchable in the main search (optional), you must adhere to the following standards:

1) The mapping should contain the following fields:

  • _data.name or _data.title or file
  • _data.url: e.g. https://indico.cern.ch/event/1125222/
  • _data.site: e.g. indico.cern.ch

Optional fields (shown if they exist):

  • _data.authors
  • _data.description or _data.content
  • _data.keywords
  • _data.collection, e.g. File, Web page
  • _data.type, e.g. PDF, Official
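
The required fields listed above can be checked before a mapping is submitted. The sketch below is one possible check, not part of the service. It assumes that _data is declared as an object with its own properties block and that file, when used, is a top-level field; the function name is hypothetical.

```python
def missing_main_search_fields(properties):
    """Return the required main search fields absent from a mapping's properties."""
    data_fields = properties.get("_data", {}).get("properties", {})
    missing = []

    # One of _data.name, _data.title or file must be present.
    has_label = any(name in data_fields for name in ("name", "title"))
    if not (has_label or "file" in properties):
        missing.append("_data.name or _data.title or file")

    # _data.url and _data.site are both required.
    for field in ("url", "site"):
        if field not in data_fields:
            missing.append("_data." + field)
    return missing


example_properties = {
    "_data": {
        "type": "object",
        "properties": {
            "title": {"type": "keyword"},
            "url": {"type": "keyword"},
        },
    },
}
print(missing_main_search_fields(example_properties))
```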

2) ACLs: For private documents to be shown in the main search, based on SSO ACLs, please follow the standards in permissions. If they are not followed, only public results will be shown.

Example mappings can be seen in /examples.

3) Recommendations

Official Collections:

  • Web page
  • File
  • Media

Official Types (for each collection):

  • Collection Web page:

    • Official
    • Personal
    • Test
    • Documentation
  • Collection Media:

    • Video
  • Collection File (read more under files):

    • Document: ["doc", "docx", "odt", "pages", "rtf", "tex", "wpd", "txt"]
    • PDF: ["pdf"]
    • Sheet: ["ods", "xlsx", "xlsm", "xls", "numbers"]
    • Slides: ["ppt", "pptx", "pps", "odp", "key"]
    • Other
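
The extension lists for the File collection translate directly into a lookup table. The sketch below shows one way to derive _data.type from a file name; the function name is hypothetical, and treating extensions as case-insensitive is an assumption.

```python
import os

# Official types for the File collection, as listed above.
FILE_TYPES = {
    "Document": ["doc", "docx", "odt", "pages", "rtf", "tex", "wpd", "txt"],
    "PDF": ["pdf"],
    "Sheet": ["ods", "xlsx", "xlsm", "xls", "numbers"],
    "Slides": ["ppt", "pptx", "pps", "odp", "key"],
}

# Reverse index: extension -> official type.
EXTENSION_TO_TYPE = {
    extension: file_type
    for file_type, extensions in FILE_TYPES.items()
    for extension in extensions
}


def official_file_type(filename):
    """Return the official type for a file name, or "Other" if unknown."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return EXTENSION_TO_TYPE.get(extension, "Other")


for name in ("minutes.docx", "budget.xlsx", "talk.pptx", "archive.zip"):
    print(name, official_file_type(name))
```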