Documents, collections and indexes¶
This section shows how to convert your document description into a schema. Let's assume your document has three fields, class, title and description, all of text type (a string, that is, an array of characters of unbounded length). Its schema and mapping would be as follows.
Full text search¶
The default field for global search, meaning search that is not restricted to specific fields, is _data.
You must therefore define it and place inside it every field that should be queried by global searches.
JSON schema¶
{
  "title": "Custom record schema v0.0.1",
  "id": "http://<host:port>/schemas/instance/test-doc_v0.0.1.json",
  "$schema": "http://<host:port>/schemas/instance/test-doc_v0.0.1.json",
  "type": "object",
  "properties": {
    "_access": {
      "type": "object",
      "properties": {
        "owner": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "read": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "update": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "delete": {
          "type": "array",
          "items": {
            "type": "string"
          }
        }
      }
    },
    "_data": {
      "type": "object",
      "properties": {
        "title": {
          "type": "string",
          "description": "Record title."
        },
        "description": {
          "type": "string",
          "description": "Description for record."
        }
      }
    },
    "class": {
      "type": "string",
      "description": "Record class."
    },
    "control_number": {
      "type": "string"
    },
    "$schema": {
      "type": "string"
    }
  }
}
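To make the schema more concrete, the following Python sketch builds a record with the shape described above. The helper name `make_record` and the value "example-group" are illustrative assumptions, not part of the service. The values that `_access` actually expects are covered in the access control and permissions section.

```python
import json

# Placeholder copied from the schema above; replace <host:port> with your instance.
SCHEMA_URL = "http://<host:port>/schemas/instance/test-doc_v0.0.1.json"


def make_record(title, description, record_class, principals):
    """Build a record dict shaped like the JSON schema above (hypothetical helper)."""
    return {
        "_access": {
            # Assumption for brevity: the same principals get every permission.
            "owner": list(principals),
            "read": list(principals),
            "update": list(principals),
            "delete": list(principals),
        },
        # Fields under _data are the ones reached by global searches.
        "_data": {"title": title, "description": description},
        # class lives outside _data, next to the bookkeeping fields.
        "class": record_class,
        "$schema": SCHEMA_URL,
    }


record = make_record("CERN Search", "A test document.", "dummy", ["example-group"])
print(json.dumps(record, indent=2))
```

The sketch omits control_number, which the schema declares but this section does not describe how to populate.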
Elasticsearch mapping¶
{
  "settings": {
    "index.percolator.map_unmapped_fields_as_string": true,
    "index.mapping.total_fields.limit": 100
  },
  "mappings": {
    "test-doc_v0.0.1": {
      "numeric_detection": true,
      "properties": {
        "_access": {
          "type": "nested",
          "properties": {
            "owner": {
              "type": "keyword"
            },
            "read": {
              "type": "keyword"
            },
            "update": {
              "type": "keyword"
            },
            "delete": {
              "type": "keyword"
            }
          }
        },
        "_data": {
          "type": "object",
          "properties": {
            "title": {
              "type": "keyword",
              "fields": {
                "english": {
                  "type": "text",
                  "analyzer": "english"
                },
                "french": {
                  "type": "text",
                  "analyzer": "french"
                }
              }
            },
            "description": {
              "type": "text",
              "fields": {
                "english": {
                  "type": "text",
                  "analyzer": "english"
                },
                "french": {
                  "type": "text",
                  "analyzer": "french"
                }
              }
            }
          }
        },
        "class": {
          "type": "keyword"
        },
        "control_number": {
          "type": "keyword"
        },
        "$schema": {
          "enabled": false
        }
      }
    }
  }
}
As you can see, both have a similar structure, with several extra fields besides the three mentioned earlier (class, title and description).
You do not have to worry about them, except for _access, which is used for access control. How it works is explained in more detail in the access control and permissions section.
The rest of the extra fields and attributes are explained in the following paragraphs. They are shown here for information only, and you do not need to manage them. We set them to give the best possible performance for your use case definition (e.g. exact match on a field, full text in English or French).
The search type is specified in the type field of each attribute. When keyword is specified, searches on this field produce a match only if the query string is exactly the same as the field value.
This is the case for title, where the value "CERN Search" produces a match only for the query string "CERN Search" (unless wildcards are applied).
The type text, on the other hand, is preprocessed (analyzed) for the specified language, in this case French and English.
Therefore the query string "fox" will produce a match for both "fox" and "foxes".
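The difference between the two types can be illustrated with a small Python sketch. It is a toy model only: `toy_analyze` is a hypothetical stand-in that lowercases and strips simple plural endings, and it does not reproduce what the Elasticsearch english or french analyzers really do.

```python
def keyword_match(field_value, query):
    """A keyword field matches only when the whole value equals the query."""
    return field_value == query


def toy_analyze(text):
    """Rough stand-in for a language analyzer: lowercase, split, strip plural endings."""
    tokens = []
    for word in text.lower().split():
        if word.endswith("es") and len(word) > 3:
            word = word[:-2]
        elif word.endswith("s") and len(word) > 2:
            word = word[:-1]
        tokens.append(word)
    return tokens


def text_match(field_value, query):
    """A text field matches when at least one analyzed query token is in the analyzed value."""
    value_tokens = set(toy_analyze(field_value))
    return any(token in value_tokens for token in toy_analyze(query))


print(keyword_match("CERN Search", "CERN Search"))
print(keyword_match("CERN Search", "Search"))
print(text_match("The quick brown foxes", "fox"))
```

Under these assumptions the keyword comparison succeeds only for the identical string, while the analyzed comparison treats "fox" and "foxes" as the same token.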
You will also note that the field class is not under the _data attribute. This means that it is searched only if the query names it explicitly as the field to query.
This is useful for filtering, for example to obtain only the records with record class "dummy". Note that full text searches are performed against the _data field, and therefore only fields under it will produce hits.
More about this is explained in the querying documents section.
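As an illustration of combining a full text search over _data with a filter on class, here is a sketch that builds a plain Elasticsearch query DSL body in Python. The function name `build_query` is hypothetical, and how the service itself accepts or rewrites queries is described in the querying documents section, not here.

```python
import json


def build_query(text, record_class=None):
    """Build a query body: full text over _data, optional exact filter on class."""
    query = {
        "bool": {
            "must": [
                # Restrict the full text search to the fields under _data.
                {"query_string": {"query": text, "fields": ["_data.*"]}}
            ]
        }
    }
    if record_class is not None:
        # class is a keyword field, so a term filter matches it exactly.
        query["bool"]["filter"] = [{"term": {"class": record_class}}]
    return {"query": query}


print(json.dumps(build_query("fox", record_class="dummy"), indent=2))
```

Keeping the class condition in the filter clause means it narrows the result set without affecting relevance scoring.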
Lastly, you can see the attribute enabled in the field $schema.
Since it is set to false, the field is stored but not indexed. This is always the case: the schema value only needs to be present for in-application tasks and is not relevant for searching.
Naming conventions
Fields can have any name as long as it does not collide with an Elasticsearch reserved keyword.
Validating the deployed schema¶
You can use this endpoint to get the deployed schema:
https://<host:port>/schemas/instance/test-doc_v0.0.1.json
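A minimal sketch of calling this endpoint from Python with the standard library follows. The function names and the host in the usage comment are hypothetical, and the sketch assumes the endpoint can be read without authentication, which may not hold for your instance.

```python
import json
import urllib.request


def schema_url(host, schema_name):
    """Return the schema endpoint URL for a host (with optional port) and schema file name."""
    return f"https://{host}/schemas/instance/{schema_name}"


def fetch_schema(host, schema_name, timeout=10):
    """Download and parse the deployed schema. Performs a network request."""
    url = schema_url(host, schema_name)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage (hypothetical host, not executed here):
# schema = fetch_schema("search.example.org", "test-doc_v0.0.1.json")
# print(sorted(schema["properties"]))
```

Comparing the returned properties with the fields you requested is a quick way to confirm that the expected version was deployed.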
Standards / Recommendations¶
For results to be shown and searchable in the main search (optional), you must adhere to the following standards:
1) Mapping should contain the following fields:
- _data.name or _data.title or file
- _data.url, e.g. https://indico.cern.ch/event/1125222/
- _data.site, e.g. indico.cern.ch
Optional fields (shown if they exist):
- _data.authors
- _data.description or _data.content
- _data.keywords
- _data.collection, e.g. File, Web page
- _data.type, e.g. PDF, Official
2) ACLs: for private documents to be shown in the main search, based on SSO ACLs, please follow the standards in permissions. If they are not followed, only public results will be shown.
Example mappings can be seen in /examples.
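The field requirements in point 1 can be checked mechanically. The sketch below flattens the "properties" of a mapping into dotted paths and reports which required groups are missing. The helper names are hypothetical, and it assumes that file is a top-level field, as it is written in the list above.

```python
def field_paths(properties, prefix=""):
    """Flatten a mapping's "properties" dict into a set of dotted field paths."""
    paths = set()
    for name, definition in properties.items():
        path = f"{prefix}{name}"
        paths.add(path)
        nested = definition.get("properties")
        if isinstance(nested, dict):
            paths |= field_paths(nested, prefix=f"{path}.")
    return paths


# Each entry is a group of alternatives; at least one path per group must exist.
REQUIRED = [
    ("_data.name", "_data.title", "file"),
    ("_data.url",),
    ("_data.site",),
]


def missing_required(properties):
    """Return the groups of alternatives that the mapping does not satisfy."""
    paths = field_paths(properties)
    return [group for group in REQUIRED if not any(p in paths for p in group)]


example = {
    "_data": {
        "type": "object",
        "properties": {
            "title": {"type": "keyword"},
            "url": {"type": "keyword"},
        },
    }
}
print(missing_required(example))
```

An empty result means every required group has at least one of its alternatives present in the mapping.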
3) Recommendations
Official Collections:
- Web page
- File
- Media
Official Types (for each collection):
- Collection Web page:
  - Official
  - Personal
  - Test
  - Documentation
- Collection Media:
  - Video
- Collection File (read more under files):
  - Document: ["doc", "docx", "odt", "pages", "rtf", "tex", "wpd", "txt"]
  - PDF: ["pdf"]
  - Sheet: ["ods", "xlsx", "xlsm", "xls", "numbers"]
  - Slides: ["ppt", "pptx", "pps", "odp", "key"]
  - Other
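The extension lists for the File collection translate directly into a lookup table. The sketch below derives the type from a file name. The function name is hypothetical, and it assumes that extensions are compared case-insensitively and that Other is the fallback for anything not listed.

```python
import os

# Extension lists copied from the official types of the File collection.
FILE_TYPES = {
    "Document": ["doc", "docx", "odt", "pages", "rtf", "tex", "wpd", "txt"],
    "PDF": ["pdf"],
    "Sheet": ["ods", "xlsx", "xlsm", "xls", "numbers"],
    "Slides": ["ppt", "pptx", "pps", "odp", "key"],
}

# Invert the table once so that each lookup is a single dict access.
EXTENSION_TO_TYPE = {
    ext: type_name for type_name, exts in FILE_TYPES.items() for ext in exts
}


def file_type(filename):
    """Return the official type for a file name, or "Other" if the extension is unlisted."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return EXTENSION_TO_TYPE.get(ext, "Other")


for name in ("report.PDF", "budget.xlsx", "talk.key", "archive.zip"):
    print(name, file_type(name))
```

The returned value is what would go into _data.type for a document whose _data.collection is File.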