The Great Mapping Refactoring
One of the biggest sources of pain for users of Elasticsearch today is the ambiguity that exists in type and field mappings. This ambiguity can result in exceptions at index time, exceptions at query time, incorrect results, results that change from request to request, and even index corruption and data loss.
In the quest to make Elasticsearch more solid and predictable, we have made a number of changes to make field and type mappings stricter and more reliable. In most cases, we only enforce the new rules when creating new indices in Elasticsearch v2.0, and we have provided a backwards compatibility layer which will keep your old indices functioning as before.
However, in certain cases, such as in the presence of conflicting field mappings as explained below, we are unable to do so.
You will not be able to upgrade indices with conflicting field mappings to Elasticsearch v2.0.
If the data in these indices is no longer needed, then you can simply delete the indices, otherwise you will need to reindex your data with correct mappings.
Changing how mappings work is not a decision that we take lightly. Below I explain the problems that exist today and the solutions that we have implemented.
Conflicting field mappings

In the past we have said that document types are “like tables in a database”, which is a nice simple way to explain their purpose. Unfortunately, it is just not true: fields with the same name in different document types in the same index map to the same Lucene field name internally.
If you have an `error` field which is mapped as an integer in the `apache` document type and as a string in the `nginx` document type, Elasticsearch will end up mixing numeric and string data in the same Lucene field. Trying to search or aggregate on this field will return wrong results, throw an exception, or even corrupt your index.
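For example, indexing these two documents into the same index would put an integer and a string into the same underlying Lucene field (the index and type names here are illustrative):

```
PUT my_index/apache/1
{ "error": 404 }

PUT my_index/nginx/1
{ "error": "Connection refused" }
```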
To resolve this problem, we first considered prefixing field names with the document type name, to make each field completely independent. The advantage of this approach is that document types would behave as real tables.
Unfortunately, it comes with a number of disadvantages:
- Fields would always require a document type prefix to disambiguate a field in one type from another, or a wildcard to query the same field in multiple types.
- Querying the same field name across document types would be slower as each field would have to be queried separately.
- Most searches would require using multi-field queries instead of the simpler `match` and `term` queries, which would break most existing queries.
- Heap usage, disk usage, and I/O would increase as there would be more sparsity and poorer compression.
- Aggregations across document types would become much slower and more memory hungry as they wouldn’t be able to take advantage of global ordinals.
The solution
In the end, we opted to enforce the rule that all fields with the same name in the same index must have the same mapping, although certain parameters such as `copy_to` and `enabled` are still allowed to be specified per-type. This resolves the issues with data corruption, query-time exceptions, and incorrect results. Queries and aggregations remain as fast as they are today, and we can maximise compression and minimise heap and disk usage.
The disadvantage of this solution is that users who have treated types as completely separate tables need to revise their approach. This is not as problematic as it sounds. In reality, most field names represent a certain type of data: a `created_date` is always going to be a date, a `number_of_hits` field is always going to be numeric. Users who have conflicting field mappings today are either getting wrong results or losing their data to corruption. The real difference is that we are now enforcing correct behaviour at index time, instead of relying on the user to follow best practice.
While the majority of users have non-conflicting field mappings, conflicts do sometimes exist. There are a few techniques for dealing with these situations:
Use separate indices

This is the simplest solution. Indices are completely independent of each other and so behave like real database tables. Cross-index queries work just as well as cross-type queries, and cross-index sorting and aggregations will continue to work as long as the fields being queried have the same data type, which is the same limitation that applies today.
Rename the fields

When only a few conflicts exist, they can be resolved by changing field names to be more descriptive, either in the application or using Logstash. For instance, two `error` fields could be renamed to `error_code` and `error_message` respectively.
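A minimal sketch of the renaming approach in application code, assuming the `apache`/`nginx` types and field names from the example above (the function itself is illustrative, not part of Elasticsearch):

```python
def rename_conflicting_field(doc, doc_type):
    """Rename the ambiguous 'error' field to a type-specific name
    before indexing, so that both mappings can coexist in one index.
    The type and field names here are illustrative assumptions."""
    doc = dict(doc)  # avoid mutating the caller's document
    if "error" not in doc:
        return doc
    if doc_type == "apache":
        doc["error_code"] = doc.pop("error")      # integer status code
    elif doc_type == "nginx":
        doc["error_message"] = doc.pop("error")   # free-text message
    return doc

rename_conflicting_field({"error": 404}, "apache")
# -> {"error_code": 404}
```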
Use copy_to

Fields in different document types are allowed to have different `copy_to` settings. The original `error` field can have `index` set to `no` to essentially disable it in all document types, but its value can be copied to the integer `error_code` field in one type:
```
PUT my_index/_mapping/mapping_one
{
  "properties": {
    "error": {
      "type": "string",
      "index": "no",
      "copy_to": "error_code"
    },
    "error_code": {
      "type": "integer"
    }
  }
}
```
and to the string `error_message` field in another type:
```
PUT my_index/_mapping/mapping_two
{
  "properties": {
    "error": {
      "type": "string",
      "index": "no",
      "copy_to": "error_message"
    },
    "error_message": {
      "type": "string"
    }
  }
}
```
A similar solution can be achieved with multi-fields.
Use nested key-value fields

Sometimes you have no control over the documents sent to Elasticsearch, nor over the fields they contain. Besides the potential for conflicts, blindly accepting whatever fields your users send can result in a mapping explosion. Imagine what happens with documents that use a timestamp or an IP address as a field name.
Instead, a separate `nested` field can be used for each data type, such as `str_val`, `int_val`, `date_val`, etc. To use this approach, a document like this:
```
{
  "message": "some string",
  "count": 1,
  "date": "2015-06-01"
}
```
would need to be reformatted by the application into:
```
{
  "data": [
    { "key": "message", "str_val": "some string" },
    { "key": "count",   "int_val":  1 },
    { "key": "date",    "date_val": "2015-06-01" }
  ]
}
```
While this solution requires more work on the application side, it solves both the conflict problem and the mapping-explosion problem at the same time.
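The transformation above can be sketched in application code. This is a minimal illustration; the type-detection rules and the `*_val` suffix names are assumptions that a real implementation would adapt to its own schema:

```python
import re
from datetime import date, datetime

# Matches ISO dates like "2015-06-01"; real date detection is an
# assumption here and would depend on your own schema.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def to_typed_kv(doc):
    """Convert a free-form document into typed key-value entries,
    so that each Lucene field only ever holds one data type.
    The suffix names (str_val, int_val, date_val) are illustrative."""
    entries = []
    for key, value in doc.items():
        if isinstance(value, bool):  # check bool before int: bool is an int subclass
            entries.append({"key": key, "bool_val": value})
        elif isinstance(value, int):
            entries.append({"key": key, "int_val": value})
        elif isinstance(value, (date, datetime)):
            entries.append({"key": key, "date_val": value.isoformat()})
        elif isinstance(value, str) and DATE_RE.match(value):
            entries.append({"key": key, "date_val": value})
        else:
            entries.append({"key": key, "str_val": str(value)})
    return {"data": entries}
```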
Ambiguous field lookup

Today, it is possible to refer to fields using their “short name”, the full path, or the full path prefixed by the document type. These options lead to ambiguity. Take this mapping for example:
```
{
  "mappings": {
    "user": {
      "properties": {
        "title": { "type": "string" }
      }
    },
    "blog": {
      "properties": {
        "title": { "type": "string" },
        "user": {
          "type": "object",
          "properties": {
            "title": { "type": "string" }
          }
        }
      }
    }
  }
}
```
- Does `title` refer to `user.title`, `blog.title`, or `blog.user.title`?
- Does `user.title` refer to `user.title` or `blog.user.title`?
The answer is: it depends which one Elasticsearch finds first. The field that is selected could even change from request to request, depending on how the mappings are serialised on each node.
In 2.0, you will have to use the full path name, without the document type prefix, to refer to fields:

- `user.title` maps only to the `user.title` field in the `blog` type.
- `title` maps to the `title` field in `user` and in `blog`.
- `*title` will match `user.title` and both `title` fields.
How would we differentiate between the `title` field in `user` and the `title` field in `blog`?

We don’t have to. Because of the change explained in Conflicting field mappings, the mapping for the `title` field is the same in both types. In essence, there is only one field called `title`.
The type prefix (`user.` or `blog.`) used to have the side effect of filtering by the specified type: querying the field `blog.title` would find only documents of type `blog`, not documents of type `user`. This syntax is misleading because it doesn’t work everywhere: aggregations or suggestions could contain results from any type. For this reason, plus the ambiguity demonstrated above, the type prefix is no longer supported.
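Conceptually, field lookup in 2.0 reduces to matching patterns against full path names. A rough illustration in Python (the `FIELDS` list and `lookup` helper are hypothetical, not Elasticsearch’s implementation):

```python
from fnmatch import fnmatch

# Full path names of the fields in the example mapping above.
# There is only one "title" entry because the mapping is the
# same in both types.
FIELDS = ["title", "user.title"]

def lookup(pattern):
    """Return the full-path field names matching a lookup pattern."""
    return [f for f in FIELDS if fnmatch(f, pattern)]

lookup("user.title")  # -> ["user.title"]
lookup("title")       # -> ["title"]
lookup("*title")      # -> ["title", "user.title"]
```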
IMPORTANT: You will need to update any percolators which make use of short names or the type prefix.
Meta-fields

Every type has meta-fields like `_id`, `_index`, `_routing`, `_parent`, and `_timestamp`, most of which supported various configuration options like `index`, `store`, or `path`. We have simplified these settings considerably:
- `_id` and `_type` are no longer configurable.
- `_index` may be `enabled`, which will store the index name with the document.
- `_routing` may be marked as `required` only.
- `_size` may be `enabled` only.
- `_timestamp` is stored by default.
- The `_boost` and `_analyzer` fields have been removed, and will be ignored on old indices.
It used to be possible to extract the `_id`, `_routing`, and `_timestamp` values from fields in the document body. This functionality has been removed because it required two rounds of document parsing and could result in conflicts. Instead, these values must be set explicitly in the URL or query string.
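For example, routing and timestamp values can now be supplied as URL parameters at index time. A sketch (the index, type, and parameter values here are illustrative):

```
PUT my_index/my_type/1?routing=user_1&timestamp=2015-06-01T00:00:00Z
{
  "title": "A blog post"
}
```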
With the exception of the `_boost` and `_analyzer` fields, the existing meta-field configuration will be respected on old indices.
Analyzer settings

It used to be possible to specify index and search analyzers at index level, at type level, at field level, and even at document level (with the `_analyzer` field). Combining tokens from different analysis chains into the same field results in bad relevance scores. With the move to disallowing conflicting field mappings, we have simplified the analysis settings considerably:
- Each analyzed string field accepts an `analyzer` setting and a `search_analyzer` setting (which defaults to the value of the `analyzer` setting). The `index_analyzer` setting has been replaced by `analyzer`.
- If a field with the same name exists in multiple types, all copies of the field must have the same analyzer settings.
- The type-level default `analyzer`, `index_analyzer`, and `search_analyzer` settings are no longer supported.
- Default analyzers may be set per-index in the index `analysis` settings, by naming them `default` or `default_search`.
- The per-document `_analyzer` field is no longer supported and will be ignored on existing indices.
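For instance, per-index default analyzers might be configured like this (a sketch; the index name and the choice of `english` and `standard` analyzers are illustrative):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "english"
        },
        "default_search": {
          "type": "standard"
        }
      }
    }
  }
}
```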
index_name and path

The `index_name` and `path` settings have been removed in favour of `copy_to`, which has been available since Elasticsearch v1.0.0. They will continue to work on existing indices, but will no longer be accepted when creating new indices.
Safer mapping updates

Today, when indexing a document which contains a previously unseen field, the field is added to the mapping locally, then forwarded to the master, which issues a cluster update to inform all shards of the new mapping. It is possible that the same field could be added to two shards at the same time, and that these two mappings could be different: one could be a `double` while the other is a `long`, or one could be a `string` while the other is a `date`.
In these cases, the first mapping to arrive at the master node wins. However, the shard with the “losing” mapping is already using a different data type, and will continue to use it. Later on, perhaps after restarting that node, the shard is moved to a different node and the official mapping from the master is applied. This results in index corruption and data loss.
To prevent this, a shard will now wait for the master to accept the new mapping before allowing indexing to continue. This makes all mapping updates deterministic and safe. Indexing a document containing new fields may be somewhat slower than before, because of the need to wait for acceptance, but the speed of cluster state updates has been greatly improved by two new features:
- Cluster state diffs: Whenever possible, only changes to the cluster state are published instead of the entire cluster state.
- Async shard info requests: During the shard allocation process, the master node sends a request to the data nodes to find out which ones have the most up-to-date copies of unassigned shards. This used to be a blocking call which would delay changes to the cluster state. As of v1.6.0, this request happens asynchronously in the background, allowing pending tasks like mapping updates to be processed more quickly.
Deleting mappings
Finally, it is no longer possible to delete a type mapping (along with the documents of that type). Even after removing a mapping, information about the deleted fields continues to exist at the Lucene level, which can cause index corruption if fields of the same name are added later on. You can either leave mappings as they are or reindex your data into a new index.
Preparing for 2.0
Determining whether you have conflicting mappings can be tricky to do by hand. We have provided the Elasticsearch Migration Plugin to help you figure it out, and to warn you about features you are currently using which have been deprecated or removed in 2.0.
If you have conflicting mappings, you will either need to reindex your data into a new index with correct mappings, or to delete old indices that you no longer need. You will not be able to upgrade to 2.0 without resolving these conflicts.