ES Source Code Analysis: Data Type Conversion

Tech, 06-27. Source: 一米雪碧钱钱

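Some colleagues asked me why we advise against using the nested type, and why it is said to perform poorly. So how does ES actually handle structures such as nested? Since I have read through most of the ES source code, I decided to write it up as notes, which is more pleasant than leaving comments directly in the code.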

1 Problem description

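ES is more than a clustered version of Lucene; it also supports rich data types such as nested and object. How does it support them?
ES also has fields such as _id and _version. How are these stored?
I have heard that ES stores the parent doc and its nested docs separately. When they are read back, what links them together?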

2 Type conversion

2.1 Initial code entry point

The entry point is org.elasticsearch.index.shard.IndexShard#prepareIndex

    public static Engine.Index prepareIndex(DocumentMapperForType docMapper, SourceToParse source, long seqNo,
                                            long primaryTerm, long version, VersionType versionType, Engine.Operation.Origin origin,
                                            long autoGeneratedIdTimestamp, boolean isRetry,
                                            long ifSeqNo, long ifPrimaryTerm) {
        long startTime = System.nanoTime();

        // Conversion of nested and similar structures happens here; see section 2.2
        ParsedDocument doc = docMapper.getDocumentMapper().parse(source);

        // Attach the pending mapping update, if there is one
        if (docMapper.getMapping() != null) {
            doc.addDynamicMappingsUpdate(docMapper.getMapping());
        }

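        // Turn _id into the uid term. Uid.encodeId produces a uniform, compact binary form of the id,
        // which helps compression (the author compares the idea to Huffman coding).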
        Term uid = new Term(IdFieldMapper.NAME, Uid.encodeId(doc.id()));

        return new Engine.Index(uid, doc, seqNo, primaryTerm, version, versionType, origin, startTime, autoGeneratedIdTimestamp, isRetry,
            ifSeqNo, ifPrimaryTerm);
    }
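
To make the "compact form" remark in the comment above concrete, here is a minimal sketch of one way a purely numeric id can be stored in fewer bytes. It is an illustration only: `CompactIdSketch`, `packDigits` and `unpackDigits` are hypothetical names, and the byte layout is an assumption for this example, not the actual format produced by `Uid.encodeId`.

```java
final class CompactIdSketch {

    /** Packs decimal digits two per byte; an odd trailing digit is padded with the nibble 0xF. */
    static byte[] packDigits(String digits) {
        if (!digits.matches("[0-9]+")) {
            throw new IllegalArgumentException("expected a non-empty string of digits: " + digits);
        }
        int n = digits.length();
        byte[] out = new byte[(n + 1) / 2];
        for (int i = 0; i < n; i += 2) {
            int high = digits.charAt(i) - '0';
            // 0xF can never be a decimal digit, so it is safe to use as padding
            int low = (i + 1 < n) ? digits.charAt(i + 1) - '0' : 0x0F;
            out[i / 2] = (byte) ((high << 4) | low);
        }
        return out;
    }

    /** Reverses packDigits. */
    static String unpackDigits(byte[] packed) {
        StringBuilder sb = new StringBuilder(packed.length * 2);
        for (byte b : packed) {
            sb.append((char) ('0' + ((b >> 4) & 0x0F)));
            int low = b & 0x0F;
            if (low != 0x0F) {
                sb.append((char) ('0' + low));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] packed = packDigits("12345");
        System.out.println(packed.length + " bytes for a 5-character id");
        System.out.println(unpackDigits(packed));
    }
}
```

The point of the sketch is only that ids with a regular shape can be written in a denser, uniform binary form before they become index terms.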

2.2 The conversion code in detail

    /**
     * Parses the document internally; nested structures, if any, are converted again along the way
     * @param mapping
     * @param context
     * @param parser
     * @throws IOException
     */
    private static void internalParseDocument(Mapping mapping, MetadataFieldMapper[] metadataFieldsMappers,
                                              ParseContext context, XContentParser parser) throws IOException {
        final boolean emptyDoc = isEmptyDoc(mapping, parser);

        /**
         * Pre-processing: add metadata fields such as _id and _version to the root document as Lucene fields; see section 2.3 below
         */
        for (MetadataFieldMapper metadataMapper : metadataFieldsMappers) {
            metadataMapper.preParse(context);
        }

        if (mapping.root.isEnabled() == false) {
            // entire type is disabled
            parser.skipChildren();
        } else if (emptyDoc == false) {
            // Parse the object or nested structure. This method is called recursively, mainly to handle object and nested structures
            parseObjectOrNested(context, mapping.root);
        }

        // Post-processing: for example, add _version and similar fields to every non-root document
        for (MetadataFieldMapper metadataMapper : metadataFieldsMappers) {
            metadataMapper.postParse(context);
        }
    }
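
The three phases above (preParse, body parsing, postParse) can be sketched as follows. This is a simplified model under stated assumptions, not ES code: `MetaMapper` is a hypothetical stand-in for `MetadataFieldMapper`, a document is modelled as a plain `Map`, `putAll` stands in for `parseObjectOrNested`, and `_field_count` is an invented field used only to show that postParse runs after the body is available.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParsePipelineSketch {

    /** Hypothetical stand-in for MetadataFieldMapper: one hook before and one after the body. */
    interface MetaMapper {
        default void preParse(Map<String, Object> doc, String id) {}
        default void postParse(Map<String, Object> doc) {}
    }

    /** Adds _id before the body is parsed, as a field on the document. */
    static final MetaMapper ID = new MetaMapper() {
        @Override
        public void preParse(Map<String, Object> doc, String id) {
            doc.put("_id", id);
        }
    };

    /** Invented field (not an ES field): counts the fields present once the body has been parsed. */
    static final MetaMapper FIELD_COUNT = new MetaMapper() {
        @Override
        public void postParse(Map<String, Object> doc) {
            doc.put("_field_count", doc.size());
        }
    };

    static Map<String, Object> parse(String id, Map<String, Object> source, List<MetaMapper> mappers) {
        Map<String, Object> doc = new LinkedHashMap<>();
        for (MetaMapper m : mappers) {
            m.preParse(doc, id);      // phase 1: metadata that does not depend on the body
        }
        doc.putAll(source);           // phase 2: stands in for parseObjectOrNested
        for (MetaMapper m : mappers) {
            m.postParse(doc);         // phase 3: metadata that needs the parsed body
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("title", "es");
        System.out.println(parse("1", source, List.of(ID, FIELD_COUNT)));
    }
}
```

Note that in this model the metadata ends up as extra entries on the same document, which matches what the `preParse` code in the next section does with `context.doc().add(...)`.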

2.3 Pre-processing: supporting fields such as _id

Code location: org.elasticsearch.index.mapper.MetadataFieldMapper#preParse
Only the handling of _id is shown below.

    /**
     * _id is added to the document as a field of its own
     * @param context
     */
    @Override
    public void preParse(ParseContext context) {
        BytesRef id = Uid.encodeId(context.sourceToParse().id());
        context.doc().add(new Field(NAME, id, Defaults.FIELD_TYPE));
    }

This is just one example, _id. Other metadata fields such as _version, _seq_no and _source are handled in a similar way.

2.4 Converting complex structures such as nested

The way ES converts nested structures is rather interesting.

2.4.1 Overall entry point for the conversion

    /**
     * Parses an object or nested structure. The call is recursive, which is how arbitrarily deep object and nested structures are handled
     * @param context
     * @param mapper
     * @throws IOException
     */
    static void parseObjectOrNested(ParseContext context, ObjectMapper mapper) throws IOException {
        if (mapper.isEnabled() == false) {
            context.parser().skipChildren();
            return;
        }
        XContentParser parser = context.parser();
        XContentParser.Token token = parser.currentToken();
        if (token == XContentParser.Token.VALUE_NULL) {
            // the object is null ("obj1" : null), simply bail
            return;
        }

        String currentFieldName = parser.currentName();
        if (token.isValue()) {
            throw new MapperParsingException("object mapping for [" + mapper.name() + "] tried to parse field [" + currentFieldName
                + "] as object, but found a concrete value");
        }

        ObjectMapper.Nested nested = mapper.nested();
        // For a nested mapper, a new empty document is created every time. Because innerParseObject is recursive, one source object can end up as several Lucene documents
        if (nested.isNested()) {
            // Continue in section 2.4.2 below
            context = nestedContext(context, mapper);
        }

        // if we are at the end of the previous object, advance
        if (token == XContentParser.Token.END_OBJECT) {
            token = parser.nextToken();
        }
        if (token == XContentParser.Token.START_OBJECT) {
            // if we are just starting an OBJECT, advance, this is the object we are parsing, we need the name first
            token = parser.nextToken();
        }

        // Parse the fields of the object
        innerParseObject(context, mapper, parser, currentFieldName, token);

        // restore the enable path flag
        if (nested.isNested()) {
            nested(context, nested);
        }
    }

2.4.2 Initial entry point for nested conversion

    /**
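     * Sets up parsing of a nested structure by creating a new, empty nested document.
     * TODO (author's note): a nested document carries the same _id as its root document, and Lucene writes the documents of a block one after another. A lookup by _id therefore matches several Lucene documents, the nested ones included, and ES then has to resolve them internally into the result the caller expects.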
     * @param context
     * @param mapper
     * @return
     */
    private static ParseContext nestedContext(ParseContext context, ObjectMapper mapper) {

        // Create the nested context, which starts a new empty document. All fields and objects under this nested path are then added to it
        context = context.createNestedContext(mapper.fullPath());

        ParseContext.Document nestedDoc = context.doc();
        ParseContext.Document parentDoc = nestedDoc.getParent();

        // We need to add the uid or id to this nested Lucene document too,
        // If we do not do this then when a document gets deleted only the root Lucene document gets deleted and
        // not the nested Lucene documents! Besides the fact that we would have zombie Lucene documents, the ordering of
        // documents inside the Lucene index (document blocks) will be incorrect, as nested documents of different root
        // documents are then aligned with other root documents. This will lead to the nested query, sorting, aggregations
        // and inner hits to fail or yield incorrect results.
        IndexableField idField = parentDoc.getField(IdFieldMapper.NAME);
        if (idField != null) {
            // We just need to store the id as indexed field, so that IndexWriter#deleteDocuments(term) can then
            // delete it when the root document is deleted too.
            nestedDoc.add(new Field(IdFieldMapper.NAME, idField.binaryValue(), IdFieldMapper.Defaults.NESTED_FIELD_TYPE));
        } else {
            throw new IllegalStateException("The root document of a nested document should have an _id field");
        }

        // the type of the nested doc starts with __, so we can identify that its a nested one in filters
        // note, we don't prefix it with the type of the doc since it allows us to execute a nested query
        // across types (for example, with similar nested objects)
        nestedDoc.add(new Field(TypeFieldMapper.NAME, mapper.nestedTypePathAsString(), TypeFieldMapper.Defaults.NESTED_FIELD_TYPE));
        return context;
    }

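Read the English comments in the code carefully. The key point is that every nested document carries the same _id as its root (parent) document. A single _id term therefore reaches every Lucene document in the block, which makes deletion especially convenient: deleting by that term removes the root document and all of its nested documents together. This is the approach ES chose for this problem.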
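
The idea described above can be sketched as follows. This is a simplified model, not ES code: documents are plain maps, `toBlock` and `deleteById` are hypothetical helpers, and the sketch assumes that every list-valued field is mapped as nested and that its items are objects. The `__` prefix on the type marker follows the comment in `nestedContext`; placing the root document last in the block is how Lucene block indexing expects parent and children to be ordered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NestedBlockSketch {

    /** Turns one source object into a block of flat documents: nested documents first, root last. */
    static List<Map<String, Object>> toBlock(String id, Map<String, Object> source) {
        List<Map<String, Object>> block = new ArrayList<>();
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("_id", id);
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            if (entry.getValue() instanceof List) {
                // Assumption for this sketch: every list-valued field is mapped as nested
                for (Object item : (List<?>) entry.getValue()) {
                    Map<String, Object> nestedDoc = new LinkedHashMap<>();
                    nestedDoc.put("_id", id);                       // same _id as the root
                    nestedDoc.put("_type", "__" + entry.getKey());  // marks the doc as nested
                    for (Map.Entry<?, ?> field : ((Map<?, ?>) item).entrySet()) {
                        nestedDoc.put(entry.getKey() + "." + field.getKey(), field.getValue());
                    }
                    block.add(nestedDoc);
                }
            } else {
                root.put(entry.getKey(), entry.getValue());
            }
        }
        block.add(root);
        return block;
    }

    /** Mimics a delete by the _id term: removes every document carrying that id, returns the count. */
    static int deleteById(List<Map<String, Object>> index, String id) {
        int before = index.size();
        index.removeIf(doc -> id.equals(doc.get("_id")));
        return before - index.size();
    }

    public static void main(String[] args) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("title", "es");
        source.put("comments", List.of(Map.of("user", "a"), Map.of("user", "b")));
        List<Map<String, Object>> index = new ArrayList<>(toBlock("1", source));
        index.forEach(System.out::println);
        System.out.println("deleted: " + deleteById(index, "1"));
    }
}
```

One source object with two nested comments becomes three documents here, and one delete by `_id` removes all three, which is exactly the property the English comment in `nestedContext` is protecting.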

2.4.3 Data processing

The entry point that fills in each field is org.elasticsearch.index.mapper.DocumentParser#innerParseObject
It is a recursive call, so the flow is somewhat hard to follow.
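
The recursion is easier to follow on a toy model. The sketch below is an assumption-laden illustration, not the logic of `innerParseObject` itself: it only shows how a recursive walk over plain object values (no nested, no arrays) yields one flat document whose field names are dotted paths. `ObjectFlattenSketch` and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ObjectFlattenSketch {

    /** Flattens arbitrarily deep object values into one map keyed by dotted path. */
    static Map<String, Object> flatten(Map<String, Object> source) {
        Map<String, Object> out = new LinkedHashMap<>();
        walk("", source, out);
        return out;
    }

    @SuppressWarnings("unchecked")
    private static void walk(String prefix, Map<String, Object> object, Map<String, Object> out) {
        for (Map.Entry<String, Object> entry : object.entrySet()) {
            String path = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
            if (entry.getValue() instanceof Map) {
                // An object value: recurse, staying in the same output document
                walk(path, (Map<String, Object>) entry.getValue(), out);
            } else {
                // A concrete value: emit it under its full path
                out.put(path, entry.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> name = new LinkedHashMap<>();
        name.put("first", "a");
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("name", name);
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("user", user);
        source.put("age", 3);
        System.out.println(flatten(source));
    }
}
```

The difference from the nested case is that the recursion never starts a new document: every level writes into the same output.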

2.5 Post-processing: setting _version and similar fields

The handling of _version is shown below.
The entry point is org.elasticsearch.index.mapper.VersionFieldMapper#postParse; see the implementation there for details.

    @Override
    public void postParse(ParseContext context) {
        // In the case of nested docs, let's fill nested docs with version=1 so that Lucene doesn't write a Bitset for documents
        // that don't have the field. This is consistent with the default value for efficiency.
        Field version = context.version();
        assert version != null;
        for (Document doc : context.nonRootDocuments()) {
            // Add the _version field to this document
            doc.add(version);
        }
    }

Only _version is shown here as an example; the other fields are handled similarly.
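
One detail of the loop above is that the same `Field` object is added to every non-root document, rather than a copy per document. A minimal sketch of that behaviour follows; `VersionField` and `Doc` are hypothetical stand-ins for the Lucene and ES types, and treating index 0 as the root document is an assumption of this model.

```java
import java.util.ArrayList;
import java.util.List;

final class SharedVersionSketch {

    /** Hypothetical mutable field, standing in for a Lucene numeric field. */
    static final class VersionField {
        long value;

        VersionField(long value) {
            this.value = value;
        }
    }

    /** Hypothetical document: just a list of fields. */
    static final class Doc {
        final List<VersionField> fields = new ArrayList<>();
    }

    /** Mirrors the loop in postParse: docs.get(0) is treated as the root and skipped. */
    static void postParse(VersionField version, List<Doc> docs) {
        for (int i = 1; i < docs.size(); i++) {
            docs.get(i).fields.add(version);   // same instance, not a copy
        }
    }

    public static void main(String[] args) {
        VersionField version = new VersionField(1);
        List<Doc> docs = List.of(new Doc(), new Doc(), new Doc());
        docs.get(0).fields.add(version);       // the root got its field earlier, during pre-processing
        postParse(version, docs);
        version.value = 7;                     // a later update to the shared instance
        for (Doc doc : docs) {
            System.out.println(doc.fields.get(0).value);
        }
    }
}
```

Because the instance is shared, a value assigned to it later is visible from every document in the block, so root and nested documents cannot drift apart.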

3 Summary

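ES is more than a clustered version of Lucene; it also supports rich data types such as nested and object. How does it support them?
Answer: ES flattens them. The fields of an object are parsed into the enclosing document, while each nested object is turned into a separate Lucene document.
ES also has fields such as _id and _version. How are these stored?
Answer: _id, _version and the like are added as Lucene fields of their own on the document, during the preParse and postParse steps.
I have heard that ES stores the parent doc and its nested docs separately. When they are read back, what links them together?
Answer: they are linked through the ID of the root doc, which every nested doc also carries.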