Enable parsing columns from file path for Broker Load (#1582) #1635
imay merged 14 commits into apache:master
Conversation
gensrc/thrift/PlanNodes.thrift
Outdated
    // total size of the file
    8: optional i64 file_size
    // columns parsed from file path
    9: optional list<string> columns_from_path
It's better to record the slot offset for columns_from_path, or to name it num_of_columns_from_file. Because we may add other columns_from_xxx later, we should make the layout definite.
You should also add a comment that columns_from_path comes after the columns read from the file.
be/src/exec/broker_scanner.cpp
Outdated
        str_slot->len = value.size;
    }

inline void BrokerScanner::fill_slots_of_columns_from_path(int start, const std::vector<SlotDescriptor*>& src_slot_descs, Tuple* tuple) {
Why not put this function in BaseScanner, to avoid writing it twice?
Also, if you have num_columns_from_file, this function doesn't need the start param.
be/src/exec/parquet_reader.cpp
Outdated
    } else {
        time_t timestamp = (time_t)((int64_t)ts_array->Value(_current_line_of_group) * 24 * 60 * 60);
        tm* local;
        local = localtime(&timestamp);
Use localtime_r, which is thread-safe.
      _parquet_column_ids.clear();
-     for (auto slot_desc : tuple_slot_descs) {
+     for (int i = 0; i < _num_of_columns_from_file; i++) {
+         auto slot_desc = tuple_slot_descs.at(i);
Suggested change:
- auto slot_desc = tuple_slot_descs.at(i);
+ auto slot_desc = tuple_slot_descs[i];
    }
    RETURN_IF_ERROR(_cur_file_reader->read(_src_tuple, _src_slot_descs, tuple_pool, &_cur_file_eof));
    // range of current file
    const TBrokerRangeDesc& range = _ranges.at(_next_range - 1);
Suggested change:
- const TBrokerRangeDesc& range = _ranges.at(_next_range - 1);
+ const TBrokerRangeDesc& range = _ranges[_next_range - 1];
    }
}
// columnsFromPath
if (Catalog.getCurrentCatalogJournalVersion() >= FeMetaVersion.VERSION_59) {
I think there is no need to persist columnsFromPath, because Load already persists the whole load statement and will regenerate the DataDescriptor on restart.
if (dataDescription.getColumnNames() != null) {
    assignColumnNames.addAll(dataDescription.getColumnNames());
}
if (dataDescription.getColumnsFromPath() != null) {
If the user specifies columnsFromPath without a columnList, I think we should return an error message to the user in the LoadStatement analyze function.
Here, we should check dataDescription.getColumnsFromPath() inside the if (dataDescription.getColumnNames() != null) block.
    fail();
}

path = "/path/to/dir/k2==v2=//k1=v1//xxx.csv";
I think you should add a directory test case like '/path/to/dir/k2==v2=//k1=v1/', which should return false.
be/src/exec/base_scanner.cpp
Outdated
}

void BaseScanner::fill_slots_of_columns_from_path(int start, const std::vector<std::string>& columns_from_path) {
    if (start <= 0) {
I think this check is useless.

I think this check is useless.
But we should skip the case of StreamLoadTask.
This patch makes it possible to parse columns from the file path, as in Spark's Partition Discovery, e.g. extracting column k1 from the file path /path/to/dir/k1=1/xxx.csv. Currently, we do not support parsing encoded/compressed column values in the file path.
The partition columns are parsed in BrokerScanNode.java, and the parsing result for each file path is saved as a property of TBrokerRangeDesc, so the broker reader on the BE side can read the value of the specified partition column.
(I'm sorry to create a new PR for this issue; I'm not familiar with git rebase.)