[feature](inverted index) add custom analyzer support with pinyin tokenzer and pinyin filter #57097

Merged
airborne12 merged 1 commit into apache:master from Ryan19929:custom-analysis-pinyin on Nov 5, 2025

@Ryan19929
Contributor

Ryan19929 commented Oct 17, 2025

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

apache/doris-website#3067

Adds a pinyin tokenizer and a pinyin token filter for use in custom analyzers:

  • tokenizer example
    CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS pinyin_tokenizer
    PROPERTIES
    (
        "type" = "pinyin",
        "keep_separate_first_letter" = "false",
        "keep_full_pinyin" = "true",
        "keep_original" = "true",
        "limit_first_letter_length" = "16",
        "lowercase" = "true",
        "remove_duplicated_term" = "true"
    );
  • filter example
    CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS pinyin_filter1
    PROPERTIES
    (
        "type" = "pinyin",
        "keep_separate_first_letter" = "false",
        "keep_full_pinyin" = "true",
        "keep_original" = "true",
        "limit_first_letter_length" = "16",
        "lowercase" = "true",
        "remove_duplicated_term" = "true"
    );
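
The two definitions above only register components. A minimal sketch of how they might be wired into a custom analyzer and checked follows. It assumes the existing `CREATE INVERTED INDEX ANALYZER` statement and the `TOKENIZE` function accept these components by name, and that `standard` is available as a built-in tokenizer. The analyzer names `pinyin_analyzer` and `pinyin_filter_analyzer` and the sample string are hypothetical and not part of this PR.

```sql
-- Sketch only: analyzer names are hypothetical, not defined by this PR.

-- Analyzer driven by the pinyin tokenizer defined above.
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS pinyin_analyzer
PROPERTIES
(
    "tokenizer" = "pinyin_tokenizer"
);

-- Analyzer that splits text with a built-in tokenizer (assumed: "standard"),
-- then passes each token through the pinyin filter defined above.
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS pinyin_filter_analyzer
PROPERTIES
(
    "tokenizer" = "standard",
    "token_filter" = "pinyin_filter1"
);

-- Inspect the emitted terms before attaching an analyzer to an index.
SELECT TOKENIZE('刘德华', '"analyzer"="pinyin_analyzer"');
```

Comparing the `TOKENIZE` output of both analyzers on the same input is a quick way to decide whether the tokenizer or the filter variant fits a given column.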

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Oct 17, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@github-actions
Contributor

Possible file(s) that should be tracked in LFS detected: 🚨

The following file(s) exceeds the file size limit: 1048576 bytes, as set in the .yml configuration files:

  • be/dict/pinyin/polyphone.txt

Consider using git-lfs to manage large files.

github-actions bot added the lfs-detected! label (Warning Label for use when LFS is detected in the commits of a Pull Request) on Oct 17, 2025
Ryan19929 force-pushed the custom-analysis-pinyin branch from cb1f6a1 to 47f090c on October 17, 2025 07:08
Ryan19929 force-pushed the custom-analysis-pinyin branch from 47f090c to d630eaf on October 17, 2025 07:27
Ryan19929 changed the title from "[feat](inverted index) add custom analyzer support pinyin tokenizer and filter" to "[feat](inverted index) custom analyzer support pinyin tokenizer and filter" on Oct 17, 2025
@Ryan19929
Contributor Author

run buildall

Ryan19929 changed the title from "[feat](inverted index) custom analyzer support pinyin tokenizer and filter" to "[feat](inverted index) add custom analyzer support with pinyin tokenzer and pinyin filter" on Oct 17, 2025
@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/116) 🎉
Increment coverage report
Complete coverage report

@doris-robot

ClickBench: Total hot run time: 31.89 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit d630eafc175a8ccf14d26bc0b79082c60235bf9d, data reload: false

query1	0.07	0.06	0.05
query2	0.11	0.06	0.07
query3	0.26	0.10	0.10
query4	1.63	0.12	0.13
query5	0.29	0.28	0.28
query6	1.20	0.68	0.70
query7	0.03	0.03	0.03
query8	0.07	0.05	0.05
query9	0.67	0.59	0.57
query10	0.63	0.63	0.64
query11	0.19	0.14	0.13
query12	0.18	0.15	0.14
query13	0.65	0.64	0.64
query14	1.08	1.08	1.05
query15	0.93	0.90	0.98
query16	0.46	0.42	0.42
query17	1.12	1.26	1.16
query18	0.24	0.22	0.22
query19	2.02	1.93	1.89
query20	0.02	0.02	0.01
query21	15.37	1.09	0.68
query22	0.79	1.29	0.78
query23	14.74	1.60	0.71
query24	7.46	1.57	0.45
query25	0.29	0.10	0.10
query26	0.70	0.20	0.17
query27	0.08	0.06	0.07
query28	8.81	1.46	0.96
query29	12.57	4.68	3.76
query30	0.30	0.16	0.14
query31	2.84	0.67	0.43
query32	3.25	0.60	0.51
query33	3.10	3.22	3.16
query34	16.28	5.59	4.92
query35	4.96	5.03	4.98
query36	0.72	0.54	0.52
query37	0.12	0.09	0.08
query38	0.08	0.06	0.06
query39	0.04	0.03	0.04
query40	0.19	0.16	0.16
query41	0.10	0.04	0.03
query42	0.04	0.03	0.04
query43	0.05	0.05	0.04
Total cold run time: 104.73 s
Total hot run time: 31.89 s

@Ryan19929
Contributor Author

run beut

Ryan19929 force-pushed the custom-analysis-pinyin branch from d630eaf to ad1802b on October 21, 2025 06:05
@Ryan19929
Contributor Author

run buildall

@doris-robot

TPC-DS: Total hot run time: 190545 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ad1802bf107637253b8880fb39d3a655fe39baf5, data reload: false

query1	1019	483	411	411
query2	6574	1716	1657	1657
query3	6755	223	222	222
query4	26529	23472	23317	23317
query5	5236	605	523	523
query6	343	275	234	234
query7	4663	503	308	308
query8	317	271	259	259
query9	8732	2594	2566	2566
query10	555	341	293	293
query11	15298	15072	14965	14965
query12	186	118	117	117
query13	1698	582	456	456
query14	11899	9262	9190	9190
query15	226	192	185	185
query16	7715	677	498	498
query17	1593	754	661	661
query18	2197	457	381	381
query19	243	224	181	181
query20	149	137	134	134
query21	267	148	127	127
query22	4712	4719	4905	4719
query23	35265	33986	33871	33871
query24	8460	2608	2484	2484
query25	595	510	467	467
query26	905	280	186	186
query27	2898	564	362	362
query28	4365	2233	2183	2183
query29	735	656	507	507
query30	308	232	205	205
query31	969	815	812	812
query32	85	80	71	71
query33	597	401	353	353
query34	815	1052	551	551
query35	816	874	782	782
query36	963	1043	916	916
query37	129	114	86	86
query38	3493	3570	3536	3536
query39	1489	1442	1420	1420
query40	234	124	117	117
query41	65	59	59	59
query42	124	110	107	107
query43	499	502	465	465
query44	1225	756	729	729
query45	187	178	175	175
query46	874	989	649	649
query47	1777	1776	1715	1715
query48	414	419	309	309
query49	735	489	429	429
query50	635	691	404	404
query51	4014	3984	3926	3926
query52	116	113	104	104
query53	252	272	188	188
query54	597	601	524	524
query55	85	89	96	89
query56	342	312	312	312
query57	1180	1176	1137	1137
query58	300	288	286	286
query59	2518	2593	2529	2529
query60	344	351	332	332
query61	177	160	171	160
query62	848	742	662	662
query63	226	191	191	191
query64	3766	1272	947	947
query65	4047	4004	3946	3946
query66	1062	445	346	346
query67	15318	14968	14791	14791
query68	9115	906	599	599
query69	508	321	290	290
query70	1347	1295	1226	1226
query71	494	341	321	321
query72	6041	4975	5109	4975
query73	726	584	362	362
query74	9088	9257	8997	8997
query75	4506	3353	2812	2812
query76	3713	1152	726	726
query77	834	403	318	318
query78	9556	9798	9020	9020
query79	2707	819	581	581
query80	713	571	495	495
query81	493	261	225	225
query82	462	163	129	129
query83	299	268	248	248
query84	302	113	95	95
query85	884	461	426	426
query86	346	294	287	287
query87	3755	3745	3667	3667
query88	3101	2238	2217	2217
query89	421	328	300	300
query90	2043	241	215	215
query91	174	176	133	133
query92	90	75	64	64
query93	1212	991	644	644
query94	692	416	347	347
query95	405	328	320	320
query96	492	583	278	278
query97	2978	2975	2882	2882
query98	291	213	210	210
query99	1488	1396	1312	1312
Total cold run time: 281030 ms
Total hot run time: 190545 ms

@doris-robot

ClickBench: Total hot run time: 28.85 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ad1802bf107637253b8880fb39d3a655fe39baf5, data reload: false

query1	0.06	0.05	0.05
query2	0.09	0.05	0.05
query3	0.26	0.09	0.08
query4	1.60	0.11	0.11
query5	0.27	0.27	0.25
query6	1.20	0.65	0.65
query7	0.03	0.02	0.03
query8	0.06	0.05	0.05
query9	0.64	0.52	0.52
query10	0.58	0.58	0.59
query11	0.17	0.12	0.12
query12	0.15	0.13	0.12
query13	0.65	0.64	0.61
query14	1.04	1.02	1.02
query15	0.87	0.85	0.85
query16	0.41	0.40	0.39
query17	1.04	1.06	1.06
query18	0.22	0.19	0.20
query19	1.95	1.86	1.83
query20	0.02	0.02	0.01
query21	15.42	0.20	0.13
query22	5.08	0.07	0.06
query23	15.67	0.26	0.10
query24	2.76	1.32	0.89
query25	0.08	0.07	0.06
query26	0.14	0.13	0.13
query27	0.06	0.06	0.06
query28	4.96	1.15	0.93
query29	12.57	3.94	3.25
query30	0.29	0.14	0.13
query31	2.82	0.60	0.40
query32	3.24	0.56	0.48
query33	3.00	3.17	3.11
query34	16.05	5.53	4.80
query35	4.90	4.90	4.93
query36	0.71	0.51	0.50
query37	0.10	0.07	0.06
query38	0.06	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.16	0.15
query41	0.09	0.03	0.03
query42	0.04	0.03	0.02
query43	0.05	0.03	0.03
Total cold run time: 99.62 s
Total hot run time: 28.85 s

@doris-robot

BE UT Coverage Report

Increment line coverage 72.46% (1139/1572) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.64% (17998/34190)
Line Coverage 38.00% (163623/430605)
Region Coverage 33.12% (127525/385023)
Branch Coverage 33.84% (54802/161930)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 73.09% (1149/1572) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.45% (23953/33523)
Line Coverage 57.97% (249459/430357)
Region Coverage 53.52% (208721/389981)
Branch Coverage 54.82% (89266/162825)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 68.10% (79/116) 🎉
Increment coverage report
Complete coverage report

@Ryan19929
Contributor Author

run p0

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 73.09% (1149/1572) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.43% (23944/33523)
Line Coverage 57.93% (249294/430357)
Region Coverage 53.45% (208440/389981)
Branch Coverage 54.79% (89211/162825)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 68.10% (79/116) 🎉
Increment coverage report
Complete coverage report

@Ryan19929
Contributor Author

run feut

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/116) 🎉
Increment coverage report
Complete coverage report

Ryan19929 force-pushed the custom-analysis-pinyin branch from ad1802b to 04c6c00 on October 23, 2025 06:58
@Ryan19929
Contributor Author

run buildall

@doris-robot

ClickBench: Total hot run time: 28.9 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 04c6c00a578e6b5633dd74decf66b0454639106c, data reload: false

query1	0.06	0.05	0.06
query2	0.10	0.05	0.05
query3	0.25	0.09	0.08
query4	1.62	0.12	0.12
query5	0.29	0.28	0.27
query6	1.19	0.68	0.68
query7	0.03	0.03	0.04
query8	0.07	0.06	0.05
query9	0.64	0.57	0.55
query10	0.62	0.62	0.61
query11	0.18	0.13	0.13
query12	0.17	0.14	0.15
query13	0.64	0.62	0.61
query14	1.03	1.02	1.03
query15	0.88	0.88	0.93
query16	0.41	0.42	0.42
query17	1.11	1.16	1.16
query18	0.24	0.21	0.22
query19	2.00	1.85	1.94
query20	0.01	0.01	0.01
query21	15.38	0.19	0.14
query22	5.10	0.09	0.05
query23	15.64	0.30	0.11
query24	2.66	0.84	0.57
query25	0.09	0.08	0.07
query26	0.16	0.15	0.15
query27	0.07	0.06	0.05
query28	4.53	1.19	0.96
query29	12.57	4.49	3.72
query30	0.30	0.15	0.12
query31	2.84	0.63	0.40
query32	3.25	0.57	0.48
query33	3.32	3.09	3.23
query34	15.97	5.30	4.60
query35	4.64	4.65	4.54
query36	0.71	0.54	0.52
query37	0.11	0.07	0.07
query38	0.08	0.05	0.05
query39	0.04	0.04	0.04
query40	0.17	0.16	0.14
query41	0.10	0.05	0.04
query42	0.05	0.03	0.03
query43	0.05	0.05	0.04
Total cold run time: 99.37 s
Total hot run time: 28.9 s

@doris-robot

TPC-DS: Total hot run time: 190528 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f8096bcf3829bddcc921ea2efc42fa276740d569, data reload: false

query1	1076	408	398	398
query2	6575	1780	1702	1702
query3	6765	223	222	222
query4	26071	23864	23708	23708
query5	5642	629	463	463
query6	341	257	235	235
query7	4659	502	299	299
query8	323	268	253	253
query9	8701	2615	2616	2615
query10	546	353	291	291
query11	15577	15094	14809	14809
query12	185	122	115	115
query13	1699	571	437	437
query14	11422	9303	9309	9303
query15	206	195	172	172
query16	7708	700	528	528
query17	1629	804	658	658
query18	2104	510	385	385
query19	235	250	210	210
query20	148	152	139	139
query21	267	135	120	120
query22	4810	4981	4891	4891
query23	34591	33768	33913	33768
query24	8472	2503	2508	2503
query25	691	567	474	474
query26	1291	303	178	178
query27	3070	531	387	387
query28	4760	2253	2221	2221
query29	817	658	508	508
query30	311	260	220	220
query31	975	863	779	779
query32	77	84	66	66
query33	608	398	354	354
query34	796	846	513	513
query35	786	831	746	746
query36	990	1040	899	899
query37	132	105	81	81
query38	3532	3504	3451	3451
query39	1450	1411	1424	1411
query40	221	125	117	117
query41	60	58	58	58
query42	125	109	111	109
query43	486	482	468	468
query44	1230	743	737	737
query45	182	178	172	172
query46	880	981	643	643
query47	1747	1803	1693	1693
query48	380	421	320	320
query49	779	529	431	431
query50	648	685	407	407
query51	4065	3899	3867	3867
query52	108	111	101	101
query53	239	273	194	194
query54	302	293	275	275
query55	94	87	84	84
query56	335	340	310	310
query57	1158	1179	1092	1092
query58	287	284	275	275
query59	2537	2680	2579	2579
query60	342	336	342	336
query61	158	159	162	159
query62	780	730	671	671
query63	233	198	195	195
query64	4614	1294	976	976
query65	4036	3986	3996	3986
query66	1065	446	350	350
query67	15090	15248	15026	15026
query68	8473	867	602	602
query69	517	337	302	302
query70	1313	1344	1297	1297
query71	495	355	329	329
query72	6174	4918	4991	4918
query73	663	609	362	362
query74	8918	8847	8711	8711
query75	3601	3395	2744	2744
query76	3418	1153	740	740
query77	738	406	312	312
query78	9642	10270	8867	8867
query79	2286	835	599	599
query80	650	588	513	513
query81	520	262	229	229
query82	281	157	130	130
query83	271	270	251	251
query84	259	113	93	93
query85	879	489	442	442
query86	361	307	292	292
query87	3666	3813	3660	3660
query88	3867	2310	2261	2261
query89	395	336	300	300
query90	2027	222	221	221
query91	166	162	132	132
query92	79	69	63	63
query93	1742	1005	642	642
query94	691	430	324	324
query95	399	328	323	323
query96	488	586	292	292
query97	2909	2969	2909	2909
query98	238	221	211	211
query99	1359	1395	1314	1314
Total cold run time: 279704 ms
Total hot run time: 190528 ms

@doris-robot

ClickBench: Total hot run time: 27.66 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f8096bcf3829bddcc921ea2efc42fa276740d569, data reload: false

query1	0.06	0.04	0.05
query2	0.10	0.05	0.05
query3	0.25	0.08	0.09
query4	1.60	0.11	0.12
query5	0.27	0.26	0.25
query6	1.15	0.64	0.65
query7	0.03	0.03	0.03
query8	0.05	0.05	0.04
query9	0.60	0.53	0.51
query10	0.58	0.57	0.57
query11	0.17	0.11	0.11
query12	0.15	0.12	0.12
query13	0.62	0.60	0.61
query14	1.00	1.01	0.99
query15	0.85	0.84	0.83
query16	0.39	0.41	0.41
query17	1.00	1.02	1.00
query18	0.22	0.20	0.19
query19	1.89	1.78	1.79
query20	0.02	0.01	0.01
query21	15.41	0.18	0.12
query22	5.07	0.07	0.04
query23	15.67	0.26	0.10
query24	2.67	0.76	0.62
query25	0.08	0.06	0.06
query26	0.14	0.14	0.15
query27	0.07	0.05	0.05
query28	3.88	1.14	0.93
query29	12.59	3.91	3.28
query30	0.28	0.14	0.11
query31	2.82	0.60	0.39
query32	3.23	0.55	0.47
query33	3.08	3.03	3.03
query34	15.86	5.17	4.56
query35	4.50	4.60	4.55
query36	0.67	0.50	0.49
query37	0.09	0.07	0.07
query38	0.06	0.05	0.03
query39	0.04	0.03	0.03
query40	0.18	0.15	0.14
query41	0.08	0.04	0.03
query42	0.04	0.04	0.03
query43	0.04	0.03	0.04
Total cold run time: 97.55 s
Total hot run time: 27.66 s

@doris-robot

BE UT Coverage Report

Increment line coverage 83.16% (1274/1532) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.81% (18195/34453)
Line Coverage 38.24% (165913/433834)
Region Coverage 33.25% (129173/388499)
Branch Coverage 34.03% (55524/163178)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 83.68% (1282/1532) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.00% (24727/33872)
Line Coverage 60.14% (261346/434579)
Region Coverage 55.72% (219604/394095)
Branch Coverage 57.24% (94042/164287)

@Ryan19929
Contributor Author

run external

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 68.10% (79/116) 🎉
Increment coverage report
Complete coverage report

Contributor

zzzxl1993 left a comment

LGTM

@github-actions
Contributor

github-actions bot commented Nov 5, 2025

PR approved by anyone and no changes requested.

Member

airborne12 left a comment

LGTM

airborne12 changed the title from "[feat](inverted index) add custom analyzer support with pinyin tokenzer and pinyin filter" to "[feature](inverted index) add custom analyzer support with pinyin tokenzer and pinyin filter" on Nov 5, 2025
github-actions bot added the approved label (Indicates a PR has been approved by one committer.) on Nov 5, 2025
@github-actions
Contributor

github-actions bot commented Nov 5, 2025

PR approved by at least one committer and no changes requested.

airborne12 merged commit 94fe6aa into apache:master on Nov 5, 2025
28 of 30 checks passed
github-actions bot pushed a commit that referenced this pull request Nov 5, 2025
…enzer and pinyin filter (#57097)
github-actions bot pushed a commit that referenced this pull request Nov 5, 2025
…enzer and pinyin filter (#57097)
Ryan19929 deleted the custom-analysis-pinyin branch on November 5, 2025 09:07
yiguolei pushed a commit that referenced this pull request Nov 8, 2025
…h pinyin tokenzer and pinyin filter #57097 (#57729)

Cherry-picked from #57097

Co-authored-by: Ryan19929 <43268112+Ryan19929@users.noreply.github.com>
wyxxxcat pushed a commit to wyxxxcat/doris that referenced this pull request Nov 18, 2025
…enzer and pinyin filter (apache#57097)
yiguolei mentioned this pull request on Dec 2, 2025
Ryan19929 added a commit to Ryan19929/doris that referenced this pull request Dec 24, 2025
…enzer and pinyin filter (apache#57097)
Labels

  • approved (Indicates a PR has been approved by one committer.)
  • dev/3.1.x
  • dev/3.1.x-conflict
  • dev/4.0.2-merged
  • lfs-detected! (Warning Label for use when LFS is detected in the commits of a Pull Request)
  • reviewed

8 participants