Implementation of the semi-structured inference model in our ACL 2023 paper: INFOSYNC: Information Synchronization across Multilingual Semi-structured Tables. To explore the dataset online, visit the project page.
@inproceedings{khincha-etal-2023-infosync,
title = "{I}nfo{S}ync: Information Synchronization across Multilingual Semi-structured Tables",
author = "Khincha, Siddharth and
Jain, Chelsi and
Gupta, Vivek and
Kataria, Tushar and
Zhang, Shuo",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.159",
pages = "2536--2559",
abstract = "Information Synchronization of semi-structured data across languages is challenging. For example, Wikipedia tables in one language need to be synchronized with others. To address this problem, we introduce a new dataset InfoSync and a two-step method for tabular synchronization. InfoSync contains 100K entity-centric tables (Wikipedia Infoboxes) across 14 languages, of which a subset ({\textasciitilde}3.5K pairs) are manually annotated. The proposed method includes 1) Information Alignment to map rows and 2) Information Update for updating missing/outdated information for aligned tables across multilingual tables. When evaluated on InfoSync, information alignment achieves an F1 score of 87.91 (en {\textless}-{\textgreater} non-en). To evaluate information updation, we perform human-assisted Wikipedia edits on Infoboxes for 532 table pairs. Our approach obtains an acceptance rate of 77.28{\%} on Wikipedia, showing the effectiveness of the proposed method.",
}
Below are the details about the INFOSYNC datasets and scripts for reproducing the results reported in the ACL 2023 paper.
The code requires Python 3.6+.
Clone this repository on your machine: https://github.com/Info-Sync/InfoSync.git
Install the requirements with the following command:
pip install -r requirements.txt
Download and unpack the INFOSYNC datasets into ./data in the main InfoSync folder.
Carefully read the LICENSE and the Datasheet before any non-academic usage.
After downloading, you will have multiple sub-folders containing several csv/html/json files. The first row of each csv file in the sub-folders is a header. The folders are laid out as follows:
data
│
├── tables
│ ├── json # contains json data for all the categories. Files are in html format for easier understanding of the data
│ └── html # data scraped for all the categories
│
├── final_test_set # test set, built using a semi-automated pipeline. Annotated by humans using translations
│ ├── Final_Test_Set_Eng_X # contains files for Eng_X for all categories
│ ├── Final_Test_Set_X_Y # contains files for X_Y for all categories
│ ├── Final_Test_Set_Eng_X.json # json file with all annotations for Eng_X
│ └── Final_Test_Set_X_Y.json # json file with all annotations for X_Y
│
│
├── true_test_set # true-test-set, annotated by native speakers of Hindi and Chinese without using translations
│ ├── True_Test_Set_HI # contains files for Eng_HI for all categories
│ ├── True_Test_Set_ZH # contains files for Eng_ZH for all categories
│ ├── True_Test_Set_HI.json # json file with all annotations for Eng_HI
│ └── True_Test_Set_ZH.json # json file with all annotations for Eng_ZH
│
├── metadata # Human annotators also classify the errors present in the test data into one of five categories: 1) Disambiguation 2) Multiple alignments 3) Partial or incorrect extraction 4) Wrong_translations 5) Key Paraphrasing. This evaluation helps standardize and compare update methods against each other.
│ ├── Metadata_Eng_X
│ ├── Metadata_X_Y
│ ├── Metadata_Eng_X.json
│ └── Metadata_X_Y.json
│
│
├── updation_data # this folder contains the json files that are used to execute the update algorithm
│ ├── Gold.json # "Gold.json" comprises alignments sourced from the final_test_set that have undergone meticulous human annotation.
│ ├── Live.json # "Live.json" is generated by executing our alignment pipeline on the live-updates dataset. This file captures the alignments produced by our automated process when applied to real-time data updates.
│ └── Model.json # "Model.json" contains alignments derived from the final_test_set prior to undergoing any human annotation. These alignments originate directly from our automated alignment model.
│
│
├── csv_data # csv data for all categories, with links for different wikipedia pages
│
└── LICENSE, Datasheet, README.md, logo #license,datasheet,dataset readme, logo files.
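Because the first row of each csv file is a header, the files can be read into dictionaries keyed by column name. The following is a minimal sketch using only the standard library; the helper name read_csv_rows is an assumption made for illustration and is not part of the repository's scripts.

```python
import csv
from pathlib import Path


def read_csv_rows(path):
    """Read a csv file whose first row is a header into a list of dicts.

    Hypothetical helper: not part of the InfoSync scripts.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        # DictReader consumes the first row and uses it as the column names.
        return list(csv.DictReader(handle))
```

Calling read_csv_rows on a file under data/csv_data/ returns one dictionary per data row, keyed by whatever column names that file's header defines.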
data/csv_data/ and data/tables/ are the primary dataset folders to work with here.
Preprocessing is separated into the following steps.
First, scrape the infoboxes from all the Wikipedia pages and store them as html. The scripts assume the data is downloaded and unpacked into data/tables/html.
cd scripts/data_collection/
python3 infoboxextractor.py
python3 csvtohtml.py
Second, convert the html data into json format (key-value pairs from the infobox). The script assumes the data is downloaded and unpacked into data/tables/json.
python3 htmltojson.py
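To illustrate what "key-value pairs from the infobox" means, the sketch below pairs each header cell (th) with the data cell (td) in the same table row. It is an assumption-laden illustration built on the standard library's html.parser, not the logic of htmltojson.py; the names InfoboxParser and infobox_to_dict are hypothetical, and nested tables or multi-column rows are not handled.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect key-value pairs from <th>/<td> cells that share a table row.

    Hypothetical sketch: the repository's htmltojson.py may work differently.
    """

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._cell = None  # "th" or "td" while inside a cell, otherwise None
        self._key = []
        self._value = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: forget the text gathered for the previous row.
            self._key, self._value = [], []
        elif tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None
        elif tag == "tr":
            # Normalise whitespace, then keep rows that have both parts.
            key = " ".join("".join(self._key).split())
            value = " ".join("".join(self._value).split())
            if key and value:
                self.pairs[key] = value

    def handle_data(self, data):
        if self._cell == "th":
            self._key.append(data)
        elif self._cell == "td":
            self._value.append(data)


def infobox_to_dict(html):
    """Return the key-value pairs found in an infobox html string."""
    parser = InfoboxParser()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

The resulting dictionary can be written out with json.dump; passing ensure_ascii=False keeps non-English keys and values readable in the output file.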
The tables for all 14 languages are then translated to English:
cd scripts/data_collection/
python3 spacy_install.py
python3 preprocessing.py
python3 mbart_translate.py
python3 marian_translate.py
A final_translations.html is created in each language folder.
To create the dictionaries:
cd scripts/data_collection/
python3 dictionary_creation.py
data/tables/, data/final_test_set/, data/true_test_set/ and data/metadata/ are the primary dataset folders to work with here.
To run the script, you must set the correct values in the configuration file (described below).
The script can be run using the command
python3 alignment.py -cnf alignment_config.ini
The output is a json file with the alignments stored.
The alignment script can be run in 2 modes.
Pair mode
Dataset mode
In pair mode, the alignment model is run on a specific table pair. In Dataset mode, the alignment model is run on an entire dataset.
In pair mode, you need to specify the following in the config:
Category
Table Name (as in dataset)
Language 1
Language 2
File to write the output to
In dataset mode:
Language Type (Eng_X or X_Y)
File to write the output to
Only one of these two sections needs to be filled in, depending on the mode selected in the [running] section.
The [data] and [alignment_params] sections also need to be filled with the relevant information.
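As a rough illustration of how a mode-dependent configuration can be read, the sketch below parses an .ini text with Python's configparser and returns only the section that matches the selected mode. The section name [running] comes from this README; the [pair] and [dataset] section names, every key name, and every value are assumptions made for illustration, so check alignment_config.ini in the repository for the real ones.

```python
import configparser

# Hypothetical config text. Only the [running] section name is taken from
# this README; the [pair]/[dataset] names, all keys and all values are
# illustrative assumptions, not the contents of alignment_config.ini.
EXAMPLE_CONFIG = """
[running]
mode = pair

[pair]
category = <category>
table_name = <table name as in dataset>
language_1 = en
language_2 = hi
output_file = pair_alignments.json

[dataset]
language_type = Eng_X
output_file = dataset_alignments.json
"""


def load_mode_settings(config_text):
    """Return the selected mode and the settings of its matching section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    mode = parser.get("running", "mode")
    if mode not in ("pair", "dataset"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Only the section for the selected mode is read; the other is ignored.
    return mode, dict(parser[mode])
```

With the example text above, load_mode_settings(EXAMPLE_CONFIG) selects pair mode and ignores the [dataset] section, which mirrors the rule that only one of the two mode sections has to be filled in.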
If ablations are set to True in the [running] section, setting any parameter in the [ablations] section to True ablates that parameter from the model.
If metric is set to True in the [running] section and a path to a relevant test set is provided in the [metric] section, the script prints the corresponding Precision and Recall scores.
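For reference, precision and recall over alignments can be computed as in the sketch below. It assumes each alignment is represented as a pair of row keys (one from each table); that representation and the function name precision_recall are assumptions for illustration, and the format used by alignment.py and the test set json files may differ.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted alignments against gold alignments.

    Both arguments are iterables of (key_in_table_1, key_in_table_2) pairs.
    This pair representation is an assumption, not the repository's format.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predicted alignments that are in the gold set.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold alignments that were predicted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

An empty prediction or gold set yields 0.0 for the affected score instead of raising a division error.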
data/updation_data/ is the primary dataset folder to work with here.
python align_update/updation.py
The script prints the update results.