InfoSync

Implementation of the semi-structured inference model in our ACL 2023 paper: INFOSYNC: Information Synchronization across Multilingual Semi-structured Tables. To explore the dataset online, visit the project page.

@inproceedings{khincha-etal-2023-infosync,
    title = "{I}nfo{S}ync: Information Synchronization across Multilingual Semi-structured Tables",
    author = "Khincha, Siddharth  and
      Jain, Chelsi  and
      Gupta, Vivek  and
      Kataria, Tushar  and
      Zhang, Shuo",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.159",
    pages = "2536--2559",
    abstract = "Information Synchronization of semi-structured data across languages is challenging. For example, Wikipedia tables in one language need to be synchronized with others. To address this problem, we introduce a new dataset InfoSync and a two-step method for tabular synchronization. InfoSync contains 100K entity-centric tables (Wikipedia Infoboxes) across 14 languages, of which a subset ({\textasciitilde}3.5K pairs) are manually annotated. The proposed method includes 1) Information Alignment to map rows and 2) Information Update for updating missing/outdated information for aligned tables across multilingual tables. When evaluated on InfoSync, information alignment achieves an F1 score of 87.91 (en {\textless}-{\textgreater} non-en). To evaluate information updation, we perform human-assisted Wikipedia edits on Infoboxes for 532 table pairs. Our approach obtains an acceptance rate of 77.28{\%} on Wikipedia, showing the effectiveness of the proposed method.",
}

Below are the details about the INFOSYNC datasets and scripts for reproducing the results reported in the ACL 2023 paper.

0. Prerequisites

The code requires Python 3.6+.

Clone this repository to your machine: https://github.com/Info-Sync/InfoSync.git

Install the requirements with the following command: pip install -r requirements.txt

Download and unpack the INFOSYNC datasets into ./data in the main InfoSync folder.

Carefully read the LICENSE and the Datasheet for non-academic usage.

After downloading, you will have multiple sub-folders containing csv/html/json files. The first row of each csv file in the sub-folders is a header row. The folder layout is:

data
│ 
├── tables
│   ├── json				    # contains json data for all the categories. Files are in html format for easier understanding of the data
│   └── html                                # data scraped for all the categories
│
├── final_test_set		            # test set, built using semi-automated pipeline. Annotated by humans using translations
│   ├── Final_Test_Set_Eng_X 		    # contains files for Eng_X for all categories
│   ├── Final_Test_Set_X_Y 	            # contains files for X_Y for all categories
│   ├── Final_Test_Set_Eng_X.json 	    # json file with all annotations for Eng_X
│   └── Final_Test_Set_X_Y.json 	    # json file with all annotations for X_Y
│
│
├── true_test_set			    # true-test-set, annotated by native speakers of Hindi and Chinese without using translations
│   ├── True_Test_Set_HI 		    # contains files for Eng_HI for all categories
│   ├── True_Test_Set_ZH 		    # contains files for Eng_ZH for all categories
│   ├── True_Test_Set_HI.json 	            # json file with all annotations for Eng_HI
│   └── True_Test_Set_ZH.json 		    # json file with all annotations for Eng_ZH
│
├── metadata 			            # Human annotators also classify the errors present in the test data into one of five categories: 1) Disambiguation 2) Multiple alignments 3) Partial or incorrect extraction 4) Wrong_translations 5) Key Paraphrasing. This evaluation helps standardize and compare update methods against each other.
│   ├── Metadata_Eng_X
│   ├── Metadata_X_Y
│   ├── Metadata_Eng_X.json
│   └── Metadata_X_Y.json
│
│
├── updation_data		    # this folder contains the json files that are finally used to execute the update algorithm
│   ├── Gold.json                   # alignments sourced from the final_test_set that have undergone human annotation
│   ├── Live.json                   # alignments generated by running our alignment pipeline on the live-updates dataset, i.e. on real-time data updates
│   └── Model.json                  # alignments derived from the final_test_set before any human annotation, taken directly from our automated alignment model
│
│
├── csv_data				    # csv data for all categories, with links for different wikipedia pages
│
└── LICENSE, Datasheet, README.md, logo	    # license, datasheet, dataset readme, and logo files
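
As a quick sanity check after unpacking, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name missing_subfolders is hypothetical, and it assumes the dataset was unpacked into ./data with the top-level sub-folder names shown in the tree.

```python
from pathlib import Path

# Top-level sub-folders named in the layout above.
EXPECTED_SUBFOLDERS = [
    "tables",
    "final_test_set",
    "true_test_set",
    "metadata",
    "updation_data",
    "csv_data",
]


def missing_subfolders(data_root):
    """Return the expected sub-folders that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "./data" is the unpack location suggested in the prerequisites.
    missing = missing_subfolders("./data")
    if missing:
        print("Missing sub-folders:", ", ".join(missing))
    else:
        print("All expected sub-folders are present.")
```

An empty result means every expected sub-folder was found; anything else usually points to an incomplete download or a different unpack location.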

1. Dataset Construction

data/csv_data/ and data/tables/ are the primary dataset folders for this step.

1.1 Collection

Preprocessing is split into the following steps.

First, scrape the infoboxes from all the Wikipedia pages and store them as html. This assumes the data is downloaded and unpacked into data/tables/html.

cd scripts/data_collection/
python3 infoboxextractor.py
python3 csvtohtml.py

Second, convert the html data into json format (key-value pairs from the infobox). This assumes the data is downloaded and unpacked into data/tables/json.

python3 htmltojson.py
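
To illustrate what "key-value pairs from the infobox" means, here is a minimal sketch using only the Python standard library. It is not the code in htmltojson.py: the class and function names are hypothetical, and it assumes each infobox row is a <tr> holding one <th> (the key) and one <td> (the value), which real infoboxes do not always follow.

```python
import json
from html.parser import HTMLParser


class InfoboxRowParser(HTMLParser):
    """Collect key/value text from each <tr> that has both a <th> and a <td>."""

    def __init__(self):
        super().__init__()
        self.rows = {}
        self._cell = None  # "th" or "td" while inside such a cell
        self._key_parts = []
        self._value_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: forget the text collected for the previous one.
            self._key_parts, self._value_parts = [], []
        elif tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None
        elif tag == "tr":
            key = " ".join(self._key_parts).strip()
            value = " ".join(self._value_parts).strip()
            if key and value:  # skip title-only or image-only rows
                self.rows[key] = value

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._cell == "th":
            self._key_parts.append(text)
        elif self._cell == "td":
            self._value_parts.append(text)


def infobox_html_to_dict(html):
    """Parse infobox HTML into a dict of key-value pairs."""
    parser = InfoboxRowParser()
    parser.feed(html)
    return parser.rows


if __name__ == "__main__":
    sample = (
        "<table><tr><th>Born</th><td>1 January 1900</td></tr>"
        "<tr><th>Occupation</th><td><a href='#'>Writer</a></td></tr></table>"
    )
    print(json.dumps(infobox_html_to_dict(sample), ensure_ascii=False, indent=2))
```

Rows without both a key and a value are dropped, so section titles and image rows do not end up in the output.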

1.2 Translation

The tables in all 14 languages are translated to English:

cd scripts/data_collection/
python3 spacy_install.py
python3 preprocessing.py
python3 mbart_translate.py
python3 marian_translate.py

A final_translations.html is created in each language folder.

1.3 Dictionary

To create the dictionaries:

cd scripts/data_collection/
python3 dictionary_creation.py

2. Table Alignment

data/tables/, data/final_test_set/, data/true_test_set/ and data/metadata/ are the primary dataset folders for this step. To run the script, set the correct values in the configuration file (described below). The script can be run using the command:

python3 alignment.py -cnf alignment_config.ini

The output is a json file containing the alignments.

Alignment Configuration (alignment_config.ini)

The alignment script can be run in 2 modes.

Pair mode
Dataset mode

In pair mode, the alignment model is run on a specific table pair. In dataset mode, the alignment model is run on an entire dataset.

In pair mode, you need to specify the following in the config:

Category
Table Name (as in the dataset)
Language 1
Language 2
File to write the output to

In dataset mode:

Language Type (Eng_X or X_Y)
File to write the output to

Only one of these two sections needs to be filled in, depending on the mode selected in the [running] section.

The [data] and [alignment_params] sections also need to be filled in with the relevant information.

If ablations are set to True in the [running] section, setting any parameter in the [ablations] section to True ablates that parameter from the model.
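
The structure described so far can be sketched with Python's configparser. Only the section names [running], [data], [alignment_params] and [ablations] come from this README; the [pair] and [dataset] section names, every key, and every value below are hypothetical placeholders, so check alignment_config.ini in the repository for the real ones.

```python
import configparser

# [running], [data], [alignment_params] and [ablations] are section names from
# this README. The [pair] and [dataset] section names, and every key and value
# below, are hypothetical placeholders.
EXAMPLE_CONFIG = """
[running]
mode = pair
ablations = False
metric = False

[pair]
category = example_category
table_name = example_table
language_1 = en
language_2 = hi
output_file = pair_alignment.json

[dataset]
language_type = Eng_X
output_file = dataset_alignment.json

[data]
tables_path = data/tables/json

[alignment_params]
example_threshold = 0.5

[ablations]
example_module = False
"""


def load_config(config_text):
    """Parse INI text into a ConfigParser object."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    return config


def selected_mode_settings(config):
    """Return (mode, settings) for the mode chosen in [running]."""
    mode = config.get("running", "mode")
    if mode not in ("pair", "dataset"):
        raise ValueError(f"unknown mode: {mode!r}")
    return mode, dict(config[mode])


def ablated_parameters(config):
    """Names set to True in [ablations], or [] when ablations are disabled."""
    if not config.getboolean("running", "ablations"):
        return []
    return [name for name in config["ablations"]
            if config.getboolean("ablations", name)]


if __name__ == "__main__":
    cfg = load_config(EXAMPLE_CONFIG)
    print(selected_mode_settings(cfg))
    print(ablated_parameters(cfg))
```

In this sketch only the section matching the selected mode is read, which mirrors the rule that just one of the two mode sections needs to be filled in.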

If metric is set to True in the [running] section, providing a path to a relevant test set in the [metric] section prints the corresponding Precision and Recall scores.
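
For reference, precision and recall over row alignments can be computed as below. This is a minimal sketch of the standard set-based definitions, not necessarily how alignment.py scores its output; it assumes each alignment is represented as a (row key in table 1, row key in table 2) pair, and the function name and sample keys are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 over aligned row pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical row keys, for illustration only.
    gold = {("Born", "row_a"), ("Occupation", "row_b"), ("Spouse", "row_c")}
    predicted = {("Born", "row_a"), ("Occupation", "row_c")}
    p, r, f1 = precision_recall_f1(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here one of the two predicted pairs is correct and one of the three gold pairs is recovered, so precision is 1/2 and recall is 1/3.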

3. Table Updation

data/updation_data/ is the primary dataset folder for this step.

python align_update/updation.py

The script prints the update results.
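
To give a rough idea of how alignments feed an update step, the sketch below lists rows that exist in one table but have no aligned counterpart in the other, which makes them candidates for being added. It is an illustration only and not the algorithm in align_update/updation.py: the function name, the dict-based table representation, and the sample rows are all assumptions, and a real update would also need to translate the values.

```python
def propose_missing_rows(source_table, target_table, alignments):
    """Return source rows with no aligned row in the target table.

    source_table, target_table: dicts mapping row key -> value.
    alignments: iterable of (source_key, target_key) pairs.
    """
    aligned_source_keys = {
        src for src, tgt in alignments
        if src in source_table and tgt in target_table
    }
    return {
        key: value
        for key, value in source_table.items()
        if key not in aligned_source_keys
    }


if __name__ == "__main__":
    # Hypothetical tables, for illustration only.
    english = {"Born": "1 January 1900", "Died": "2 February 1980"}
    spanish = {"Nacimiento": "1 de enero de 1900"}
    alignments = [("Born", "Nacimiento")]
    print(propose_missing_rows(english, spanish, alignments))
```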
