This repository contains versions of automatically generated datasets for abstractive and extractive query-based multi-document summarization, as described in the AQuaMuSe paper.
High-level Notes:
- Dependencies: Document URLs reference the Common Crawl June 2017 archive.
- Data Format:
  - Directory structure:
    - Each dataset release will have two top-level folders: `abstractive` and `extractive`.
    - Each top-level folder contains three sub-folders for `train`, `dev` and `test` examples.
  - File format: TFRecords.
  - Fields:
    - `query`: Input query to be used as summarization context. This is a single-valued `byte_list` feature, derived from Natural Questions user queries.
    - `input_urls`: List of URLs to the input documents to be summarized, pointing to Common Crawl. URLs are separated by the special token `<EOD>`.
    - `target`: Summarization target, derived from Natural Questions long answers.
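
As a rough illustration of the `input_urls` field described above, the following sketch splits one field value into individual URLs. It assumes the value has already been read from a record as a single UTF-8 byte string with URLs joined by `<EOD>`. The helper name `split_input_urls` and the example URLs are hypothetical and not part of the dataset or its tooling.

```python
from typing import List

# Separator token named in the field description.
EOD_SEPARATOR = "<EOD>"


def split_input_urls(raw_value: bytes) -> List[str]:
    """Split an `input_urls` value into a list of URLs.

    Assumes `raw_value` is a UTF-8 byte string in which URLs are
    joined by the `<EOD>` token (an assumption based on the field
    description, not a verified property of every record).
    """
    text = raw_value.decode("utf-8")
    # Drop empty pieces so a leading or trailing separator
    # does not produce an empty URL.
    return [part.strip() for part in text.split(EOD_SEPARATOR) if part.strip()]


if __name__ == "__main__":
    # Hypothetical value for illustration; not taken from the dataset.
    example = b"http://example.com/a<EOD>http://example.com/b<EOD>"
    for url in split_input_urls(example):
        print(url)
```

Reading the records themselves would typically go through a TFRecord reader such as TensorFlow's `tf.data.TFRecordDataset`. The exact feature specification needed to parse each example is not stated here and would have to be checked against the released files.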
This is not an official Google product.