The following list shows corpora and datasets close to the core research topics of the CSS group. More data can be found on the web page of our associated group, Webis.
Argument Snippets Dataset
A dataset of 100 arguments retrieved from args.me for the most frequently queried topics, with snippets extracted by two human experts (Alshomary et al., 2020b). [163kb]
The args.me corpus comprises 387 606 arguments crawled from four debate portals in mid-2019: Debatewise, IDebate.org, Debatepedia, and Debate.org. The arguments were extracted using heuristics designed for each debate portal. [849mb]
Arg-Microtexts Synthesis Benchmark
130 logos-oriented and 130 pathos-oriented benchmark arguments for 10 topics, manually synthesized by 26 experts based on a pool of argumentative discourse units from the Arg-Microtexts corpus. [4mb]
In case you publish any results related to this benchmark data, please cite our upcoming COLING 2018 paper on argumentation synthesis. [bib]
The ArguAna Counterargs Corpus
An English corpus for studying the retrieval of the best counterargument to an argument. It contains 6753 pairs of argument and best counterargument from the online debate portal idebate.org, along with different experiment files with up to millions of candidate pairs. [106mb]
In case you publish any results related to the ArguAna Counterargs corpus, please cite our upcoming ACL 2018 paper on counterarguments. [bib]
The Dagstuhl-15512 ArgQuality Corpus
An English corpus for studying the assessment of argumentation quality. It contains 320 online debate portal arguments, annotated for 15 different quality dimensions by three annotators. [zip v1 1mb] [zip v2 1mb]
In version 2, the annotated XMI files have been changed according to a new underlying type system in which each quality dimension is represented by its own annotation. This annotation contains not only the majority score of the respective dimension (as in version 1), but also the mean score and the scores of all annotators. We recommend using version 2.
In case you publish any results related to the Dagstuhl-15512 ArgQuality corpus, please cite our EACL 2017 paper on argumentation quality. [pdf] [bib]
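To illustrate what a version 2 annotation of the Dagstuhl-15512 ArgQuality corpus carries per quality dimension (majority score, mean score, and the scores of all annotators), here is a minimal Python sketch. It does not read the XMI files; the class, the field names, the dimension label, the score values, and the tie-breaking rule (median when all three annotators disagree) are assumptions made for illustration, not the corpus's actual type system.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class DimensionAnnotation:
    """Hypothetical stand-in for one quality-dimension annotation (version 2 style)."""
    dimension: str
    annotator_scores: list   # one score per annotator
    majority_score: int
    mean_score: float


def summarize(dimension, scores):
    """Derive majority and mean score from the individual annotator scores."""
    value, count = Counter(scores).most_common(1)[0]
    # Assumption: if no score is given at least twice, use the median instead.
    majority = value if count > 1 else int(median(scores))
    return DimensionAnnotation(dimension, list(scores), majority, mean(scores))


# Illustrative scores from three annotators for one made-up dimension label.
ann = summarize("example_dimension", [2, 3, 2])
print(ann.majority_score, round(ann.mean_score, 2), ann.annotator_scores)
```

Keeping the individual scores next to the aggregates makes it possible to study annotator disagreement, which the majority score alone hides.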
The Webis-ArgRank-17 Dataset
An English benchmark dataset for studying argument relevance. It contains 32 rankings as well as a ground-truth argument graph with more than 30,000 argument units. In addition, we provide the source code to reproduce our ranking experiments based on the dataset. [zip 13mb]
In case you publish any results related to the Webis-ArgRank-17 dataset, please cite our EACL 2017 paper on argument relevance. [pdf] [bib]
The Webis-Editorials-16 Corpus
An English corpus with 300 news editorials from three online news portals, annotated for the types of all argumentative discourse units. [zip 5mb]
In case you publish any results related to the Webis-Editorials-16 corpus, please cite our COLING 2016 paper on argumentation strategies. [pdf] [bib]
The ArguAna TripAdvisor Corpus
An English corpus for studying local sentiment flows and aspect-based sentiment analysis. It contains 2100 hotel reviews balanced with respect to the reviews’ sentiment scores. All reviews are segmented into subsentence-level statements, each of which has been manually classified as a fact, a positive opinion, or a negative opinion. In addition, all hotel aspects mentioned in the reviews have been annotated as such. [zip v1 with software 10mb] [zip v2 8mb]
In addition, we provide nearly 200k further hotel reviews without manual annotations. [v1 upon request] [zip v2 265mb]
The corpus is free to use for scientific purposes, but not for commercial applications. In version 2, the annotated XMI files have been changed according to a new underlying type system that is easier to extend. Note that some adaptations of the version 1 software are necessary to make it work with version 2.
In case you publish any results related to the ArguAna TripAdvisor corpus, please cite our CICLing 2014 paper. [pdf] [bib]