Data Science Junior Research Group

Welcome to the Data Science Junior Research Group

The Data Science Junior Research Group (DS-JRG) investigates research questions in a data-driven manner. Among the main research topics of the group are explainable artificial intelligence and causal artificial intelligence for knowledge graphs. Past research focused on vandalism detection in knowledge graphs.

Click here for more details on courses and publications.

Research

The following sections give an overview of our research topics.

Explainable Artificial Intelligence (XAI)


Explainable Artificial Intelligence deals with the explanation of machine learning models. Members of the Data Science Junior Research Group have contributed in particular to the following areas of XAI research.

Explainable White-Box Models

Our research focuses on learning concepts in description logics from positive and negative example nodes in a knowledge graph. The learned concepts serve as explainable, white-box models that can make new predictions in a transparent way. Concepts can be learned with neural networks (ESWC 2022, ESWC 2023, ECML 2023, IJCAI 2024) or with evolutionary algorithms and random walks (WWW 2022, CIKM 2023).
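To make the setting concrete, the following minimal sketch shows how a single candidate concept can be scored against positive and negative example nodes. It is an illustration only, not the group's implementation: the toy graph, the helper names (`atomic`, `exists`, `conj`, `f1_score`), and the choice of F1 as the quality measure are all assumptions, and the search over candidate concepts (neural, evolutionary, or random-walk based) is not shown.

```python
# Hypothetical toy knowledge graph (not taken from any real dataset).
# types: node -> set of class names; edges: (node, role) -> set of successor nodes.
types = {
    "alice": {"Person"},
    "bob": {"Person"},
    "carol": {"Person"},
    "paper1": {"Publication"},
    "paper2": {"Publication"},
}
edges = {
    ("alice", "authorOf"): {"paper1"},
    ("bob", "authorOf"): {"paper2"},
}


def atomic(name):
    """Atomic concept: holds for nodes typed with the class `name`."""
    return lambda node: name in types.get(node, set())


def exists(role, filler):
    """Existential restriction: holds for nodes with a `role`-successor in `filler`."""
    return lambda node: any(filler(succ) for succ in edges.get((node, role), set()))


def conj(left, right):
    """Conjunction: holds for nodes that satisfy both concepts."""
    return lambda node: left(node) and right(node)


def f1_score(concept, positives, negatives):
    """F1 of a concept used as a classifier for the given example nodes."""
    tp = sum(1 for node in positives if concept(node))
    fp = sum(1 for node in negatives if concept(node))
    fn = len(positives) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


positives = {"alice", "bob"}      # nodes the concept should cover
negatives = {"carol", "paper1"}   # nodes the concept should exclude

# Candidate concept: Person AND (exists authorOf.Publication)
candidate = conj(atomic("Person"), exists("authorOf", atomic("Publication")))

print(f1_score(atomic("Person"), positives, negatives))
print(f1_score(candidate, positives, negatives))
```

In this toy setting the more specific candidate separates the examples better than the plain class `Person`, and because it is a readable class expression, it also documents why a node is predicted as positive.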

Explainable Black-Box Models

We study ways to explain the predictions of graph neural networks. For example, we explain graph neural networks with concepts in description logics (CIKM 2024).

Causal Artificial Intelligence (CAI)

Causality Graphs

Causal knowledge is seen as one of the key ingredients to advance artificial intelligence. Yet, few knowledge bases comprise causal knowledge to date. To close this gap, we compiled CauseNet (CIKM 2020), a large-scale knowledge base of claimed causal relations between causal concepts. It contains more than 11 million causal relations extracted from the web.

Causal Question Answering Dataset

At least 5% of questions submitted to search engines ask about cause-effect relationships in some way. To support the development of tailored approaches that can answer such questions, we constructed CausalQA, a benchmark corpus of 1.1 million causal questions with answers (COLING 2022).

Causal Question Answering with Reinforcement Learning

Many current approaches to causal question answering cannot provide explanations or evidence for their answers. We therefore aim to answer causal questions with a causality graph. As a first step, inspired by recent successful applications of reinforcement learning to knowledge graph tasks such as link prediction and fact-checking, we answer binary causal questions by means of reinforcement learning (WWW 2024). Our evaluation shows that the reinforcement learning agent prunes the search space by over 99% compared to a naive breadth-first search. The paths returned by our agent explain the mechanisms by which a cause produces an effect. Moreover, for each edge on a path, our causality graph provides its original source, allowing for easy verification of paths.
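For intuition, the naive breadth-first search baseline mentioned above can be sketched as follows. This is a hedged illustration, not the published system: the toy graph, the source labels, the function name `find_causal_path`, and the `max_hops` limit are assumptions made for the example. The reinforcement learning agent itself is not shown; it would replace the exhaustive neighbor expansion with a learned policy that follows only promising edges.

```python
from collections import deque

# Hypothetical toy causality graph: cause -> list of (effect, source label) pairs.
# The source labels stand in for the provenance stored with each edge.
causality_graph = {
    "smoking": [("tar deposits", "source-1"), ("stress relief", "source-2")],
    "tar deposits": [("lung damage", "source-3")],
    "lung damage": [("lung cancer", "source-4")],
    "rain": [("wet streets", "source-5")],
}


def find_causal_path(graph, cause, effect, max_hops=3):
    """Answer the binary question "does `cause` cause `effect`?" by BFS.

    Returns (path, expanded): `path` is a list of (cause, effect, source)
    edges, or None if no path within `max_hops` exists; `expanded` counts
    how many nodes were expanded during the search.
    """
    queue = deque([(cause, [])])
    visited = {cause}
    expanded = 0
    while queue:
        node, path = queue.popleft()
        if node == effect:
            return path, expanded
        if len(path) == max_hops:
            continue  # do not search beyond the hop limit
        expanded += 1
        for successor, source in graph.get(node, []):
            if successor not in visited:
                visited.add(successor)
                queue.append((successor, path + [(node, successor, source)]))
    return None, expanded


path, expanded = find_causal_path(causality_graph, "smoking", "lung cancer")
if path is not None:
    for cause, effect, source in path:
        print(f"{cause} -> {effect} (evidence: {source})")
print("nodes expanded:", expanded)
```

A returned path answers the question with "yes" and doubles as the explanation, since every edge carries its source. The count of expanded nodes is the quantity a learned agent would aim to reduce relative to this baseline.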

Vandalism Detection

Vandalism Corpus

We constructed the large-scale Wikidata Vandalism Corpus WDVC-2015, the first corpus for vandalism detection in knowledge bases (SIGIR 2015). Our corpus is based on the entire revision history of Wikidata, the knowledge base underlying Wikipedia.

Vandalism Detection

Wikidata is a large-scale knowledge base of the Wikimedia Foundation. Its knowledge is increasingly used within Wikipedia itself and in various other kinds of information systems, which imposes high demands on its integrity. Wikidata can be edited by anyone and, unfortunately, it frequently gets vandalized, exposing all information systems that use it to the risk of spreading vandalized and falsified information. We developed a new machine learning-based approach to detect vandalism in Wikidata (CIKM 2016). Moreover, we organized the WSDM Cup 2017 (WSDM 2017), a data science challenge with the task of vandalism detection in Wikidata.

Debiasing Vandalism Detection Models

Crowdsourced knowledge bases like Wikidata suffer from low-quality edits and vandalism, and they employ machine learning-based approaches to detect both kinds of damage. We reveal that state-of-the-art detection approaches discriminate against anonymous and new users: benign edits from these users receive much higher vandalism scores than benign edits from older users, causing newcomers to abandon the project prematurely. We address this problem for the first time by analyzing and measuring the sources of bias, and by developing a new vandalism detection model that avoids them (WWW 2019).
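One simple way to make such a bias visible is to compare the scores that a model assigns to benign edits across user groups. The sketch below illustrates the idea under stated assumptions: the edit records, the group names, and the score values are invented for the example, and the score gap is only one possible bias measure, not necessarily the one used in the paper.

```python
from statistics import mean

# Hypothetical scored edits: (user group, is_vandalism, model score in [0, 1]).
# All values are made up for illustration.
edits = [
    ("anonymous", False, 0.40),
    ("anonymous", False, 0.30),
    ("anonymous", True, 0.90),
    ("registered", False, 0.05),
    ("registered", False, 0.15),
    ("registered", True, 0.80),
]


def mean_benign_score(edits, group):
    """Average vandalism score that benign edits of `group` receive."""
    scores = [
        score
        for user_group, is_vandalism, score in edits
        if user_group == group and not is_vandalism
    ]
    return mean(scores)


# A positive gap means benign edits by anonymous users look more
# suspicious to the model than benign edits by registered users.
gap = mean_benign_score(edits, "anonymous") - mean_benign_score(edits, "registered")
print(f"benign score gap (anonymous - registered): {gap:.2f}")
```

Only benign edits enter the comparison, so a large gap cannot be explained by one group actually vandalizing more often; it reflects how the model treats harmless contributions from that group.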