Multilingual Question Generation with Flan-T5
Generate high-quality training data in English, German, and French using our fine-tuned Flan-T5 model, the perfect starting point for building your own multilingual QA systems.
Our cutting-edge approach to graph linearization enables you to train your KBQA model with ease. Transform your sub-graphs into textual form and unlock the full potential of your data.
Generate multilingual QA pairs from any KG using our novel graph isomorphism method, enabling seamless knowledge transfer across domains.
INFO
This project is part of the Master Thesis "Multilingual Question Generation from Knowledge Graphs" at the University of Zurich (UZH), Department of Informatics (IFI). The final version of the thesis is now available here: View Thesis. The project received a grade of 5.75, which is near the top of the Swiss grading scale (maximum: 6.0).
Nowadays, search engines answer a wide range of questions in various languages from large Knowledge Bases (KBs). In the field of Natural Language Processing (NLP), this process is known as Knowledge Base Question Answering (KBQA). Training a KBQA model effectively requires a significant amount of data. To obtain this training data, another model can be used to generate question-answer pairs, a method referred to as Knowledge Base Question Generation (KBQG).

This thesis demonstrates how to generate multilingual (i.e., English, German, and French), high-quality questions ranging from simple factoid questions to more complex questions that require verifying a statement, aggregating items, or drawing comparisons between two entities. The model used is Flan-T5, a state-of-the-art Large Language Model (LLM), combined with the high-quality KQA Pro dataset. To use this Sequence-to-Sequence (Seq2Seq) model, the sub-graphs of the Knowledge Graph (KG) are transformed into a textual form using a process known as graph linearization (see Equation 1). The proposed model shows promising results on existing KBQG datasets. For example, on the GrailQA dataset, the model achieves a score of 46.21% for BLEU-4, 59.36% for ROUGE-L, and 69.20% for METEOR; on the WebQuestions dataset, it scores 32.37% for BLEU-4, 57.72% for ROUGE-L, and 61.83% for METEOR.

Building on the KQA Pro dataset, this thesis further introduces a novel method for generating question-answer pairs from any given KG. This is accomplished by using graph isomorphism to sample a sub-graph with a structure similar to those in the KQA Pro dataset. Through knowledge transfer, the model can then generate high-quality, multilingual, complex questions from any given KG. The main limitation of the proposed model concerns aggregation: the model struggles to generate questions when the answer is only a numerical value, or when the provided answer is `False` or `0`, which may result in an empty graph as input. The overall method is outlined in Figure 1.
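The structure-matching step described above can be sketched in plain Python. This is a minimal illustration, not the thesis implementation: the toy KG triples, the chain-shaped pattern, and the function name are all assumptions standing in for matching the template shapes found in KQA Pro.

```python
# Toy KG as (subject, relation, object) triples -- illustrative data only,
# not taken from the thesis or from KQA Pro.
kg = [
    ("Zurich", "locatedIn", "Switzerland"),
    ("Switzerland", "capital", "Bern"),
    ("Einstein", "bornIn", "Ulm"),
]

def sample_chain_subgraphs(triples):
    """Find sub-graphs matching a simple two-hop chain pattern
    (x -r1-> y -r2-> z), a stand-in for sampling sub-graphs whose
    structure is isomorphic to a KQA Pro template."""
    matches = []
    for t1 in triples:
        for t2 in triples:
            # The object of the first triple must be the subject of the
            # second, and the two triples must be distinct.
            if t1[2] == t2[0] and t1 != t2:
                matches.append([t1, t2])
    return matches

chains = sample_chain_subgraphs(kg)
```

Each match is a candidate sub-graph that can then be linearized and fed to the Seq2Seq model; a real implementation would match arbitrary pattern graphs (e.g., via a VF2-style isomorphism check) rather than a hard-coded chain.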
Equation 1: Transformation of a directed graph into a unified textual form that can be used by a Seq2Seq model. The triples are shuffled before concatenation (
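The linearization step can be sketched as follows. This is a hedged illustration of the idea in Equation 1 (shuffle the triples, then concatenate them into one string); the separator tokens and function name are assumptions, not the exact format used in the thesis.

```python
import random

def linearize_graph(triples, seed=None):
    """Linearize (subject, relation, object) triples into a single string
    usable as Seq2Seq input. The triples are shuffled before concatenation,
    as in Equation 1. Separator tokens here are illustrative assumptions."""
    rng = random.Random(seed)
    triples = list(triples)
    rng.shuffle(triples)
    return " | ".join(f"{s} : {r} : {o}" for s, r, o in triples)

text = linearize_graph(
    [("Paris", "capitalOf", "France"), ("France", "inContinent", "Europe")],
    seed=0,
)
```

Shuffling discourages the model from memorizing a fixed triple order, so the generated questions depend on graph content rather than input position.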