  • Number of slides: 20

Finding Question-Answer Pairs from Online Forums
Gao Cong, Long Wang, Chin-Yew Lin, Young-In Song, Yueheng Sun
SIGIR'08
Speaker: Yi-Ling Tai
Date: 2009/02/09

OUTLINE
  • Introduction
  • Algorithms for question detection
  • Algorithms for answer detection
      • Preliminary
      • Graph-based propagation method
  • Experiments

INTRODUCTION
  • Online forums contain a huge amount of valuable user-generated content.
  • It is highly desirable to extract and reuse the human knowledge in user-generated content.
  • An investigation of 40 forums found that 90% of them contain question-answer knowledge.
  • This paper focuses on the problem of extracting question-answer pairs from forums.

INTRODUCTION
  • Mining question-answer pairs from forums has the following applications:
      • Enriching the knowledge base of community-based Question Answering (CQA) services.
      • Improving access to forum content by querying question-answer pairs extracted from forums.
      • Augmenting the knowledge base of a chatbot.

INTRODUCTION
  • Question-answer pairs embedded in forums are largely unstructured.
  • Question detection
      • Simple rules, such as question marks and 5W1H question words, are not adequate for forum data.
      • This paper proposes a classification method based on sequential patterns to detect questions.

INTRODUCTION
  • Answer detection
      • Multiple questions and answers may be discussed in parallel and are often interwoven.
      • Consider each candidate answer as an isolated document and the question as a query.
      • Model the relationship between answers to form a graph.
      • For each candidate answer, compute an initial score of being a true answer using a ranking method.

ALGORITHMS FOR QUESTION DETECTION
  • Question marks and 5W1H words are not adequate:
      • Questions can be phrased as imperative sentences, e.g. "I am wondering where I can buy cheap and good clothing in beijing."
      • Question marks are often omitted.
      • Short informal expressions such as "really?" should not be regarded as questions.
  • To complement these rules, this paper extracts labeled sequential patterns to identify question sentences.
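
To see why the simple rules fall short, here is a minimal sketch of such a rule applied to the two example sentences from this slide. The exact rule (a trailing question mark, or a 5W1H word in first position) and the function name `rule_based_is_question` are assumptions made for illustration; the slide does not spell out the baseline.

```python
# Assumed baseline: a sentence is a question if it ends with "?" or
# starts with a 5W1H word. This is an illustrative guess at the rule.
FIVE_W_ONE_H = ("what", "when", "where", "who", "why", "how")


def rule_based_is_question(sentence: str) -> bool:
    """Apply the assumed question-mark / 5W1H rule to one sentence."""
    text = sentence.strip().lower()
    if text.endswith("?"):
        return True
    words = text.split()
    return bool(words) and words[0] in FIVE_W_ONE_H


# A real question without a question mark is missed by the rule.
missed = rule_based_is_question(
    "I am wondering where I can buy cheap and good clothing in beijing."
)
# A short informal expression is wrongly accepted by the rule.
false_alarm = rule_based_is_question("really?")
print(missed, false_alarm)
```

Under this assumed rule the first sentence is rejected and "really?" is accepted, which are exactly the two failure modes listed on the slide.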

ALGORITHMS FOR QUESTION DETECTION
  • Labeled sequential patterns (LSPs)
      • Example: "i want to buy an office software and wonder which software company is best." → "wonder which ... is"
  • An LSP has the form LHS → c, where LHS is a sequence and c is a class label.
  • A sequence $s_1 = \langle a_1, \dots, a_m \rangle$ is contained in $s_2 = \langle b_1, \dots, b_n \rangle$ if there exist integers $1 \le i_1 < i_2 < \dots < i_m \le n$ such that $a_1 = b_{i_1}, \dots, a_m = b_{i_m}$; the distance between two adjacent items $b_{i_j}$ and $b_{i_{j+1}}$ in $s_2$ needs to be less than a threshold (we used 5).
  • An LSP $p_1$ is contained by $p_2$ if the sequence $p_1.\mathrm{LHS}$ is contained by $p_2.\mathrm{LHS}$ and $p_1.c = p_2.c$.
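
The containment test can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `is_contained` is hypothetical, and "distance" is assumed to mean the difference between token positions. The function tracks every position where the pattern prefix can end, so the gap limit is checked against all possible matches rather than only the first one.

```python
from typing import Sequence


def is_contained(pattern: Sequence[str], sentence: Sequence[str], max_gap: int = 5) -> bool:
    """Return True if `pattern` is a subsequence of `sentence` in which
    adjacent matched items are fewer than `max_gap` positions apart."""
    if not pattern:
        return True
    # Positions where the first pattern item can be matched.
    ends = {i for i, tok in enumerate(sentence) if tok == pattern[0]}
    for item in pattern[1:]:
        # Positions where the next item matches within the allowed gap.
        ends = {
            j
            for j, tok in enumerate(sentence)
            if tok == item and any(0 < j - i < max_gap for i in ends)
        }
        if not ends:
            return False
    return bool(ends)


tokens = "i want to buy an office software and wonder which software company is best".split()
print(is_contained(["wonder", "which", "is"], tokens))  # matches within the gap limit
print(is_contained(["want", "wonder"], tokens))         # items are 7 positions apart
```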

ALGORITHMS FOR QUESTION DETECTION
  • To mine LSPs, each sentence is pre-processed by:
      • applying a Part-Of-Speech (POS) tagger (MXPOST Toolkit);
      • keeping keywords, including 5W1H words, modal words, "wonder", "any", etc.; other words are replaced by their POS tags.
  • Example: "where can I find a job" → "where can PRP VB DT NN"
  • The combination of POS tags and keywords allows us to capture representative features for question sentences by mining LSPs.
  • Example LSPs: "<...> → Q"; "<...> → Q"
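
The pre-processing step can be illustrated with a toy sketch. The slide names the MXPOST tagger; the tiny lookup table `TOY_POS` below is a hypothetical stand-in for a real tagger, and the keyword list is an assumed, partial one.

```python
# Assumed keyword list: 5W1H words, some modal words, "wonder", "any".
KEYWORDS = {
    "what", "when", "where", "who", "why", "how", "which",
    "can", "could", "should", "would", "will", "may",
    "wonder", "any",
}

# Hypothetical stand-in for a POS tagger: covers only the example sentence.
TOY_POS = {"i": "PRP", "find": "VB", "a": "DT", "job": "NN"}


def to_mining_sequence(sentence: str) -> list[str]:
    """Keep keywords as-is and replace every other word by its POS tag."""
    sequence = []
    for token in sentence.lower().split():
        if token in KEYWORDS:
            sequence.append(token)
        else:
            # Words unknown to the toy table get a placeholder tag.
            sequence.append(TOY_POS.get(token, "UNK"))
    return sequence


print(" ".join(to_mining_sequence("where can I find a job")))
```

For the example sentence this reproduces the sequence shown on the slide, "where can PRP VB DT NN".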

ALGORITHMS FOR QUESTION DETECTION
  • sup(p), the support of an LSP p, is the percentage of tuples in database D that contain p.
  • conf(p), the confidence of p, is $\mathrm{conf}(p) = \mathrm{sup}(p) / \mathrm{sup}(p.\mathrm{LHS})$.
  • Example: an LSP $p_1$ has support 66.7% and confidence 100%, and an LSP $p_2$ has support 66.7% and confidence 66.7%. $p_1$ is a better indication of class Q than $p_2$.
  • In our experiments, we empirically set minimum support at 0.5% and minimum confidence at 85%.
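
A small worked example may help with support and confidence. The three-tuple database below is made up for illustration and was chosen so that the results reproduce the 66.7% / 100% and 66.7% / 66.7% figures quoted on this slide; it is not claimed to be the database shown on the original slide. Confidence is computed as sup(p) / sup(p.LHS), the usual definition for classification rules, and the helper names are hypothetical.

```python
def contains(pattern, sequence, max_gap=5):
    """Gap-constrained subsequence test (adjacent matches < max_gap apart)."""
    ends = {i for i, x in enumerate(sequence) if x == pattern[0]}
    for item in pattern[1:]:
        ends = {j for j, x in enumerate(sequence)
                if x == item and any(0 < j - i < max_gap for i in ends)}
    return bool(ends)


def support(lhs, label, database):
    """Fraction of tuples that contain `lhs` and carry class `label`."""
    hits = sum(1 for seq, c in database if c == label and contains(lhs, seq))
    return hits / len(database)


def confidence(lhs, label, database):
    """sup(p) divided by the fraction of tuples containing `lhs` at all."""
    lhs_hits = sum(1 for seq, _ in database if contains(lhs, seq))
    return support(lhs, label, database) * len(database) / lhs_hits


# Made-up toy database of (sequence, class label) tuples.
D = [
    (["a", "d", "e", "f"], "Q"),
    (["a", "f", "e", "f"], "Q"),
    (["d", "a", "f"], "NQ"),
]

p1 = (["a", "e", "f"], "Q")
p2 = (["a", "f"], "Q")
for name, (lhs, label) in (("p1", p1), ("p2", p2)):
    print(name, round(support(lhs, label, D), 3), round(confidence(lhs, label, D), 3))
```

Both patterns occur in two of the three tuples labeled Q, but the left-hand side of p2 also occurs in the NQ tuple, which lowers its confidence.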

ALGORITHMS FOR ANSWER DETECTION
  • Input: a forum thread with the questions annotated.
  • Output: a list of ranked candidate answers for each question.
  • For each question, we assume its set of candidate answers to be the paragraphs in the posts that follow the question:
      • Paragraphs are usually good answer segments in forums.
      • The answers to a question usually appear in the posts after the post containing the question.
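
Candidate generation as described on this slide can be sketched in a few lines. The thread representation (a list of post strings in posting order) and the blank-line paragraph split are assumptions made for illustration; the function name `candidate_answers` is hypothetical.

```python
def candidate_answers(posts: list[str], question_post_index: int) -> list[str]:
    """Collect the paragraphs of all posts after the question's post."""
    candidates = []
    for post in posts[question_post_index + 1:]:
        # Assumption: paragraphs are separated by a blank line.
        for paragraph in post.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                candidates.append(paragraph)
    return candidates


thread = [
    "Where can I buy a SIM card?",
    "At the airport.\n\nAlso in most malls.",
    "Thanks!",
]
print(candidate_answers(thread, 0))
```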

ALGORITHMS FOR ANSWER DETECTION
  • Three IR methods to rank candidate answers:
      • Cosine similarity, where f(w, X) denotes the frequency of word w in X
      • Query likelihood language model
      • KL-divergence language model
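
As an illustration of the first ranking method, the sketch below scores candidate answers by the textbook cosine similarity over raw word frequencies f(w, X). Whitespace tokenization and the absence of any term weighting are simplifying assumptions; the slide does not give the exact formula used in the paper.

```python
import math
from collections import Counter


def cosine_similarity(question: str, answer: str) -> float:
    """Cosine of the angle between the word-frequency vectors of two texts."""
    fq = Counter(question.lower().split())
    fa = Counter(answer.lower().split())
    dot = sum(fq[w] * fa[w] for w in fq)
    norm = math.sqrt(sum(v * v for v in fq.values())) * math.sqrt(
        sum(v * v for v in fa.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(question: str, candidates: list[str]) -> list[str]:
    """Sort candidate answers by decreasing similarity to the question."""
    return sorted(candidates, key=lambda a: cosine_similarity(question, a), reverse=True)


question = "where can i buy cheap clothing in beijing"
candidates = ["the food was great", "you can buy cheap clothing at the silk market"]
print(rank_candidates(question, candidates)[0])
```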

ALGORITHMS FOR ANSWER DETECTION
  • The candidate answers for a question are not independent in forums.
  • Graph-based propagation method
      • Given a question q and the set A_q of its candidate answers, build a weighted directed graph, denoted (V, E).
      • Given two candidate answers a_0 and a_g, use KL(a_0 | a_g) to determine whether there will be an edge a_0 → a_g.
      • If 1 / (1 + KL(a_0 | a_g)) > µ, an edge is formed from a_0 to a_g.
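
The edge rule 1 / (1 + KL) > µ can be sketched as below. How the language models are estimated is not stated on the slide, so add-one smoothed unigram models over the shared vocabulary are an assumption here, as are the function names. The value stored for each edge is only the KL-based term, not a full edge weight.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over `vocab` (assumed smoothing)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def build_edges(answers, mu):
    """Return {(o, g): score} for ordered pairs with 1 / (1 + KL) > mu."""
    tokenized = [a.lower().split() for a in answers]
    vocab = sorted({w for toks in tokenized for w in toks})
    models = [unigram_model(toks, vocab) for toks in tokenized]
    edges = {}
    for o in range(len(answers)):
        for g in range(len(answers)):
            if o == g:
                continue
            score = 1.0 / (1.0 + kl_divergence(models[o], models[g]))
            if score > mu:
                edges[(o, g)] = score
    return edges


answers = [
    "take the subway to the airport",
    "take the subway to the airport",
    "try the dumplings near the hotel",
]
print(sorted(build_edges(answers, mu=0.9)))
```

With this toy input only the two identical answers are linked, in both directions, because their divergence is zero.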

ALGORITHMS FOR ANSWER DETECTION
  • Computing weight
      • The distance between a candidate answer and the question, denoted by d(q, a).
      • For the author of a candidate answer, the number of his or her replying posts and the number of threads he or she initiated.

ALGORITHMS FOR ANSWER DETECTION
  • Normalize weights as in the PageRank algorithm.
  • Computing propagated scores
      • Propagation without initial score
      • Propagation with initial score
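
The slide gives no update formulas, so the sketch below uses the textbook PageRank power iteration as a stand-in: outgoing edge weights are normalized per node, and an optional vector of initial scores replaces the uniform prior. The damping factor, the iteration count, the handling of nodes without outgoing edges, and the function name `propagate` are all assumptions, and the paper's two propagation variants may combine scores differently.

```python
def propagate(edges, n, initial=None, damping=0.85, iterations=50):
    """PageRank-style score propagation over a weighted directed graph.

    edges:   {(src, dst): weight} with node ids in range(n)
    initial: optional non-negative initial scores used as the prior
    """
    # Normalize outgoing weights so that each node distributes its full score.
    out_sum = [0.0] * n
    for (src, _), w in edges.items():
        out_sum[src] += w
    if initial is None:
        prior = [1.0 / n] * n
    else:
        total = sum(initial)
        prior = [s / total for s in initial]
    scores = prior[:]
    for _ in range(iterations):
        new = [(1.0 - damping) * prior[i] for i in range(n)]
        for (src, dst), w in edges.items():
            new[dst] += damping * scores[src] * w / out_sum[src]
        # Nodes without outgoing edges spread their score according to the prior.
        dangling = sum(scores[i] for i in range(n) if out_sum[i] == 0.0)
        for i in range(n):
            new[i] += damping * dangling * prior[i]
        scores = new
    return scores


edges = {(0, 2): 1.0, (1, 2): 1.0}
print(propagate(edges, 3))                     # uniform prior
print(propagate(edges, 3, initial=[3, 1, 1]))  # prior from initial scores
```

In the toy graph both other answers point to answer 2, so it ends up with the highest score; supplying initial scores shifts the balance between answers 0 and 1.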

EXPERIMENTS
  • Data
      • 1,212,153 threads from the TripAdvisor forum
      • 86,772 threads from the LonelyPlanet forum
      • 25,298 threads from the BootsnAll Network
  • From the TripAdvisor data, we randomly sampled 650 threads.
  • Two annotators were asked to tag questions and their answers in each thread.

EXPERIMENTS
  • Q-TUnion: a sentence was labeled as a question if it was marked as a question by either annotator.
  • Q-TInter: a sentence was labeled as a question if both annotators marked it as a question.
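
The two label sets amount to a set union and a set intersection over the annotators' marks. In the sketch below the sentence identifiers and the function name `merge_annotations` are hypothetical.

```python
def merge_annotations(annotator_a: set[str], annotator_b: set[str]) -> dict[str, set[str]]:
    """Build the union and intersection question sets from two annotators."""
    return {
        "Q-TUnion": annotator_a | annotator_b,  # marked by either annotator
        "Q-TInter": annotator_a & annotator_b,  # marked by both annotators
    }


# Hypothetical sentence ids marked as questions by each annotator.
marks_a = {"s1", "s2", "s5"}
marks_b = {"s2", "s3", "s5"}
merged = merge_annotations(marks_a, marks_b)
print(sorted(merged["Q-TUnion"]), sorted(merged["Q-TInter"]))
```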
