Скачать презентацию 1 Describe how the techniques described in ICS

360a8585ab63cc66a20c7c0cae7de250.ppt

• Количество слайдов: 21

1. Describe how the techniques described in ICS 207 could be used to spot students who cheat on take home tests. What enhancements would be necessary if instead of copying word for word, students replaced some words with synonyms? Discuss the precision/recall trade-off in this case. If the penalty for copying were severe e. g. , expulsion from the university vs. get no credit for copied work, would that effect how the system might work? 1

• Store each answer to each question. (To improve precision, create a separate index per question). Compute TF-IDF weights of terms – Treat each answer as query and see what answers are most similar using cosine similarity. – Return the N pairs most similar, or within a certain threshold. – (or cluster the documents) • Have the professor or TA judge the results. High recall catches all cheaters but has many false alarms. High precision returns few false alarms but lets more cheaters escape punishment. • Severe punishment may decrease the probability of cheating but also increases the importance of not having a false alarm. • Consider using LSA to deal with synonyms, or thesaurus, 2 but be careful. Consider using phrases of 2 -3 words.

2. Give an example of a stemming algorithm. Why are such techniques used as part of IR systems? Describe how the technique is included as part of the indexing, query interpretation and matching processes. Porter stemmer: finds root forms of terms (. *[AEIOU]. *)ED -> /1 // removed past tense –ed ending Stem documents before indexing Stem queries before retrieval Matching: No change 3

3. Why is the simple frequency of a index's occurrence in a document insufficient as a predictor of its utility? Give an example of a particular index and a particular corpus that demonstrates this. Give the formula for one term weighting scheme, and identify these two factors in the formula. 4

• Use alone, term frequency favors common words in the corpus. • Doesn’t take into account the discriminability of terms • Need a way to balance term frequency in the document with frequency in all documents such as term frequency divided by document frequency wkd=fkd / (Dk/NDoc) where: wkd = the weight for a keyword (k) in a document (d) fkd = the frequency of a keyword (k) in a document (d) NDoc = the total number of documents in the corpus Dk = the number of documents containing the keyword (k). • Probably should normailize for document length and use log of Document frequency • Not: Zipf isn’t document specific, so doesn’t really apply here. Any variant of TF-IDF would be fine 5

TF-IDF columbia-0. 422 shuttle-0. 26 gear-0. 235 wing-0. 225 plasma -0. 186 breach-0. 169 wheel-0. 156 compartment-0. 142 sensor-0. 139 tire-0. 134 temperature-0. 13 unusual-0. 107 hartsfield-0. 098 spacecraft-0. 095 huntsville-0. 095 superheat-0. 089 spikes-0. 087 Frequency: 4 the 3 left 3 in 2 wing 2 wheel 2 well 2 thursday 2 that 2 its 2 heat 2 had 2 from 2 columbia's 2 board 2 an 2 a 7

4. What is the “clustering hypothesis”? How can this hypothesis be exploited by an IR system? Van Rijsbergen’s cluster hypothesis, “closely associated documents tend to be relevant to the same request. ” The cluster hypothesis suggests that if we first compute a measure of distance X among all the documents, we could cluster those documents that seem to be close to one another or seem to about the same topic and retrieve them all in response to a query that matched any of them. Thanks to Jon Froehlich 8

5. How does a linear support vector machine differ from a perceptron in classification? Why is this difference important? Linear support vector machine find the optimal dividing line by maximizing the margin between the positive and negative examples Perceptron finds any separation. SVM has less variance with different training sets and initial conditions and the optimal solution is more often a better generalization than a random one. 9

10

6. Define precision and recall precisely. What foolproof method could an IR system use to achieve perfect recall? What would you expect the precision of this method to be? Retr = a set of retrieved documents Rel = a set of relevant documents N = number of documents Recall = | Retr Rel | / |Rel| Precision = | Retr Rel | / |Retr| Return all documents to get perfect recall. Precision of this method is |Rel|/N which can be close to 0 Note define terms in formulae 11

7. What is a reasonable heuristic for identifying the outstanding papers in an area of research with which you are unfamiliar? The documents with the most citations 12

8. Use any general web search system to determine whether it is safe to take the drugs Viagra and Phentermine at the same time. What query did you use at first? What was your final query? Why isn’t a general-purpose search engine ideally suited for this problem? Would relevance feedback help here? Would relevance feedback be effective if it only used positive feedback in this case? Lots of answers. “Viagra Phentermine” returns a billion online pharmacies, adding Safety, interaction, doesn’t help. Subtracting –sale, buy, -purchase, -order gets closer. Negative feedback could help rules these Just using one drug helps because there isn’t a interaction, and documents tend to list what is not safe instead of what is safe. “drug interactions” get you to specialized site with a database instead of a search engine. Its easier to find something that exists than to verify it doesn’t with search engine technology. 13

9. What problems does Latent Semantic Analysis address? Would this approach be better suited for queries of an encyclopedia or a newspaper? Why? It helps with synonyms such as “car & automobile” “rental and leasing” and homonyms such as “tire” It’s expensive to index but cheap to retrieve so better for collections that don’t change much, such as encyclopedia. 14

11. Pick a topic that interests you. Find an authority for that topic. Find a hub. Argue that the site is a hub or any authority based on the number of links to or from that site. Hint: in Google the query “link: http: //www. uci. edu” will find web sites that link to http: //www. uci. edu 15

Latent Semantic Analysis” in google. com and picked the top 7 links. I typed each of the 10 links in google. com and found out the number of links to it. Then I opened the 7 links and counted the links from it by counting the occurrence of string href=”http So I think http: //citeseer. nj. nec. com/deerwester 90 indexing. html is an authority for this topic. And http: //www. upmf-grenoble. fr/sciedu/blemaire/lsa. html is a hub Thanks to Qi Zhong 16

11 Would a rule-based learner or collaborative filtering be more appropriate for filtering your email? Why? Rule based: It learns from YOUR behavior Collaborative: Compares you to others, but you are the only one reading your mail, so can’t help filter. (It might do okay at SPAM since many people read that, but SPAM is in the eye of the beholder, some people do want to help smuggle \$17 M from Nigeria 17

12. I’m interested in going to the machine learning conference this year. I typed “Machine learning conference” in to Google and it returned the following result: . 1998 International Machine Learning Conference. ICML-2001. ICML 2002 - Sydney. ICML-2000. ECML/PKDD-2002. ICML-99. ECML/PKDD-2001 Home. ICML 2003. International Conferences on Machine Learning and Applications However, most people are more interested in the future conference than the past conference. Why is the 2003 conference ninth instead of first? a. More links to older conference: 172 vs 162. b. Possibly more important links to it 18

13. How could this happen, i. e. , why did google rank Microsoft first to the query “Go to hell”? (although this is no longer first “"Micro\$oft" gets you to Microsoft even though the home page obviously doesn’t contain the term "Micro\$oft" or “Go to hell”. Microsoft tops Google hell search rankings By John Lettice Posted: 18/09/2002 at 17: 30 GMT Weirdisms in Google's ranking system are sent to entertain us occasionally, and as we've had this one from a couple of people today, we suspect it's new. But don't blame us if it's not, we lead sheltered lives sometimes. Currently, if you type "go to hell" into Google, then Microsoft Corporation, Where do you want to go today? 19 comes first.

• Some people created web pages with the link text being “Go to Hell” and the destination microsoft. E. g. , Go To Hell • Microsoft has many links pointing to it, so its pagerank score is high • Google indexes terms in the link as well as terms in the page. 20

Final comments • This practice exam is a random sample of questions drawn from the distribution of possible exam questions. Topics not on this exam, but in the study guide for papers and study guide for the text can appear on the final exam. • Pay attention to instructions: USE THE SUBJECT indicated. “Re: Practice Exam for ICS 207" so I can find it easily” • How long did it take? • Final exam emailed at 6 am PST March 16, due 11: 59 PM March 17 and posted on web site. • Final report format email by March 9: 5 pages max 21