- Number of slides: 12
The Mobile Web is Structurally Different
Apoorva Jindal (USC), Ravi Jain (Google Inc), Chris Crutchfield (MIT), Samir Goel (Google Inc), Ravi Kolluri (Google Inc)
The Mobile Web?
- Web pages designed for consumption on mobile wireless devices
  - CHTML, XHTML, WML
- All other pages referred to as fixed web
- Becoming more important
  - Better devices
  - Better networks
  - Cheaper plans
- Different from fixed web?
  - Smaller pages
  - Fewer hyperlinks
  - Fewer images
Structurally?
- Web graph
  - pages ↔ nodes
  - hyperlinks ↔ edges
- Properties of this graph
  - In-degree distribution
  - Out-degree distribution
  - Strongly connected component size distribution
  - …
- Importance
  - Used in basic algorithms to implement search
    - Crawling
    - Ranking the web pages
  - Studied in detail for fixed web

INFOCOM 2008
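To make the graph view concrete, here is a minimal sketch of how in-degree and out-degree distributions could be computed from a list of hyperlinks. The function name `degree_distributions` and the toy edge list are illustrative assumptions; the slides do not describe the actual implementation, which ran on MapReduce-based tools.

```python
from collections import Counter

def degree_distributions(edges):
    """Return (in_hist, out_hist), each mapping degree -> number of nodes.

    `edges` is an iterable of (source_page, target_page) pairs.
    """
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    # Counter returns 0 for missing keys, so nodes with no in-links
    # (or no out-links) are counted under degree 0.
    in_hist = Counter(in_deg[n] for n in nodes)
    out_hist = Counter(out_deg[n] for n in nodes)
    return in_hist, out_hist

# Hypothetical toy graph: four hyperlinks among three pages.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
in_hist, out_hist = degree_distributions(edges)
avg_degree = len(edges) / 3  # edges per node
```

In this toy graph two pages have in-degree 1 and one page has in-degree 2; the out-degree histogram happens to have the same shape.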
Bow-tie Structure [Broder et al 2000]
- Model to describe the structure of the fixed web.
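In the bow-tie model, the SCC is the largest strongly connected component, IN holds the nodes that can reach the SCC but are not reachable from it, OUT holds the nodes reachable from the SCC that cannot reach it back, Tendrils hang off IN or OUT without passing through the SCC, and Disconnected nodes have no connection to the rest even when edge direction is ignored. The following is a minimal sketch of that decomposition on a small in-memory graph. It is not the COSIN tooling; the function names and the toy graph are assumptions, and the simple SCC search is quadratic in the worst case, so it is suitable only for illustration.

```python
from collections import defaultdict

def reach(start, adj):
    """All nodes reachable from any node in `start` via adjacency map `adj`."""
    seen = set(start)
    stack = list(seen)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bow_tie(edges):
    fwd, bwd, und = defaultdict(set), defaultdict(set), defaultdict(set)
    nodes = set()
    for u, v in edges:
        fwd[u].add(v)
        bwd[v].add(u)
        und[u].add(v)
        und[v].add(u)
        nodes.update((u, v))
    # The SCC containing n is (forward reach of n) ∩ (backward reach of n).
    scc, assigned = set(), set()
    for n in nodes:
        if n in assigned:
            continue
        comp = reach([n], fwd) & reach([n], bwd)
        assigned |= comp
        if len(comp) > len(scc):
            scc = comp
    in_set = reach(scc, bwd) - scc
    out_set = reach(scc, fwd) - scc
    weak = reach(scc, und)  # weakly connected component around the SCC
    return {
        "SCC": scc,
        "IN": in_set,
        "OUT": out_set,
        "TENDRILS": weak - scc - in_set - out_set,  # tubes are lumped in here
        "DISCONNECTED": nodes - weak,
    }

# Hypothetical toy graph: a 3-cycle core with one node of each other kind.
edges = [("a", "b"), ("b", "c"), ("c", "a"),  # SCC
         ("i", "a"),                          # IN
         ("c", "o"),                          # OUT
         ("i", "t"),                          # tendril hanging off IN
         ("x", "y")]                          # disconnected pair
parts = bow_tie(edges)
```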
Methodology
- Google’s mobile web index, June 2007
  - CHTML
  - XHTML + WML
- Webbase 2001
- Google’s fixed web index, July 2007
  - Collapse all pages in a domain to one node
  - Use tools based on MapReduce
- In-degree & out-degree distributions
  - Tools based on MapReduce
  - Use [Clauset et al 2006] to infer the power-law coefficient
- Determine bow-tie structure properties
  - Use COSIN tools [Donato et al 2004]
- Limitations
  - Cannot handle Google fixed web 2007 at page level
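As an illustration of inferring a power-law coefficient, the sketch below uses the approximate maximum-likelihood estimator for discrete data given by Clauset et al., α ≈ 1 + n / Σ ln(x_i / (x_min − 1/2)). The slides do not say which estimator variant or which `x_min` was used, so the function name, the default `x_min`, and the sample data are assumptions.

```python
import math

def power_law_alpha(degrees, x_min=1):
    """Approximate MLE of the exponent of a discrete power law.

    Only observations with degree >= x_min enter the fit. The approximation
    is generally considered more reliable for larger x_min.
    """
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(d / (x_min - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum

# Hypothetical samples: a heavier tail should yield a smaller exponent.
light_tail = power_law_alpha([1, 1, 1, 1])
heavy_tail = power_law_alpha([1, 2, 4, 8])
```

A larger fitted exponent means the distribution falls off faster, which is how the coefficients for the different corpora are compared.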
Page-level Graph properties – Degree Distributions
- Mobile web is sparser
- Out-degree distribution falls off faster for mobile web
- CHTML lies between XHTML+WML and fixed web

| Corpus | Avg node degree | Power-law coefficient (in-degree) | Power-law coefficient (out-degree) |
|---|---|---|---|
| XHTML+WML | 3.75 | 2.00 | 3.49 |
| CHTML | 5.06 | 1.99 | 4.06 |
| Webbase | 7.0 | 2.1 | 2.7 |
Page-level Graph properties – Bow-tie structure

| Corpus | SCC | IN | OUT | Tendrils | Disconnected |
|---|---|---|---|---|---|
| XHTML+WML | 10.5% | 18% | 10.4% | 18.3% | 42.7% |
| CHTML | 22% | 25.9% | 14.2% | 22% | 15.8% |
| Webbase | 33% | 11% | 39% | 13% | 4% |

- Mobile web
  - Smaller SCC
  - Larger IN and smaller OUT
  - Bigger Disconnected + Tendrils
- Connectivity: Fixed Web > CHTML > XHTML/WML
Language Properties
- Sub-graph of pages that share a common trait
  - Like keyword, location.
  - Called Thematically Unified Clusters (TUCs).
  - In fixed web, they retain the structural properties of the entire graph.
- Mobile web?

| Corpus | Language | Fraction of nodes |
|---|---|---|
| XHTML+WML | Chinese | 42.6% |
| XHTML+WML | English | 22.3% |
| XHTML+WML | Russian | 13.4% |
| XHTML+WML | French | 3.4% |
| XHTML+WML | German | 2.3% |
| CHTML | Japanese | 92.3% |
| CHTML | English | 5.9% |

| Corpus | SCC | IN | OUT | Tendrils | Disconnected |
|---|---|---|---|---|---|
| XHTML+WML (entire corpus) | 10.5% | 18% | 10.4% | 18.3% | 42.7% |
| XHTML+WML, Chinese | 13% | 22% | 9% | 14% | 42% |
| XHTML+WML, English | 2% | 3% | 7% | 25% | 63% |
| XHTML+WML, Russian | 22% | 40% | 8% | 11% | 19% |

- Japanese not studied separately: properties same as CHTML
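A language sub-graph of this kind can be sketched as the induced sub-graph on pages labeled with one language. The slides do not state how cross-language links were handled; dropping them, as below, is an assumption, as are the function name and the toy labels.

```python
def language_subgraph(edges, language_of, language):
    """Keep only edges whose two endpoints are both labeled `language`.

    `language_of` maps page -> language label. Unlabeled pages are excluded.
    """
    return [
        (src, dst)
        for src, dst in edges
        if language_of.get(src) == language and language_of.get(dst) == language
    ]

# Hypothetical labels for four pages.
language_of = {"p1": "en", "p2": "en", "p3": "ru", "p4": "ru"}
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p1")]
english_edges = language_subgraph(edges, language_of, "en")
russian_edges = language_subgraph(edges, language_of, "ru")
```

The structural measurements (degree distributions, bow-tie components) can then be run on each sub-graph separately and compared with the full corpus.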
Domain-level Graph Properties
- Domain-level graph
  - Collapse all nodes for a domain into a single super-node
- Compare mobile web 2007 and fixed web 2007
- Advantages
  - Allows us to understand the differences at a much coarser level
  - Allows us to compare present-day fixed and mobile webs

| Corpus | Avg node degree | SCC | IN | OUT | Tendrils + Disconnected |
|---|---|---|---|---|---|
| XHTML+WML | 3.91 | 40.6% | 40.7% | 2.73% | 15.9% |
| CHTML | 5.56 | 83% | 16.4% | 0.22% | 0.36% |
| Fixed web 2007 | 35.75 | 93.9% | 5.62% | 0.4% | 0.03% |

- Observations
  - Domain-level graphs are better connected.
  - XHTML+WML has a much larger Disconnected component.
  - CHTML properties lie between XHTML+WML and fixed web.
  - Structural differences between the domain-level fixed web and mobile web are the same as the differences between the page-level fixed web and mobile web.
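Collapsing a page-level graph into a domain-level graph can be sketched as follows. The slides do not define "domain" precisely; treating the URL hostname as the domain, dropping intra-domain links, and merging parallel links into one edge are all assumptions made for this example, as are the function name and the sample URLs.

```python
from urllib.parse import urlsplit

def domain_graph(page_edges):
    """Collapse (source_url, target_url) links into a set of domain edges."""
    edges = set()
    for src, dst in page_edges:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Skip links with no parseable host and links within one domain.
        if src_host is None or dst_host is None or src_host == dst_host:
            continue
        edges.add((src_host, dst_host))
    return edges

# Hypothetical page-level links.
page_edges = [
    ("http://a.example/1", "http://a.example/2"),  # intra-domain, dropped
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/y"),  # parallel to the above
    ("http://b.example/x", "http://c.example/"),
]
domains = domain_graph(page_edges)
```

Because many pages map to one super-node, the resulting graph is far smaller than the page-level graph, which is what makes the 2007 fixed web tractable at this level.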
Application: Impact on Crawling
- Crawling is resource-intensive.
  - Efficiency is important
- Higher level of disconnectedness
  - Need a larger and more diverse seed set
- Covering the IN component requires special care
- Depth-first strategy risks spending a disproportionate amount of time in Tendrils and Disconnected components
- Different languages have different levels of disconnectedness
  - Require a larger seed set for English pages than Russian pages
  - Crawl depth can be reduced for Russian sub-graph
- Sparseness can also give an advantage
  - The chance of encountering the same page again during a crawl is smaller
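The effect of seed choice and crawl depth can be illustrated with a small breadth-first crawl simulation. This is only a sketch on a toy graph, not a model of a production crawler; the function name, the graph, and the seed sets are assumptions.

```python
from collections import deque

def crawl_coverage(edges, seeds, max_depth):
    """Fraction of nodes reached by a breadth-first crawl from `seeds`,
    following links at most `max_depth` hops from any seed."""
    adj, nodes = {}, set()
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
        nodes.update((src, dst))
    if not nodes:
        return 0.0
    seen = set(seeds) & nodes
    queue = deque((s, 0) for s in seen)
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return len(seen) / len(nodes)

# Hypothetical toy bow-tie: core a-b-c, IN node i, OUT node o,
# tendril t hanging off i, and a disconnected pair x-y.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("i", "a"),
         ("c", "o"), ("i", "t"), ("x", "y")]
core_only = crawl_coverage(edges, {"a"}, max_depth=10)       # 0.5
diverse = crawl_coverage(edges, {"i", "x"}, max_depth=10)    # 1.0
shallow = crawl_coverage(edges, {"i", "x"}, max_depth=1)     # 0.625
```

Seeding only inside the core never reaches IN, the tendril, or the disconnected pair, whereas seeds placed in IN and in the disconnected part cover the whole toy graph.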
Conclusions
- Mobile web graph is structurally different
  - Sparser, more disconnected
  - Smaller SCC and OUT
- CHTML properties lie between XHTML+WML and fixed web
- Surprising preponderance of Chinese pages
- English sub-graph extremely disconnected
Future Work
- Only a first step
- Results motivate the need for a deeper and more extensive analysis
- Propose alternatives to bow-tie model for mobile web
- Better understanding of language sub-graphs
- Quantitatively characterize the impact of differences in structure on different search algorithms