  • Number of slides: 10

Key Challenges in Information Processing
James Hamilton
JamesRH@microsoft.com
Microsoft SQL Server
2002.03.01

Unsolved Challenges
1. Availability shows only incremental progress
2. Security broken & too hard to manage
3. Weakly structured data poorly supported or exploited
4. Writing multi-tiered apps too hard
    ◦ Data intensive mid-tiers need more DB help
5. Scalability over perf & big-iron

Availability: Largely unsolved problem
  • 1985 Tandem study (Gray):
    ◦ Administration: 42% downtime
    ◦ Software: 25% downtime
    ◦ Hardware: 18% downtime
  • 1990 Tandem study (Gray):
    ◦ Software: 62%
    ◦ Administration: 15%
    ◦ Most studies have admin contribution much higher
  • Observations:
    ◦ H/W downtime contribution trending to zero
    ◦ Software & admin costs dominate & growing
    ◦ We’re still looking at 10 to 15 year-old research

Availability: Cost in dollars/hour
  • Brokerage operations: $6,450,000
  • Credit card authorization: $2,600,000
  • eBay (1 outage, 22 hours): $225,000
  • Amazon.com: $180,000
  • Package shipping services: $150,000
  • Home shopping channel: $113,000
  • Catalog sales center: $90,000
  • Airline reservation center: $89,000
  • Cellular service activation: $41,000
  • On-line network fees: $25,000
  • ATM service fees: $14,000
From Dave Patterson talk at HPTS 2001. Sources: InternetWeek 4/3/2000 + Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p. 8, “...survey done by Contingency Planning Research.”
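To put the hourly figures in context, the sketch below converts an availability level into expected downtime hours per year and multiplies by a cost per hour. This is illustrative arithmetic only: the function names are hypothetical, the availability levels are assumptions rather than measurements from the talk, and only the dollar figures come from the slide.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

# Hourly outage costs copied from the slide (dollars per hour of downtime).
COST_PER_HOUR = {
    "Brokerage operations": 6_450_000,
    "Credit card authorization": 2_600_000,
    "ATM service fees": 14_000,
}


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for an availability fraction (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(service: str, availability: float) -> float:
    """Expected yearly outage cost: downtime hours times the slide's hourly cost."""
    return annual_downtime_hours(availability) * COST_PER_HOUR[service]


if __name__ == "__main__":
    # The availability levels below are assumed for illustration.
    for availability in (0.99, 0.999, 0.9999):
        cost = annual_downtime_cost("Brokerage operations", availability)
        print(f"{availability:.2%} available: ${cost:,.0f} per year")
```

Each additional "nine" of availability cuts the expected yearly cost by a factor of ten, which is why even small availability gains matter at these hourly rates.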

Availability: Admin still the problem
  • Administrators expensive
    ◦ Admin dominate H/W & S/W costs (5x or more)
  • Administrators make mistakes
    ◦ Admin #1 or #2 cause of downtime
  • Big problem yet little research focus:
    ◦ IBM SMART, Microsoft index tuning wizard, etc.
    ◦ Dave Patterson, Aaron Brown, Armando Fox, ...
    ◦ More help needed
  • Still few data points available:
    ◦ Most systems houses won’t publish... need research
  • No benchmarks:
    ◦ Benchmarks drive industry & systems research
  • Goal: Server appliance model:
    ◦ Auto-tuning, pluggable server-side resources

Availability: the S/W is broken
  • Even server-side software is BIG:
    ◦ Windows 2000: over 50 mloc
    ◦ DB: 1.5+ mloc
    ◦ SAP: 37 mloc (4,200 S/W engineers)
  • Tester to developer ratios above 1:1
    ◦ Quality per unit line only incrementally improving
    ◦ Current massive testing investment not solving problem
  • New approach needed:
    ◦ Assume S/W failure inevitable
    ◦ Redundant, self-healing systems right approach
      ▪ Tandem process-pair work good but getting fairly old... progress?
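The process-pair idea mentioned on this slide can be illustrated with a toy, single-process simulation: a primary applies each update, checkpoints it to a backup, and the backup takes over when the primary fails. This is a minimal sketch under stated assumptions, not Tandem's mechanism; the `Worker` and `ProcessPair` classes are hypothetical names, and failure is simulated with a flag rather than a real process crash.

```python
class Worker:
    """One half of a process pair: holds state and applies updates."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.alive = True  # flipped to False to simulate a crash

    def apply(self, key, value):
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.state[key] = value


class ProcessPair:
    """Primary/backup pair with checkpointing and takeover (simulation only)."""

    def __init__(self):
        self.primary = Worker("primary")
        self.backup = Worker("backup")
        self.takeovers = 0

    def write(self, key, value):
        try:
            self.primary.apply(key, value)
        except RuntimeError:
            # Primary failed: promote the backup and retry the update there.
            self._takeover()
            self.primary.apply(key, value)
        # Checkpoint the update so the backup can resume from this point.
        # Simplification: a failed backup is not handled here.
        self.backup.apply(key, value)

    def _takeover(self):
        survivor = self.backup
        survivor.name = "primary"
        # Start a fresh backup seeded from the survivor's checkpointed state.
        replacement = Worker("backup")
        replacement.state = dict(survivor.state)
        self.primary, self.backup = survivor, replacement
        self.takeovers += 1


if __name__ == "__main__":
    pair = ProcessPair()
    pair.write("balance", 100)
    pair.primary.alive = False      # simulate a primary crash
    pair.write("balance", 150)      # served by the promoted backup
    print(pair.takeovers, pair.primary.state)
```

The caller never sees the failure; it only sees a completed write, which is the "assume S/W failure inevitable" stance applied at the smallest scale.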

Security: Securing systems too hard
  • “Less than 0.0025% of corp revenue invested in security” - Richard Clarke, special security advisor to the president
  • Data loss, intentional data & systems corruption
    ◦ Clearly under-reported problem
  • S/W vulnerabilities rampant:
    ◦ Buffer overruns, stack smashing, code insertion, SQL insertion, elevation of privs, ...
    ◦ Programmers being more careful doesn’t solve problem
  • Most systems misconfigured:
    ◦ Security systems too complex & hard to admin
  • Research needed: Autonomous threat detection
    ◦ Better tools to detect, correct, & prevent S/W security vulnerabilities
    ◦ Monitor all measurable system metrics:
      ▪ Detecting new threats & misconfigurations
      ▪ Track execution profiles: detect changes: drive alerts, auto-config, reports to vendor, upgrade S/W, ...
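One very simple reading of "track execution profiles: detect changes: drive alerts" is to compare new metric samples against a recorded baseline and flag large deviations. The sketch below assumes a z-score test with a threshold of three standard deviations; both the technique and the `find_anomalies` name are illustrative choices, not something the talk prescribes.

```python
from statistics import mean, stdev


def find_anomalies(baseline, observed, threshold=3.0):
    """Return (index, value) pairs from `observed` that deviate from the
    baseline mean by more than `threshold` baseline standard deviations."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as a change.
        return [(i, v) for i, v in enumerate(observed) if v != mu]
    return [
        (i, v)
        for i, v in enumerate(observed)
        if abs(v - mu) / sigma > threshold
    ]


if __name__ == "__main__":
    # Hypothetical metric: logins per minute recorded during normal operation.
    baseline = [100, 102, 98, 101, 99]
    observed = [101, 250, 99]
    for index, value in find_anomalies(baseline, observed):
        print(f"alert: sample {index} has value {value}")
```

A real system would need per-metric baselines that adapt over time; a fixed baseline like this one is only enough to show the shape of the idea.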

Unstructured Data:
  • All data has some schema but not always fully known nor affordable to pre-declare:
    ◦ Mostly not stored in DB
    ◦ Most data in unstructured stores with text search
  • DB community is losing
  • Much research work on XML focused upon:
    ◦ Mapping XML to relational schemas
      ▪ Leverages existing relational IQ but not as flexible
    ◦ New, non-relational (native XML) stores
      ▪ Storing natively doesn’t leverage DB investment
      ▪ Mostly mid-tier data integration servers
  • Research potential:
    ◦ Native stores leveraging existing infrastructure, esp. cost-based optimizers, storage engines, & utilities
  • IR work progressing but little integration into DB
    ◦ Integrating IR work into DB W/O required schema, ability to exploit if there, ability to discover/infer if not
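The "mapping XML to relational schemas" approach can be sketched with Python's standard `xml.etree.ElementTree` and `sqlite3` modules. The document shape, the `orders` table, and the `shred_orders` function are all hypothetical; the point is only to show why the mapping reuses relational machinery but is less flexible, since anything not declared in the table is silently dropped.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document; the second order carries an optional <note>.
XML_DOC = """
<orders>
  <order id="1" customer="alice"><total>19.99</total></order>
  <order id="2" customer="bob"><total>5.00</total><note>gift</note></order>
</orders>
"""


def shred_orders(xml_text, conn):
    """Map each <order> element to one row of a pre-declared table."""
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer TEXT, total REAL, note TEXT)"
    )
    root = ET.fromstring(xml_text.strip())
    for order in root.findall("order"):
        conn.execute(
            "INSERT INTO orders (id, customer, total, note) VALUES (?, ?, ?, ?)",
            (
                int(order.get("id")),
                order.get("customer"),
                float(order.findtext("total")),
                order.findtext("note"),  # None (NULL) when the element is absent
            ),
        )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    shred_orders(XML_DOC, conn)
    # Once shredded, ordinary SQL and the relational optimizer apply.
    for row in conn.execute("SELECT customer, total, note FROM orders ORDER BY id"):
        print(row)
```

Any element the schema did not anticipate would require an `ALTER TABLE` or would be lost, which is the inflexibility the slide points to.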

Multi-tiered apps: we’re not helping
  • Many high scale multi-tiered apps still hand crafted
    ◦ Needed: Object access layer, data cache, queuing, query compiler & optimizer, data directed routing, security, ...
  • Problem not adequately solved by industry
  • Integration with server-tier DB advantages:
    ◦ ACID relaxation driven by attributes on apps or data
      ▪ Relaxed models with auto-cache population & mgmt
    ◦ Query parsing for data directed routing
      ▪ Want to parse once & accept same lang as backend
    ◦ Exploit optimizer: model full mid-tier to back-end costs
      ▪ Where to run joins, functions, aggs, etc.
    ◦ Need security integration W/O fully provisioning backend
  • Data intensive mid-tiers are a DB & TP problem:
    ◦ Solve with DB tech & integrate with backend DB
    ◦ Componentized DB for mid-tier use one approach
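Data directed routing means the mid-tier picks a back-end server from a value in the request instead of broadcasting to all of them. A minimal sketch, assuming range partitioning on an integer key: the `RangeRouter` class, the boundary values, and the back-end names are hypothetical, and a real mid-tier would extract the key by parsing the query rather than receiving it directly.

```python
import bisect


class RangeRouter:
    """Route a key to a back end using range partitioning.

    boundaries[i] is the exclusive upper bound of backends[i];
    the last back end receives every key at or above the final boundary.
    """

    def __init__(self, boundaries, backends):
        if len(backends) != len(boundaries) + 1:
            raise ValueError("need exactly one more backend than boundaries")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be sorted ascending")
        self.boundaries = list(boundaries)
        self.backends = list(backends)

    def route(self, key):
        # bisect_right counts the boundaries that are <= key, which is
        # the index of the partition holding the key.
        return self.backends[bisect.bisect_right(self.boundaries, key)]


if __name__ == "__main__":
    # Hypothetical layout: customer ids below 1000 on db-a, 1000-1999 on db-b.
    router = RangeRouter([1000, 2000], ["db-a", "db-b", "db-c"])
    for customer_id in (5, 1000, 2500):
        print(customer_id, "->", router.route(customer_id))
```

Routing is the easy part; the slide's point is that obtaining the key requires the mid-tier to parse the same query language the back end accepts.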

Scalability: perf not the problem
  • Focus still on performance rather than scalability:
    ◦ Clusters only “nearly” work
    ◦ Must buy biggest iron & get most from it
  • Research goal: Server appliances
    ◦ Gray’s servers by the brick
      ▪ Brick includes disk, memory, & CPU resources
    ◦ Only admin actions required:
      ▪ Add brick to, or defect from, cluster
    ◦ Data redundancy (potentially) on geo-scale:
      ▪ Adapts to access patterns & available bandwidth
  • If zero-admin clusters actually worked & scaled:
    ◦ Performance would be a secondary issue
    ◦ The admin problem would nearly go away
    ◦ The S/W quality problem greatly simplified
      ▪ Heisenbugs solved via retry and redundancy
    ◦ Would shift investment dollars from H/W & admin to S/W (where it belongs)
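"Heisenbugs solved via retry and redundancy" can be sketched as a client that retries a request on the same replica a few times (transient faults often vanish on retry) and then moves on to the next replica. The `call_with_retry` function, its parameters, and the stand-in replicas are hypothetical; replicas are modelled as plain callables rather than network endpoints.

```python
def call_with_retry(replicas, request, attempts_per_replica=2):
    """Try each replica in order, retrying each a few times before moving on.

    `replicas` is a sequence of callables that take the request and either
    return a result or raise an exception.
    """
    last_error = None
    for replica in replicas:
        for _ in range(attempts_per_replica):
            try:
                return replica(request)
            except Exception as exc:  # broad on purpose: any fault triggers retry
                last_error = exc
    raise RuntimeError("all replicas failed") from last_error


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_replica(request):
        # Fails on its first call only, imitating a transient fault.
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise OSError("transient fault")
        return request.upper()

    print(call_with_retry([flaky_replica], "ping"))
```

Retrying is only safe when requests are idempotent or the replicas deduplicate them, which this sketch does not attempt to model.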