ed6ccb804daecddd652a563894a678b1.ppt
- Number of slides: 58
IBM Software Group
Globalizing Software
Markus Scherer & Mark Davis
© 2005-2006 IBM Corporation

Presentation Goals
§ Gain a fundamental understanding of globalization
§ Become able to advise users of existing software
§ Know how to find more information

International Markets

International Markets 2

Globalization & Localization
Globalization:
§ Single character set
§ Single executable
§ Single install
§ Single server serves all clients in all languages
Localization:
§ Based on globalized software
§ Adds specific translations and adaptations for particular languages and markets
Globalized software can be localized without code changes.

Isolated System Model
§ For example, using cp932 (Shift-JIS) for text
§ Not prepared to deal with other data sources

Connected System Model
§ Arbitrary data sources, any language, any place, any code page
§ Character set mismatch causes data corruption
§ Data format mismatch causes data corruption

What is Unicode?
Unicode provides a unique number for every character
ﺃﺴﺎﺍ، ﺗﺘﻌﺎﻣﻞ ﺍﻟﺤﻮﺍﺳﻴﺐ ﻓﻘﻂ ﻣﻊ ﺍﻷﺮﻗﺎﻡ
ユニコードは、すべての文字に固有の番号を付与します
יוניקוד מקצה מספר ייחודי לכל תו
Η κωδικοσελίδα Unicode προτείνει έναν και μοναδικό αριθμό για κάθε χαρακτήρα

Why Unicode?
§ Avoids data corruption
§ Single encoding for text in all languages
§ Makes software globalization possible
  - Vastly reduces development cost
  - Vastly reduces maintenance, update and support cost

Non-Globalized Component
§ Does not use Unicode
§ Hard-coded date/time formatting & parsing
§ Hard-coded number & currency formatting & parsing
§ Hard-coded collation (sorting/searching/matching)
§ Other hard-coded operations
§ Hard-coded literals

Convert to Unicode
✓ Unicode  § Dates & times  § Numbers & currencies  § Collation  § Literals
§ Unicode can be UTF-8 or UTF-16

Hard-Coded Date/Time Formatting & Parsing
✓ Unicode  § Dates & times  § Numbers & currencies  § Collation  § Literals
date → month + "/" + day + "/" + year

Reroute to Service: Date Formatting / Parsing
✓ Unicode  ✓ Dates & times  § Numbers & currencies  § Collation  § Literals
date → service → "2005年12月14日水曜日", "14. Dezember 2005", …

Hard-Coded Number Formatting & Parsing
✓ Unicode  ✓ Dates & times  § Numbers & currencies  § Collation  § Literals
<currency, number> → "$" + integer + "." + decimals

Reroute to Service: Number Formatting / Parsing
✓ Unicode  ✓ Dates & times  ✓ Numbers & currencies  § Collation  § Literals
<currency, number> → service → "1 234,57 руб.", "1,234.57 Rubles", …

Hard-Coded Collation (Sorting)
✓ Unicode  ✓ Dates & times  ✓ Numbers & currencies  § Collation  § Literals
A < Ä < B < Z

Reroute to Service: Collation
✓ Unicode  ✓ Dates & times  ✓ Numbers & currencies  ✓ Collation  § Literals
<string1, string2> → service → "Ä < Z" or "Z < Ä", depending on the language, …

Hard-Coded String Literals
✓ Unicode  ✓ Dates & times  ✓ Numbers & currencies  ✓ Collation  § Literals
menuItem.setTitle("File")

Reroute to Service: Translated Resource Lookup
✓ Unicode  ✓ Dates & times  ✓ Numbers & currencies  ✓ Collation  ✓ Literals
<"File", German> → Resource Manager (German, Chinese, French, …) → "Datei"

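The lookup on this slide can be sketched in a few lines. This is only an illustration: the `RESOURCES` table, its contents, and the `lookup` helper are hypothetical, and a real product would load translations from resource bundles rather than an in-code dictionary.

```python
# Hypothetical translation tables, keyed by language and then by source string.
RESOURCES = {
    "de": {"File": "Datei"},
    "fr": {"File": "Fichier"},
}


def lookup(key, language):
    """Return the translation of key for language, or key itself if none is known."""
    return RESOURCES.get(language, {}).get(key, key)


# The UI code asks the resource manager instead of hard-coding the literal.
menu_title = lookup("File", "de")
```

Falling back to the untranslated key is one possible design choice; it keeps the UI usable when a translation is missing.
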
Services
§ Charset Conversions
§ Formatting & Parsing
  - Date & time
  - Messages
  - Numbers & currencies
§ Translated Names
  - Languages, Regions (Countries), Scripts, Timezones, Currencies
§ Calendar, Time Zone, Date/Time conversions
§ Collation
  - Searching, Sorting, Matching
§ Segmentation
  - word, line, …
§ Transforms
  - Normalization
  - Casing
  - Transliterations
§ Unicode Regular Expressions
§ Complex-Text Display / Input
§ …

Globalization Preferences

Preference    Example                          Standard
Language      en_US (or en-US)                 RFC 3066 (or successor)
Territory     AU                               ISO 3166
Currency      EUR                              ISO 4217
Timezone      Australia/Melbourne              TZDB
Calendar      islamic-civil                    CLDR Calendar ID
Custom Date   yyyy-MMM-dd                      CLDR Pattern Format
VAT           08.23% (books), 15.73% (food)    App/Country-Specific
…             …                                …

Exact composition depends on system requirements!

Incremental System Migration
§ Large system: change components incrementally
§ Adapters between modified and original components
§ Unicode bus between modified components
[Diagram: components connected through adapters to a Unicode bus]

Code Page Adapter
§ Unicode ⊃ Code Page
§ Characters missing in code page:
  - Escape (e.g., XML/HTML: &#x20AC;) or
  - Error (if handshake possible) or
  - Downgrade (replacement character)
[Diagram: Unicode ↔ Conversion ↔ Code Page]

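The three policies for characters missing from the target code page can be tried out with Python's built-in codecs. This is a sketch only: the `to_code_page` helper and its policy names are made up for illustration, and Python's escape handler emits decimal references (`&#8364;`) rather than the hexadecimal form shown on the slide.

```python
EURO = "\u20ac"  # the euro sign is not representable in ISO 8859-1 (Latin-1)

# Map the slide's three policies onto Python's standard codec error handlers.
POLICIES = {
    "escape": "xmlcharrefreplace",  # numeric character reference
    "error": "strict",              # raise so the caller can negotiate
    "downgrade": "replace",         # substitute a replacement character
}


def to_code_page(text, code_page, policy):
    """Convert Unicode text to a legacy code page using the given policy."""
    return text.encode(code_page, errors=POLICIES[policy])


escaped = to_code_page("5 " + EURO, "latin-1", "escape")
downgraded = to_code_page("5 " + EURO, "latin-1", "downgrade")
```

With the "error" policy the call raises `UnicodeEncodeError`, which corresponds to the case where the two sides can handshake about the failure.
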
Neutral Data Formats
§ Do not use localized formats for internal data
§ E.g. monetary value
  - $123.4 → USA? Australia? Zimbabwe?
  - Interchange complete data: include currency code
  - Use <numeric value, currency code>, e.g. <1.234×10², USD>
§ Neutral formats
  - Faster processing
  - Unambiguous
  - Convert (format/parse) at user interface boundaries
  - en_US: $123.40   en_AU: US$123.4   hi_IN: $१२३.४०

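A neutral monetary value can be modelled as a pair of numeric value and currency code. The sketch below is one possible shape, not a prescribed design; the `Money` type and the `add` helper are hypothetical names.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    amount: Decimal  # neutral numeric value: no grouping, no symbol
    currency: str    # ISO 4217 code, e.g. "USD"


def add(a, b):
    """Add two amounts; refuse to mix currencies silently."""
    if a.currency != b.currency:
        raise ValueError("currency mismatch")
    return Money(a.amount + b.amount, a.currency)


price = Money(Decimal("123.40"), "USD")
total = add(price, Money(Decimal("0.60"), "USD"))
```

Only the user interface layer would turn `total` into a localized string such as "$124.00"; everything in between works on the unambiguous pair.
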
Unicode Overview
§ Unicode Text Encodings
§ Unicode Gives Characters Meaning and Behavior
  - Data
  - Algorithms
§ Case Mapping
§ Forms of Text
§ Right-To-Left and Bi-Directional Text
§ Sorting, Searching, Matching
§ Security
§ Common Locale Data Repository

Unicode Text Encodings
UTF-16
§ In-memory strings, best for processing
§ Java, .NET, Windows, Mac OS X, JavaScript, inside browsers, …
  String aa = "a\u00E4";
UTF-8
§ Storage & protocols
§ .txt, .html, .xml, …
  <?xml version="1.0" encoding="UTF-8"?>

Unicode Text Encoding Examples

Character     Code Point  UTF-16     UTF-8
a             U+0061      0061       61
ä             U+00E4      00E4       C3 A4
σ             U+03C3      03C3       CF 83
א             U+05D0      05D0       D7 90
٣             U+0663      0663       D9 A3
カ            U+30AB      30AB       E3 82 AB
退            U+9000      9000       E9 80 80
(ideograph)   U+21BC1     D846 DFC1  F0 A1 AF 81

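The rows of such a table can be recomputed with Python's standard codecs. The `encodings` helper below is a hypothetical name used only for this sketch; it returns the code point, the UTF-16 code units and the UTF-8 bytes of a single character.

```python
def encodings(ch):
    """Return (code point, UTF-16 code units, UTF-8 bytes) as hex strings."""
    code_point = f"U+{ord(ch):04X}"
    # Big-endian UTF-16 without a byte order mark, grouped into 16-bit units.
    utf16 = ch.encode("utf-16-be").hex(" ", 2).upper()
    utf8 = ch.encode("utf-8").hex(" ").upper()
    return code_point, utf16, utf8


bmp_example = encodings("\u00e4")              # a BMP character: one UTF-16 unit
supplementary_example = encodings("\U00021bc1")  # needs a surrogate pair
```
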
Unicode Gives Characters Meaning and Behavior: Data
§ Ideographic: 不 与
§ Uppercase: A Ξ
§ Alphabetic: a ξ
§ Quotation_Mark: " ' « » ‘ ’ 『 』
§ Numeric_Value: ੫ → 5, ٣ → 3, ৪ → 4

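Some of these properties can be inspected with Python's `unicodedata` module. The sketch below covers only the general category and the numeric value; the standard module does not expose every property named on the slide (for example Ideographic or Quotation_Mark), and `describe` is a hypothetical helper.

```python
import unicodedata


def describe(ch):
    """Collect a few Unicode character properties for one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),     # e.g. "Lu" for uppercase letter
        "numeric": unicodedata.numeric(ch, None),  # None if not numeric
    }


arabic_three = describe("\u0663")   # ARABIC-INDIC DIGIT THREE
gurmukhi_five = describe("\u0a6b")  # GURMUKHI DIGIT FIVE
capital_xi = describe("\u039e")     # GREEK CAPITAL LETTER XI
```
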
Unicode Gives Characters Meaning and Behavior: Algorithms
§ Case mapping
§ Case folding & case-insensitive comparison
§ Collation
§ Bidi
§ Normalization
§ Line breaking
§ …

Case Mapping
§ dz ↔ Dz ↔ DZ
§ Heiß → HEISS → heiss
§ όσος ↔ ΌΣΟΣ
§ topkapı istanbul ↔ TOPKAPI İSTANBUL (Turkish, tr)

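Python's built-in string methods apply the default (language-independent) Unicode case mappings, which is enough to reproduce most of these examples. The Turkish line needs a language-specific tailoring that the standard library does not provide; `upper_tr` below is a deliberately simplified, hypothetical stand-in for that tailoring.

```python
# One-to-many mapping: the sharp s uppercases to two letters.
upper_german = "Hei\u00df".upper()

# Context-sensitive mapping: the final capital sigma lowercases to the final form.
lower_greek = "\u038c\u03a3\u039f\u03a3".lower()

# Digraph letter with three cases: lowercase, titlecase, uppercase.
dz_upper = "\u01c6".upper()
dz_title = "\u01c6".title()

# The default mapping ignores Turkish dotted/dotless i.
default_upper = "istanbul".upper()


def upper_tr(text):
    """Simplified Turkish uppercasing: i -> dotted capital I, dotless i -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


turkish_upper = upper_tr("topkap\u0131 istanbul")
```
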
Forms of Text
ä (U+00E4) = a + ¨ (U+0061 + U+0308)
§ Equivalent text – equivalent behavior
§ Same display (for supported repertoire)
§ Normalization generates unique forms

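The equivalence of the two forms can be checked with `unicodedata.normalize`. The `equivalent` helper is a hypothetical name for this sketch; it compares two strings after bringing both into the same normalization form.

```python
import unicodedata

composed = "\u00e4"     # single code point U+00E4
decomposed = "a\u0308"  # U+0061 followed by combining diaeresis U+0308


def equivalent(a, b):
    """Compare two strings for canonical equivalence via NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


same_text = equivalent(composed, decomposed)
```

A plain `==` on the raw strings reports them as different, which is why matching should normalize first.
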
Right-To-Left and Bi-Directional Text
(، ﺃﺒـﻞ IBM). آﻲ. ﺑﻲ. ﺇﻡ (، ﻳﻭـﺖ ﺑـﺎﻛـﺮﺩ APPLE) ،(Hewlett-Packard) ﻣﺎﻳﻜﺮﻭﺳﻮﻓﺖ (، ﺃﻮﺭﺍـﻞ Microsoft) (Sun) (، ﺻﻦ Oracle) … ISO ) ١٠٦٤٦ ﺇﻳﺰﻭ (10646
§ Text stored in logical order: no special consideration for processing, only for UI and for legacy encoding conversion
§ RTL text (mostly Arabic and Hebrew) flows from right to left
§ Embedded numbers and LTR text flow left to right
§ Line break preserves reading order
§ Selection: contiguous text ≠ contiguous display

Sorting, Searching, Matching
§ Binary order: A < C < Z < a < c < z < Ç
  - Code point order (same as UTF-8 binary comparison)
  - UTF-16 order (Java String binary comparison)
  - Refinements, usually only for matching, not sorting
    - Case-insensitive
    - Matching equivalent forms of text
§ Language-sensitive collation: a < A < c < C < Ç < z < Z

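The difference between code point order and UTF-16 order only shows up when supplementary characters are compared with BMP characters above the surrogate range. The sketch below demonstrates this with Python sort keys; the sample strings are arbitrary choices for illustration.

```python
def utf8_key(s):
    """Binary comparison of UTF-8 bytes (matches code point order)."""
    return s.encode("utf-8")


def utf16_key(s):
    """Binary comparison of UTF-16 code units, as in a Java String comparison."""
    return s.encode("utf-16-be")


# U+FF5E is a BMP character above the surrogates; U+21BC1 is supplementary.
words = ["\uff5e", "\U00021bc1", "Z", "a", "\u00c7"]

by_code_point = sorted(words)            # Python compares by code point
by_utf8 = sorted(words, key=utf8_key)
by_utf16 = sorted(words, key=utf16_key)
```

In UTF-16 order the supplementary character sorts before U+FF5E, because its lead surrogate D846 is numerically smaller than FF5E.
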
Collation: UCA + Language Tailorings
§ Context-sensitive, language-sensitive
  - china < China < chinas
  - æ ≅ a+e
  - c < d < ... k < ch < l
  - Adding/removing a trailing character can change sorting considerably
§ String → sequence of weights; not reversible
§ Attributes: lowercase first, ignore case or punctuation, …

Security: Spoofing with Look-Alikes
Olive – 01ive
ICU – 1CU
Ham – Harn
Paypal – Paypаl
§ Not new with Unicode, but more opportunities due to more characters
§ UTR #36: Unicode Security Considerations

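One of the look-alike cases, a Cyrillic letter inside an otherwise Latin word, can be flagged with a rough mixed-script check. This is a heuristic sketch, not the procedure from UTR #36: it uses the first word of each character name as a stand-in for the Script property, and it cannot catch same-script spoofs such as "01ive" or "Harn". The helper names are hypothetical.

```python
import unicodedata


def scripts_used(text):
    """Approximate the set of scripts in text from Unicode character names."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


def looks_mixed(text):
    """Flag text whose letters appear to come from more than one script."""
    return len(scripts_used(text)) > 1


spoofed = "Payp\u0430l"  # U+0430 is CYRILLIC SMALL LETTER A
genuine = "Paypal"
```
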
Common Locale Data Repository (CLDR)
§ Industry standard for locale data
§ Adoption brings consistency across industry
§ Display names for languages, countries, currencies, etc.
§ Date/time/number formats and data for parsing
§ Language tailorings for collation and text segmentation

Globalization Service Libraries
§ On Windows only, use Win32 or .NET APIs
§ In Java, use ICU4J
§ Other platforms/cross-platform in C/C++, use ICU4C
§ Other programming languages have wrappers for ICU or are planning to integrate ICU, e.g., PHP, Python

What is ICU?
§ International Components for Unicode
§ Globalization / Unicode / Locales
§ Mature, widely used set of C/C++ and Java libraries
  - Basis for Java 1.1 internationalization, but goes far beyond Java 1.1
§ Very portable – identical results on all platforms / programming languages
  - C/C++: 30+ platforms/compilers
  - Java: IBM & Sun JDK
  - You can use: C/C++ (ICU4C), Java (ICU4J), C/C++ with Java (ICU4JNI)
§ Full threading model
§ Customizable
§ Modular
§ Open source – but non-restrictive

Who uses ICU?
§ Products within IBM
  - All 5 major software brands
  - Many other related software applications
  - Used on all IBM operating systems
§ Other companies and organizations
  - Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, Caris, CERN, Cognos, Debian Linux, Gentoo Linux, HP, Home Depot, Inktomi, JD Edwards, Macromedia, Mathworks, MKS, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE Linux, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping LLC., Xerox, Yahoo! ... and many more

ICU Features
§ Unicode text handling
§ Charset conversions (700+)
§ Collation & Searching
§ Resource Bundles
§ Complex-text layout engine
§ Breaks: word, line, …
§ Formatting
  - Date & time
  - Messages
  - Numbers & currencies
§ Locales from CLDR (250+)
§ Calendar & Time zones
§ Unicode Regular Expressions
§ Transforms
  - Normalization
  - Casing
  - Transliterations

Architecture Overview 1
§ Locale-based services
  - Locale is an identifier, not a container
  - Keywords for variants: de@collation=phonebook
§ Resource inheritance: shared resources
[Diagram: locale inheritance tree with levels root, Language (en, de, zh), Script (Hant, Hans), Region (US, IE, DE, CH, TW, CN)]

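Resource inheritance can be pictured as a walk from the most specific locale up to root. The sketch below is a simplification written for this note, not ICU's actual lookup code: `fallback_chain`, `resolve` and the bundle contents are hypothetical, and real implementations handle further cases such as aliases and default scripts.

```python
def fallback_chain(locale_id):
    """List the locale and its ancestors, ending at root.

    Keywords after '@' select service variants and are not part of the chain.
    """
    base = locale_id.split("@", 1)[0]
    parts = base.split("_")
    chain = ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]
    return chain + ["root"]


def resolve(key, locale_id, bundles):
    """Return the first value found for key along the fallback chain."""
    for locale in fallback_chain(locale_id):
        if key in bundles.get(locale, {}):
            return bundles[locale][key]
    raise KeyError(key)


# Hypothetical bundles: de_CH overrides nothing here, so it inherits from de.
BUNDLES = {
    "root": {"file": "File", "ok": "OK"},
    "de": {"file": "Datei"},
    "de_CH": {},
}

title = resolve("file", "de_CH", BUNDLES)
button = resolve("ok", "de_CH", BUNDLES)
```

Because the locale is only an identifier, shared data lives once in the parent bundle and every child locale sees it.
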
Architecture Overview 2
§ Open and close service model
  - Open a service object, use it many times, close it when done
  - Better performance by avoiding setup costs per operation
§ ICU threading model
  - Multiple service objects in use simultaneously with same or different attributes
  - Large resources shared in read-only cache
  - Compatible with Java threading model

Architecture Overview 3
§ Data-driven services
  - Customize at build-time or run-time
  - Interchange with other platforms
    - Same results on each
  - Rule-based
    - Collation, Word-breaks, Transforms
  - Pattern-based
    - Date/Time/Number/Message formatting
  - Table-based
    - Character Conversion

Architecture Overview – ICU4J
§ Supplement for Java
§ Core globalization (no character conversion or regular expressions)
  - We do supply complex text support for Sun
§ Modularized: products may add just the needed functionality
§ Usually drop-in replacement for JDK functionality
  - Changing the import statements is usually all that is needed

Character Set Conversion
§ Precise alias information:
  - When you ask for "Shift-JIS", you can request the precise definition by platform (e.g. Windows, IBM, Java, …)
§ Runtime customizations allowed for:
  - illegal sequences
  - undefined characters

Collation: Sorting, Searching and Matching
§ Fast international comparison for string search; fully UCA compliant
  - Compressed sort keys, optimized string comparison, sublinear string search
  - Incremental sort keys used for radix sorting
§ Precise binary sort key stability over time (library versioning)

Calendar & Time Zones
§ International calendars – Islamic, Buddhist, Hebrew, Japanese
  - Required for correct presentation of dates in some countries
§ Olson timezone support with localizations

Unicode Regular Expressions
§ Full regex implementation
  - C/C++ only: Java 1.4 has its own package (though not as powerful)
§ All Unicode 4.1 properties
  - Supported through UnicodeSet
§ Good performance
  - Competitive with non-Unicode regex

References
Unicode: http://www.unicode.org/
IBM software globalization: http://ibm.com/software/globalization
ICU docs & papers: http://icu.sourceforge.net/docs/
ICU: http://ibm.com/software/globalization/icu
ICU (IBM intranet): http://icu.sanjose.ibm.com/

Q & A

Backup Slides

Thought Experiment: Alternative to Unicode
§ Could have tagged pieces of text with code pages
  - À la ISO 2022
  - Like tagging each integer value with whether it is encoded with 1's complement or 2's complement
§ Too hard to use, too many problems
§ Instead: one single encoding for all languages

Architecture Overview – ICU4C
§ Simple error handling
  - Thread safe
  - Works in C and C++
§ C/C++ subset for portability
§ Version management
  - Multiple versions of ICU4C in the same process memory space
  - Data and library versioning
§ String buffer management
  - Preflighting and overflow protection
§ Flexible
  - Allows loading and unloading ICU4C libraries
  - Runtime settable memory allocation and mutex functions

ICU4J: Supplement for Java
§ CLDR (Common Locale Data Repository)
  - More fully supported locales than Java
§ Up-to-date globalization: standards-compliant; latest Unicode
  - Supplementary characters (GB 18030, JIS X 0213, HKSCS)
    - Java 5 adds handling of supplementary characters
  - Full properties – JDK has only a fraction
  - Unicode Collation Algorithm
  - Local calendars (Islamic, Japanese, …); more time zone localizations
  - Currencies, String Search, Internationalized Domain Names
  - Transforms: Case, Scripts, Normalization
§ Much shorter release cycle and quicker support for the Unicode standard

Unicode Text Handling 2
§ All Unicode 4.1 properties
  - Direct API
    - values, names, enumerations
  - UnicodeSet
    - Fast, compact set operations (union, intersection, …)
    - Pattern-based (both Perl & POSIX syntax for properties): \p{greek} vs. [:greek:]
    - All properties: [\p{lowercase}-[a-z]], [\p{greek} & \p{uppercase}]

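The set algebra behind those two patterns can be imitated with ordinary Python sets. This is an analogy only, not the UnicodeSet API: it scans a small code point range, approximates the Lowercase and Uppercase properties with the general categories Ll and Lu, and approximates the Greek script by character name.

```python
import string
import unicodedata

# Limit the sketch to U+0000..U+04FF (Latin, Greek, Cyrillic and neighbours).
SAMPLE = [chr(cp) for cp in range(0x0500)]

lowercase = {ch for ch in SAMPLE if unicodedata.category(ch) == "Ll"}
uppercase = {ch for ch in SAMPLE if unicodedata.category(ch) == "Lu"}
greek = {ch for ch in SAMPLE if unicodedata.name(ch, "").startswith("GREEK")}

# Analogue of [\p{lowercase}-[a-z]]: set difference.
lowercase_non_ascii = lowercase - set(string.ascii_lowercase)

# Analogue of [\p{greek} & \p{uppercase}]: set intersection.
greek_uppercase = greek & uppercase
```
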
Formatting
§ Date & time: 8 formats per locale by default
§ Messages
  - Completely localizable, plural support
§ Numbers & currencies
  - Scientific notation, spelled-out (checks, etc.)
  - Full orthogonal currency support
    - INR in Hindi: र१,२३४.५७
    - INR in English: Rs. 1,234.57
    - INR in German: Rs. 1.234,57
§ Recent additions
  - List available currencies API
  - Short and stand-alone month/day names

Transforms
§ Unicode Normalization
  - Highly optimized for performance
  - Performance utilities: concatenation, detection, comparison
§ Casing (upper, lower, title, folding)
§ General transforms
  - Script transliterations
  - Half-width/Full-width, Hex, etc.
  - Chain transforms together, filter source characters
  - Rule-based, customizable at runtime
§ String Prep: NFS, Internationalized Domain Names (IDN)

Segmentation: word, line & sentence
§ Fast state-table implementation
§ Customizable
  - Rule-based – customizable at runtime
  - Special customizations, e.g. Thai
§ Recent additions:
  - Uses new UText API
    - Discontinuous text
    - Buffering
    - Usable with UTF-8, UTF-16 or UTF-32