
  • Number of slides: 58

IBM Software Group
Globalizing Software
Markus Scherer & Mark Davis
© 2005-2006 IBM Corporation

Presentation Goals
  • Gain a fundamental understanding of globalization
  • Become able to advise users of existing software
  • Know how to find more information

International Markets

International Markets 2

Globalization & Localization
  • Globalization
      - Single character set
      - Single executable
      - Single install
      - Single server serves all clients in all languages
  • Localization
      - Based on globalized software
      - Adds specific translations and adaptations for particular languages and markets
  Globalized software can be localized without code changes.

Isolated System Model
  • For example, using cp932 (Shift-JIS) for text
  • Not prepared to deal with other data sources

Connected System Model
  • Arbitrary data sources, any language, any place, any code page
  • Character set mismatch causes data corruption
  • Data format mismatch causes data corruption

What is Unicode?
  Unicode provides a unique number for every character
  ﺃﺴﺎﺍ، ﺗﺘﻌﺎﻣﻞ ﺍﻟﺤﻮﺍﺳﻴﺐ ﻓﻘﻂ ﻣﻊ ﺍﻷﺮﻗﺎﻡ
  ユニコードは、すべての文字に固有の番号を付与します
  יוניקוד מקצה מספר ייחודי לכל תו
  Η κωδικοσελίδα Unicode προτείνει έναν και μοναδικό αριθμό για κάθε χαρακτήρα

Why Unicode?
  • Avoids data corruption
  • Single encoding for text in all languages
  • Makes software globalization possible
      - Vastly reduces development cost
      - Vastly reduces maintenance, update and support cost

Non-Globalized Component
  • Does not use Unicode
  • Hard-coded date/time formatting & parsing
  • Hard-coded number & currency formatting & parsing
  • Hard-coded collation (sorting/searching/matching)
  • Other hard-coded operations
  • Hard-coded literals

Convert to Unicode
  ✓ Unicode
  • Dates & times
  • Numbers & currencies
  • Collation
  • Literals
  Unicode can be UTF-8 or UTF-16.

Hard-Coded Date/Time Formatting & Parsing
  ✓ Unicode
  • Dates & times
  • Numbers & currencies
  • Collation
  • Literals
  date → month + “/” + day + “/” + year

Reroute to Service: Date Formatting / Parsing
  ✓ Unicode
  ✓ Dates & times
  • Numbers & currencies
  • Collation
  • Literals
  date → 2005年12月14日水曜日, 14. Dezember 2005, …
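
The rerouting described on this slide can be sketched with the JDK's own `java.time` formatter, used here only as a stand-in for whatever formatting service a product adopts (the deck later recommends ICU4J for Java). The class name `LocalizedDates` is hypothetical, and the exact output text depends on the locale data shipped with the runtime.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

// Hypothetical helper: asks a locale-aware service for the pattern
// instead of concatenating month + "/" + day + "/" + year by hand.
final class LocalizedDates {
    static String format(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
                .withLocale(locale)
                .format(date);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2005, 12, 14);
        for (Locale locale : new Locale[] {Locale.US, Locale.GERMANY, Locale.JAPAN}) {
            // The same call produces a different, locale-appropriate string each time.
            System.out.println(locale.toLanguageTag() + ": " + format(date, locale));
        }
    }
}
```

The calling code never mentions a separator or field order, so adding a locale requires no code change.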

Hard-Coded Number Formatting & Parsing
  ✓ Unicode
  ✓ Dates & times
  • Numbers & currencies
  • Collation
  • Literals
  number → “$” + integer + “.” + decimals

Reroute to Service: Number Formatting / Parsing
  ✓ Unicode
  ✓ Dates & times
  ✓ Numbers & currencies
  • Collation
  • Literals
  number → 1 234,57 руб., 1,234.57 Rubles, …
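
As a rough illustration of this slide, the sketch below delegates currency formatting to the JDK's `NumberFormat`, which stands in for the formatting service. `LocalizedCurrency` is a hypothetical name, and symbol and spacing details vary with the runtime's locale data.

```java
import java.math.BigDecimal;
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

// Hypothetical helper: the currency (what the amount is) and the locale
// (how it is displayed) are passed independently of each other.
final class LocalizedCurrency {
    static String format(BigDecimal amount, String currencyCode, Locale locale) {
        NumberFormat format = NumberFormat.getCurrencyInstance(locale);
        format.setCurrency(Currency.getInstance(currencyCode)); // ISO 4217 code
        return format.format(amount);
    }

    public static void main(String[] args) {
        BigDecimal amount = new BigDecimal("1234.57");
        System.out.println(format(amount, "USD", Locale.US));
        System.out.println(format(amount, "EUR", Locale.GERMANY));
        System.out.println(format(amount, "RUB", Locale.forLanguageTag("ru-RU")));
    }
}
```

Grouping and decimal separators, symbol placement and digit shapes all come from locale data rather than from string concatenation.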

Hard-Coded Collation (Sorting)
  ✓ Unicode
  ✓ Dates & times
  ✓ Numbers & currencies
  • Collation
  • Literals
  A < Ä

Reroute to Service: Collation
  ✓ Unicode
  ✓ Dates & times
  ✓ Numbers & currencies
  ✓ Collation
  • Literals
  Ä
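
To make the difference concrete, here is a minimal sketch that contrasts hard-coded binary comparison with a collation service. It uses the JDK's `java.text.Collator` as the service; the class name `CollationDemo` and the sample words are assumptions made for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class CollationDemo {
    // Hard-coded approach: String.compareTo orders by UTF-16 code unit value.
    static List<String> binarySort(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Service approach: the collator applies the locale's ordering rules.
    static List<String> localeSort(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // "\u00C4rger" is "Ärger"; the escape keeps the source file ASCII-only.
        List<String> words = Arrays.asList("Zebra", "\u00C4rger", "Apfel");
        System.out.println(binarySort(words));
        System.out.println(localeSort(words, Locale.GERMAN));
    }
}
```

With binary comparison the word starting with Ä lands after "Zebra", because U+00C4 is numerically greater than "Z"; the German collator places it next to the other A words.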

Hard-Coded String Literals
  ✓ Unicode
  ✓ Dates & times
  ✓ Numbers & currencies
  ✓ Collation
  • Literals
  menuItem.setTitle("File")

Reroute to Service: Translated Resource Lookup
  ✓ Unicode
  ✓ Dates & times
  ✓ Numbers & currencies
  ✓ Collation
  ✓ Literals
  (“File”, German) → Resource Manager [German, Chinese, French, …] → “Datei”

Services
  • Charset Conversions
  • Formatting & Parsing
      - Date & time
      - Messages
      - Numbers & currencies
  • Translated Names
      - Languages, Regions (Countries), Scripts, Timezones, Currencies
  • Calendar, Time Zone, Date/Time conversions
  • Collation
      - Searching, Sorting, Matching
  • Segmentation
      - Word, line, …
  • Transforms
      - Normalization
      - Casing
      - Transliterations
  • Unicode Regular Expressions
  • Complex-Text Display / Input
  • …

Globalization Preferences
  Preference    Example                             Standard
  Language      en_US (or en-US)                    RFC 3066 (or successor)
  Territory     AU                                  ISO 3166
  Currency      EUR                                 ISO 4217
  Timezone      Australia/Melbourne                 TZDB
  Calendar      islamic-civil                       CLDR Calendar ID
  Custom Date   yyyy-MMM-dd                         CLDR Pattern Format
  VAT           08.23% (books), 15.73% (food)       App/Country-Specific
  …             …                                   …
  Exact composition depends on system requirements!

Incremental System Migration
  • Large system: change components incrementally
  • Adapters between modified and original components
  • Unicode bus between modified components
  [Diagram labels: Adapter, Unicode bus]

Code Page Adapter
  • Unicode ⊃ Code Page
  • Characters missing in code page:
      - Escape (e.g., XML/HTML: &#x20AC;) or
      - Error (if handshake possible) or
      - Downgrade (replacement character)
  [Diagram labels: Unicode, Conversion, Code Page]
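
The "escape" option for characters missing from the target code page might look like the following sketch. It uses the JDK's `CharsetEncoder.canEncode` to detect unmappable characters; the class and method names are hypothetical, and a real adapter would also need to agree with the receiver that numeric character references are understood.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class CodePageAdapter {
    // Replaces every code point the target charset cannot represent with an
    // XML/HTML numeric character reference such as &#x20AC;.
    static String escapeUnmappable(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append(String.format("&#x%X;", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The euro sign (U+20AC) is not part of ISO-8859-1, so it gets escaped.
        System.out.println(escapeUnmappable("5 \u20AC", StandardCharsets.ISO_8859_1));
    }
}
```

The other two options on the slide correspond to raising an error instead of escaping, or substituting a replacement character and accepting the data loss.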

Neutral Data Formats
  • Do not use localized formats for internal data
  • E.g. monetary value
      - $123.4 → USA? Australia? Zimbabwe?
      - Interchange complete data: include currency code
      - Use e.g. <1.234×10², USD>
  • Neutral formats
      - Faster processing
      - Unambiguous
      - Convert (format/parse) at user interface boundaries
      - en_US: $123.40   en_AU: US$123.4   hi_IN: $१२३.४०

Unicode Overview
  • Unicode Text Encodings
  • Unicode Gives Characters Meaning and Behavior
      - Data
      - Algorithms
  • Case Mapping
  • Forms of Text
  • Right-To-Left and Bi-Directional Text
  • Sorting, Searching, Matching
  • Security
  • Common Locale Data Repository

Unicode Text Encodings
  • UTF-16
      - In-memory strings, best for processing
      - Java, .Net, Windows, Mac OS X, JavaScript, inside browsers, …
      - String aa = "a\u00E4";
  • UTF-8
      - Storage & protocols
      - .txt, .html, .xml, …

Unicode Text Encoding Examples
  Character  Code Point  UTF-16     UTF-8
  a          U+0061      0061       61
  ä          U+00E4      00E4       C3 A4
  σ          U+03C3      03C3       CF 83
  א          U+05D0      05D0       D7 90
  ٣          U+0663      0663       D9 A3
  カ         U+30AB      30AB       E3 82 AB
  退         U+9000      9000       E9 80 80
  𡯁         U+21BC1     D846 DFC1  F0 A1 AF 81
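
The byte sequences for these encoding forms can be reproduced with a few lines of Java. This is a small sketch using only `String.getBytes`; the class name `EncodingForms` is an assumption, and UTF-16 is shown in big-endian byte order so that the bytes read the same as the 16-bit code units.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingForms {
    // Returns the encoded bytes as space-separated uppercase hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", b));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = {0x61, 0xE4, 0x3C3, 0x5D0, 0x663, 0x30AB, 0x9000, 0x21BC1};
        for (int cp : codePoints) {
            String s = new String(Character.toChars(cp));
            System.out.printf("U+%04X  UTF-16: %-12s UTF-8: %s%n",
                    cp, hex(s, StandardCharsets.UTF_16BE), hex(s, StandardCharsets.UTF_8));
        }
    }
}
```

The last code point lies outside the Basic Multilingual Plane, so it takes two UTF-16 code units (a surrogate pair) and four UTF-8 bytes.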

Unicode Gives Characters Meaning and Behavior: Data
  • Ideographic: 不 与
  • Alphabetic: a ξ A Ξ
  • Uppercase: A Ξ
  • Quotation_Mark: " ' « » ‘ ’ 『 』
  • Numeric_Value: ੫ → 5, ٣ → 3, ৪ → 4

Unicode Gives Characters Meaning and Behavior: Algorithms
  • Case mapping
  • Case folding & case-insensitive comparison
  • Collation
  • Bidi
  • Normalization
  • Line breaking
  • …

Case Mapping
  • dz ↔ Dz ↔ DZ
  • Heiß → HEISS → heiss
  • όσος ↔ ΌΣΟΣ
  • topkapı istanbul ↔ TOPKAPI İSTANBUL (tr)
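
Two of these mappings are easy to reproduce with the JDK's locale-aware `String.toUpperCase` and `String.toLowerCase`, as the sketch below shows. The class name `CaseMappingDemo` is hypothetical; the point is that case mapping can change string length and depends on the language.

```java
import java.util.Locale;

final class CaseMappingDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String upper(String text, Locale locale) {
        return text.toUpperCase(locale);
    }

    static String lower(String text, Locale locale) {
        return text.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // One character becomes two: sharp s uppercases to "SS".
        System.out.println(upper("Hei\u00DF", Locale.GERMAN));
        // Turkish pairs dotted i with dotted capital I (U+0130) ...
        System.out.println(upper("istanbul", TURKISH));
        // ... and dotless capital I with dotless i (U+0131).
        System.out.println(lower("TOPKAPI", TURKISH));
        // The same input under language-neutral rules gives a different result.
        System.out.println(upper("istanbul", Locale.ROOT));
    }
}
```

Code that uppercases identifiers or protocol keywords should therefore pass an explicit, language-neutral locale rather than rely on the user's default.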

Forms of Text
  ä (U+00E4) = a + ¨ (U+0061 + U+0308)
  • Equivalent text, equivalent behavior
  • Same display (for supported repertoire)
  • Normalization generates unique forms
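
The equivalence on this slide can be checked with the JDK's `java.text.Normalizer`. The sketch below is illustrative only; `NormalizationDemo` and `canonicallyEquivalent` are hypothetical names.

```java
import java.text.Normalizer;

final class NormalizationDemo {
    // Two strings are canonically equivalent if they normalize to the same form.
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E4";  // single code point U+00E4
        String decomposed = "a\u0308";  // U+0061 followed by combining diaeresis U+0308

        System.out.println(precomposed.equals(decomposed));                // binary comparison
        System.out.println(canonicallyEquivalent(precomposed, decomposed)); // after normalization
    }
}
```

Binary comparison treats the two spellings as different strings, while comparison after normalization treats them as the same text.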

Right-To-Left and Bi-Directional Text
  [Sample: Arabic text with embedded Latin names (IBM, Apple, Hewlett-Packard, Microsoft, Oracle, Sun) and the number 10646 in both Arabic-Indic and European digits]
  • Text stored in logical order: no special consideration for processing, only for UI and for legacy encoding conversion
  • RTL text (mostly Arabic and Hebrew) flows from right to left
  • Embedded numbers and LTR text flow left to right
  • Line break preserves reading order
  • Selection: contiguous text ≠ contiguous display

Sorting, Searching, Matching
  • Binary order: A < C < Z < a < c < z < Ç
      - Code point order (same as UTF-8 binary comparison)
      - UTF-16 order (Java String binary comparison)
      - Refinements, usually only for matching, not sorting
          · Case-insensitive
          · Matching equivalent forms of text
  • Language-sensitive collation: a < A < c < C < Ç < z < Z
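
The two binary orders differ only when supplementary characters are involved. The following sketch shows one such case; `BinaryOrders` and its methods are hypothetical names, and the code point comparison is a straightforward loop rather than an optimized implementation.

```java
final class BinaryOrders {
    // UTF-16 order: what String.compareTo does, comparing 16-bit code units.
    static int compareUtf16(String a, String b) {
        return a.compareTo(b);
    }

    // Code point order: compares whole code points, which matches the result
    // of comparing the UTF-8 encodings byte by byte.
    static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(j);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            i += Character.charCount(cpA);
            j += Character.charCount(cpB);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String bmp = "\uFF61";                                   // U+FF61, one code unit
        String supplementary = new String(Character.toChars(0x10000)); // surrogate pair D800 DC00
        System.out.println(compareCodePoints(bmp, supplementary));
        System.out.println(compareUtf16(bmp, supplementary));
    }
}
```

U+FF61 is the smaller code point, but its single code unit FF61 is greater than the lead surrogate D800, so the two orders disagree. This matters when sorted data is exchanged between a UTF-8 store and UTF-16 application code.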

Collation: UCA + Language Tailorings
  • Context-sensitive, language-sensitive
      - china < China < chinas
      - æ ≅ a+e
      - c < d < ... k < ch < l
      - Adding/removing a trailing character can change sorting considerably
  • String → sequence of weights; not reversible
  • Attributes: lowercase first, ignore case or punctuation, …
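
Collation attributes and sort keys can be tried out with the JDK's `Collator`, which is used here as a stand-in; ICU's collator offers more attributes than this sketch shows. The class name `CollationAttributes` is an assumption.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

final class CollationAttributes {
    // Strength is one attribute: PRIMARY ignores case and accent differences,
    // TERTIARY takes them into account.
    static Collator collator(int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator;
    }

    // A sort key is the string's sequence of weights. Keys compare bytewise,
    // which is useful when the same strings are compared many times.
    static int compareBySortKey(Collator collator, String a, String b) {
        CollationKey keyA = collator.getCollationKey(a);
        CollationKey keyB = collator.getCollationKey(b);
        return keyA.compareTo(keyB);
    }

    public static void main(String[] args) {
        System.out.println(collator(Collator.PRIMARY).compare("china", "China"));
        System.out.println(collator(Collator.TERTIARY).compare("china", "China"));
        System.out.println(compareBySortKey(collator(Collator.TERTIARY), "china", "chinas"));
    }
}
```

The key is derived from the string but the string cannot be recovered from it, which is what "not reversible" refers to.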

Security: Spoofing with Look-Alikes
  Olive – 01ive     ICU – 1CU     Ham – Harn     Paypal – Paypаl
  • Not new with Unicode, but more opportunities due to more characters
  • UTR #36: Unicode Security Considerations
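
One simple heuristic against the last kind of spoof is to flag identifiers that mix scripts. The sketch below uses the JDK's `Character.UnicodeScript`; it is a simplification for illustration and not the detection mechanism specified by the Unicode security reports. The class name is hypothetical.

```java
import java.util.EnumSet;
import java.util.Set;

final class MixedScriptCheck {
    // Collects the scripts used in the text, ignoring COMMON and INHERITED
    // (digits, punctuation, combining marks), which occur with any script.
    static Set<Character.UnicodeScript> scripts(String text) {
        Set<Character.UnicodeScript> found = EnumSet.noneOf(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED) {
                found.add(script);
            }
        });
        return found;
    }

    static boolean isMixedScript(String text) {
        return scripts(text).size() > 1;
    }

    public static void main(String[] args) {
        System.out.println(scripts("Paypal"));
        // Same appearance, but the second "a" is Cyrillic U+0430.
        System.out.println(scripts("Payp\u0430l"));
    }
}
```

This check does nothing for same-script look-alikes such as "01ive" or "Harn", which need confusable-character data instead.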

Common Locale Data Repository (CLDR)
  • Industry standard for locale data
  • Adoption brings consistency across industry
  • Display names for languages, countries, currencies, etc.
  • Date/time/number formats and data for parsing
  • Language tailorings for collation and text segmentation

Globalization Service Libraries
  • On Windows only, use Win32 or .Net APIs
  • In Java, use ICU4J
  • Other platforms/cross-platform in C/C++, use ICU4C
  • Other programming languages have wrappers for ICU or are planning to integrate ICU, e.g., PHP, Python

What is ICU?
  • International Components for Unicode
  • Globalization / Unicode / Locales
  • Mature, widely used set of C/C++ and Java libraries
      - Basis for Java 1.1 internationalization, but goes far beyond Java 1.1
  • Very portable: identical results on all platforms / programming languages
      - C/C++: 30+ platforms/compilers
      - Java: IBM & Sun JDK
      - You can use: C/C++ (ICU4C), Java (ICU4J), C/C++ with Java (ICU4JNI)
  • Full threading model
  • Customizable
  • Modular
  • Open source, but non-restrictive

Who uses ICU?
  • Products within IBM
      - All 5 major software brands
      - Many other related software applications
      - Used on all IBM operating systems
  • Other companies and organizations
      - Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, Caris, CERN, Cognos, Debian Linux, Gentoo Linux, HP, Home Depot, Inktomi, JD Edwards, Macromedia, Mathworks, MKS, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE Linux, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping LLC., Xerox, Yahoo!, ... and many more

ICU Features
  • Unicode text handling
  • Charset conversions (700+)
  • Collation & Searching
  • Locales from CLDR (250+)
  • Resource Bundles
  • Calendar & Time zones
  • Complex-text layout engine
  • Unicode Regular Expressions
  • Breaks: word, line, …
  • Formatting
      - Date & time
      - Messages
      - Numbers & currencies
  • Transforms
      - Normalization
      - Casing
      - Transliterations

Architecture Overview 1
  • Locale-based services
      - Locale is an identifier, not a container
      - Keywords for variants: de@collation=phonebook
  • Resource inheritance: shared resources
      root
        Language:  en        de        zh
        Script:                        Hant    Hans
        Region:    US, IE    DE, CH    TW      CN      (also zh: CN, TW)
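
Resource inheritance means a lookup walks from the most specific locale toward the root until it finds the resource. The JDK exposes its own version of that chain through `ResourceBundle.Control`, shown below as an analogy; ICU's chain and its handling of scripts differ in detail. The class name `FallbackChain` and the bundle base name "messages" are placeholders.

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

final class FallbackChain {
    // Returns the locales the JDK would consult, most specific first,
    // when loading a bundle for the requested locale.
    static List<Locale> candidates(Locale locale) {
        return ResourceBundle.Control
                .getControl(ResourceBundle.Control.FORMAT_DEFAULT)
                .getCandidateLocales("messages", locale);
    }

    public static void main(String[] args) {
        // The root locale prints as an empty string at the end of each list.
        System.out.println(candidates(Locale.forLanguageTag("de-CH")));
        System.out.println(candidates(Locale.forLanguageTag("en-IE")));
    }
}
```

Because the locale is only an identifier, a region-specific bundle needs to contain just the entries that differ from its parent.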

Architecture Overview 2
  • Open and close service model
      - Open a service object, use it many times, close it when done
      - Better performance by avoiding setup costs per operation
  • ICU threading model
      - Multiple service objects in use simultaneously with same or different attributes
      - Large resources shared in read-only cache
      - Compatible with Java threading model

Architecture Overview 3
  • Data-driven services
      - Customize at build-time or run-time
      - Interchange with other platforms; same results on each
      - Rule-based
          · Collation, word-breaks, transforms
      - Pattern-based
          · Date/time/number/message formatting
      - Table-based
          · Character conversion

Architecture Overview: ICU4J
  • Supplement for Java
  • Core globalization (no character conversion or regular expressions)
      - We do supply complex text support for Sun
  • Modularized: products may add just needed functionality
  • Usually drop-in replacement for JDK functionality
      - Changing the import statements is usually all that is needed

Character Set Conversion
  • Precise alias information:
      - When you ask for “Shift-JIS”, you can request the precise definition by platform (e.g. Windows, IBM, Java, …)
  • Runtime customizations allowed for:
      - Illegal sequences
      - Undefined characters
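
The idea of choosing at runtime how a converter treats bad input also exists in the JDK, where a `CharsetDecoder` can be told to report or replace. The sketch below uses that API as an analogy for the ICU converter callbacks the slide refers to; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ConverterPolicies {
    private static String decode(byte[] bytes, CodingErrorAction action)
            throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)        // illegal byte sequences
                .onUnmappableCharacter(action)   // undefined characters
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Lenient policy: bad input becomes U+FFFD and conversion continues.
    static String decodeLenient(byte[] bytes) {
        try {
            return decode(bytes, CodingErrorAction.REPLACE);
        } catch (CharacterCodingException e) {
            throw new IllegalStateException("REPLACE should not report errors", e);
        }
    }

    // Strict policy: any bad input makes the whole conversion fail.
    static boolean isWellFormed(byte[] bytes) {
        try {
            decode(bytes, CodingErrorAction.REPORT);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bad = {0x41, (byte) 0xFF, 0x42}; // 0xFF never occurs in UTF-8
        System.out.println(isWellFormed(bad));
        System.out.println(decodeLenient(bad));
    }
}
```

Strict handling suits protocol boundaries where corruption must be detected; lenient handling suits display of text from unknown sources.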

Collation: Sorting, Searching and Matching
  • Fast international comparison for string search; fully UCA compliant
      - Compressed sort keys, optimized string comparison, sublinear string search
      - Incremental sort keys used for radix sorting
  • Precise binary sort key stability over time (library versioning)

Calendar & Time Zones
  • International calendars: Islamic, Buddhist, Hebrew, Japanese
      - Required for correct presentation of dates in some countries
  • Olson timezone support with localizations
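
To show what calendar conversion involves, the sketch below converts one ISO date into three other calendar systems using `java.time.chrono`, which postdates this deck and is used here only as a readily available stand-in for ICU's calendar classes. The class name `CalendarDemo` is an assumption, and the JDK has no built-in Hebrew calendar.

```java
import java.time.LocalDate;
import java.time.chrono.HijrahDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

final class CalendarDemo {
    static int buddhistYear(LocalDate date) {
        return ThaiBuddhistDate.from(date).get(ChronoField.YEAR_OF_ERA);
    }

    // Japanese dates count years within an imperial era.
    static int japaneseYearOfEra(LocalDate date) {
        return JapaneseDate.from(date).get(ChronoField.YEAR_OF_ERA);
    }

    // The JDK's Hijrah chronology uses the Umm al-Qura variant.
    static int islamicYear(LocalDate date) {
        return HijrahDate.from(date).get(ChronoField.YEAR);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2005, 12, 14);
        System.out.println("Buddhist year: " + buddhistYear(date));
        System.out.println("Japanese era:  " + JapaneseDate.from(date).getEra()
                + ", year " + japaneseYearOfEra(date));
        System.out.println("Islamic year:  " + islamicYear(date));
    }
}
```

The underlying instant is the same in every case; only the field values presented to the user change.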

Unicode Regular Expressions
  • Full regex implementation
      - C/C++ only: Java 1.4 has its own package (though not as powerful)
  • All Unicode 4.1 properties
      - Supported through UnicodeSet
  • Good performance
      - Competitive with non-Unicode regex

References
  • Unicode: http://www.unicode.org/
  • IBM software globalization: http://ibm.com/software/globalization
  • ICU docs & papers: http://icu.sourceforge.net/docs/
  • ICU: http://ibm.com/software/globalization/icu
  • ICU (IBM intranet): http://icu.sanjose.ibm.com/

Q & A

Backup Slides

Thought Experiment: Alternative to Unicode
  • Could have tagged pieces of text with code pages
  • À la ISO 2022
  • Like tagging each integer value with whether it is encoded with 1’s complement or 2’s complement
  • Too hard to use, too many problems
  • Instead: one single encoding for all languages

Architecture Overview: ICU4C
  • Simple error handling
      - Thread safe
      - Works in C and C++
  • C/C++ subset for portability
  • Version management
      - Multiple versions of ICU4C in the same process memory space
      - Data and library versioning
  • String buffer management
      - Preflighting and overflow protection
  • Flexible
      - Allows loading and unloading ICU4C libraries
      - Runtime settable memory allocation and mutex functions

ICU4J: Supplement for Java
  • CLDR (Common Locale Data Repository)
      - More fully supported locales than Java
  • Up-to-date globalization: standards-compliant; latest Unicode
      - Supplementary characters (GB 18030, JIS X 0213, HKSCS)
          · Java 5 adds handling of supplementary characters
      - Full properties; JDK has only a fraction
      - Unicode Collation Algorithm
      - Local calendars (Islamic, Japanese, …); more time zone localizations
      - Currencies, String Search, Internationalized Domain Names
      - Transforms: Case, Scripts, Normalization
  • Much shorter release cycle and quicker support for the Unicode standard

Unicode Text Handling 2
  • All Unicode 4.1 properties
      - Direct API
          · Values, names, enumerations
      - UnicodeSet
          · Fast, compact set operations (union, intersection, …)
          · Pattern-based (both Perl & POSIX syntax for properties): \p{greek} vs. [:greek:]
          · All properties: [\p{lowercase}-[a-z]], [\p{greek} & \p{uppercase}]
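
Property-based sets like the ones on this slide have a close counterpart in `java.util.regex` character classes. The sketch below is an approximation in JDK syntax, not ICU `UnicodeSet` syntax: intersection is written `&&`, and set difference is expressed as intersection with a negated class. The class and constant names are hypothetical.

```java
import java.util.regex.Pattern;

final class PropertySets {
    // Roughly [\p{greek} & \p{uppercase}]: Greek script AND uppercase letter.
    static final Pattern GREEK_UPPERCASE =
            Pattern.compile("[\\p{IsGreek}&&[\\p{Lu}]]");

    // Roughly [\p{lowercase}-[a-z]]: lowercase letters except ASCII a-z.
    static final Pattern LOWERCASE_NOT_ASCII =
            Pattern.compile("[\\p{Ll}&&[^a-z]]");

    // True if the whole string is a single character in the set.
    static boolean contains(Pattern set, String character) {
        return set.matcher(character).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(GREEK_UPPERCASE, "\u039E"));     // GREEK CAPITAL LETTER XI
        System.out.println(contains(GREEK_UPPERCASE, "\u03BE"));     // GREEK SMALL LETTER XI
        System.out.println(contains(LOWERCASE_NOT_ASCII, "\u00E4")); // a with diaeresis
        System.out.println(contains(LOWERCASE_NOT_ASCII, "a"));
    }
}
```

Defining sets by property keeps them correct as new characters are added to Unicode, which a hand-written range such as [A-Za-z] cannot do.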

Formatting
  • Date & time: 8 formats per locale by default
  • Messages
      - Completely localizable, plural support
  • Numbers & currencies
      - Scientific notation, spelled-out (checks, etc.)
      - Full orthogonal currency support
          · INR in Hindi: र१,२३४.५७
          · INR in English: Rs. 1,234.57
          · INR in German: Rs. 1.234,57
  • Recent additions
      - List available currencies API
      - Short and stand-alone month/day names
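
Localizable messages keep the whole sentence, including the number-dependent wording, in one translatable pattern. The sketch below uses the JDK's `MessageFormat` with a choice sub-format as a simplified stand-in; ICU's plural support is rule-based and covers languages with more than two plural forms, which this example does not. The class name and message text are made up for illustration.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class PluralMessages {
    // In a real product this pattern would come from a translated resource,
    // so translators can reorder words and adjust the wording per count.
    private static final String PATTERN =
            "{0,choice,0#no files|1#one file|1<{0,number,integer} files} found";

    static String filesFound(int count) {
        return new MessageFormat(PATTERN, Locale.ENGLISH).format(new Object[] {count});
    }

    public static void main(String[] args) {
        for (int count : new int[] {0, 1, 5, 1234}) {
            System.out.println(filesFound(count));
        }
    }
}
```

Building the same sentence by concatenating fragments in code would fix the English word order and make correct translation impossible for many languages.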

Transforms
  • Unicode normalization
      - Highly optimized for performance
      - Performance utilities: concatenation, detection, comparison
  • Casing (upper, lower, title, folding)
  • General transforms
      - Script transliterations
      - Half-width/full-width, hex, etc.
      - Chain transforms together, filter source characters
      - Rule-based, customizable at runtime
  • String Prep: NFS, Internationalized Domain Names (IDN)

Segmentation: word, line & sentence
  • Fast state-table implementation
  • Customizable
      - Rule-based, customizable at runtime
      - Special customizations, e.g. Thai
  • Recent additions:
      - Uses new UText API
          · Discontinuous text
          · Buffering
          · Usable with UTF-8, UTF-16 or UTF-32
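
Word segmentation through a break iterator can be sketched with the JDK's `java.text.BreakIterator`, whose design ICU shares. The class name `WordSegmenter` is hypothetical, and filtering segments by their first character is a simplification chosen for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordSegmenter {
    // Walks the word boundaries and keeps only segments that start with a
    // letter or digit, dropping whitespace and punctuation segments.
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> words = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH));
    }
}
```

Splitting on spaces would fail for languages such as Thai that do not separate words with spaces, which is why the slide calls out special customizations.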