- Number of slides: 74
Text Analysis Khurshid Ahmad, Professor of Computer Science Department of Computer Science Trinity College Dublin-2, IRELAND
PREAMBLE: MODELS FOR TEXT TECHNOLOGY?
• Intuitive Models: distribution of grammatical categories; constituency; governance; synchronic studies.
• Psychological/Observational Models: distribution of conceptual categories; acquisition; degeneration; diachronic studies.
• Empirical Models: distribution of linguistic patterns (words, phrases, sentences); collocation; semantic prosody; synchronic/diachronic studies.
CORPUS LINGUISTICS The aim of corpus linguistics is ‘to base accounts of language on corpora derived from systematic recordings of conversations and real discourse of other kinds, as opposed to examples obtained by introspection, by judgement of grammarians, or by haphazard observation’; and a corpus is defined ‘as any systematic collection of speech or writing in a language or variety of a language’ (Matthews 1997: 78). Matthews, P. H. (1997). Oxford Concise Dictionary of Linguistics. Oxford & New York: Oxford University Press.
Representative Corpora: The BNC The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century. The BNC is: Monolingual; Synchronic; General; Sample. http://www.natcorp.ox.ac.uk/corpus/index.xml
Representative Corpora: The BNC Characteristic
The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text.
The British National Corpus:
• Monolingual: DEALS with modern British English, not other languages used in Britain. However, non-British English and foreign language words do occur in the corpus.
• Synchronic: COVERS British English of the late twentieth century, rather than the historical development which produced it.
• General: INCLUDES many different styles and varieties, and is not limited to any particular subject field, genre or register. In particular, it contains examples of both spoken and written language.
• Sample: COMPRISES written sources: (a) samples of 45,000 words are taken from various parts of single-author texts; (b) shorter texts up to a maximum of 45,000 words, or multi-author texts such as magazines and newspapers, are included in full. Sampling allows for a wider coverage of texts within the 100 million limit, and avoids overrepresenting idiosyncratic texts.
http://www.natcorp.ox.ac.uk/corpus/index.xml
Representative Corpora: The BNC. Distribution of the first 100 most frequent tokens in the BNC according to the cumulative frequency of ten tokens at a time.
Token    Cumulative Relative Frequency    No. of OCW
the, of, and, a, in, to, it, is, was, to    21.28%    0
i, for, you, he, be, with, on, that, by, at    6.66%    0
are, not, this, but, 's, they, his, from, had, she    4.35%    0
which, or, we, an, n't, 's, were, that, been, have    3.25%    0
their, has, would, what, will, there, if, can, all, her    2.42%    0
as, who, have, do, that, one, said, them, some, could    1.90%    0
him, into, its, then, two, when, up, time, my, out    1.57%    1
so, did, about, your, now, me, no, more, other, just    1.37%    0
these, also, people, any, first, only, new, may, very, should    1.18%    1
as, like, her, than, as, how, well, way, our, as    1.02%    0
TOTAL    45.00%    2
http://www.natcorp.ox.ac.uk/corpus/index.xml
Representative Corpora: The BNC. Distribution of the first 200 most frequent tokens in the BNC (second hundred shown, ranks 101 to 200) according to the cumulative frequency of ten tokens at a time.
Token    Cumulative Relative Frequency    No. of OCW
between, years, er, many, those, there, 've, being, because, do    0.88%    1
're, yeah, three, down, such, back, good, where, year, through    0.77%    4
'll, must, still, even, know, too, here, get, own, does    0.70%    1
oh, last, no, more, 'm, going, so, erm, after, us    0.65%    1
government, might, same, much, see, yes, go, make, day, man    0.60%    4
another, world, see, got, work, however, life, against, think    0.57%    4
never, under, one, most, old, over, know, something, mr, take    0.54%    2
why, each, while, part, on, number, out_of, made, different, really    0.49%    3
went, ', came, after, children, always, four, without, one, within    0.46%    3
system, local, during, most, although, next, small, case, great, things    0.43%    6
TOTAL    6.09%    29
http://www.natcorp.ox.ac.uk/corpus/index.xml
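The BNC tables group the ranked tokens ten at a time and report each group's share of the whole corpus. A minimal Python sketch of that tabulation follows; the function name `decile_table` and its parameters are illustrative assumptions, not part of the BNC tooling, and the input is assumed to be an already tokenised list of words.

```python
from collections import Counter

def decile_table(tokens, group_size=10, top_n=100):
    """Group the top_n most frequent tokens and give each group's corpus share.

    Hypothetical helper: returns a list of (tokens_in_group, share) pairs,
    where share is the group's summed frequency divided by the corpus size.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    ranked = counts.most_common(top_n)  # sorted by descending frequency
    rows = []
    for start in range(0, len(ranked), group_size):
        group = ranked[start:start + group_size]
        share = sum(count for _, count in group) / total
        rows.append(([word for word, _ in group], share))
    return rows

if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the rug".split()
    for words, share in decile_table(sample, group_size=3, top_n=6):
        print(f"{', '.join(words)}    {share:.2%}")
```

Counting open class words per group would need a part-of-speech lexicon or a stop list, which this sketch leaves out.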
LANGUAGE AS A SYSTEM: INPUTS & OUTPUTS Interpretament Texts are responses to previous texts, and the texts are then responded to in turn; the cycle continues, hence the diachronic dimension.
LANGUAGE AS A SYSTEM The moonlighting terms: Lexicogenesis?
Year    The orthodoxy
1477    An atom is a hypothetical body, so small as to be incapable of further division; and thus to be one of the ultimate particles of nature.
1650    Physical Atoms: The supposed ultimate particles in which matter actually exists (without reference to its stability).
1819    Chemical Atoms: The smallest particles in which the elements combine, or are known to possess the properties of a particular element.
NYE, M. J. (1986). The Question of the Atom: From the Karlsruhe Congress to the 1st Solvay Congress. A compilation of primary sources. Los Angeles: Tomash Publishers.
LANGUAGE AS A SYSTEM The moonlighting terms: Lexicogenesis?
Year    The orthodoxy
1899    Thomson's atomic structure based on Aepinus's one-fluid theory of electricity
1904    Nagaoka's 'Saturnian' atom reinterpreting Maxwell’s observations about the planet
1906    Rayleigh’s infinite electron atom: an elaboration of Thomson’s atomic structure
1909    Rutherford's ‘nucleus’ theory: an experimentalist interpreting Nagaoka’s and Crookes’ observations
1913    Bohr's theory of atomic structure: the one great fan of Rutherford’s scattering experiments
Conn, G. K. T. and Turner, H. D. (1965). The Evolution of the Nuclear Atom. London: Iliffe Books Ltd, New York: American Elsevier Pub. Co.
LANGUAGE AS A SYSTEM: INPUTS & OUTPUTS Languages are constantly in flux. The corpus linguist explores the discourse as a system that can be explained without referring to a discourse-external reality or to the mental state of the members of the discourse community. Teubert, Wolfgang (2003). Writing, hermeneutics and corpus linguistics. Logos and Language Vol. IV (no. 2), pp. 1-17.
LANGUAGE AS A SYSTEM: INPUTS & OUTPUTS Interpretament
Where will you find the evidence of use, definition, and elaboration of terms like:
• inclusive learning environment (e-Learning)
• Borromean Halo Nuclei (Radioactive Nuclear Beam Physics)
• honeycombed catalytic converter (Automotive Engineering)
• individualist weak supervenience (Philosophy of Science)
• indoor blood videotaping (Forensic Science)
EXCEPT IN A TEXT CORPUS?
Language as a System: The moonlighting terms: Lexicogenesis?
Term/‘Concept’    The old ‘truth’    The new ‘truth’
Motion: Objects move because of    an in-built tendency to move (Aristotle)    something exerts 'attraction' (Galileo)
Solar Cycle: Sunrise is caused by    a rising Sun (Brahe)    a turning earth (Kepler)
Combustion: The burning of an object means    that the mass of the object decreases by losing phlogiston to air (Priestley)    that the mass of the object increases by gaining oxygen from air (Lavoisier)
Heartbeat: Blood circulation is caused by    an explosion during diastole of the heart (Descartes)    a compression during systole of the heart (Harvey)
Species: The distinction between species is    an absolute phenomenon that has been determined in the past (Linnaeus)    a contemporaneous phenomenon with borders between the species (Darwin)
Verschuuren, G. M. N. (1986). Investigating the Life Sciences: An Introduction to the Philosophy of Science. Oxford: Pergamon Press.
LANGUAGE & CHANGE DEVELOPMENT OF CONCEPTS: ATOM
I. In philosophical and scientific use. In senses 2 and 3 now generally held to consist of a positively charged nucleus, in which is concentrated most of the mass of the atom, and round which orbit negatively charged electrons.
1. A hypothetical body, so infinitely small as to be incapable of further division; and thus held to be one of the ultimate particles of matter, by the concourse of which, according to Leucippus and Democritus, the universe was formed.
2. In Nat. Phil. physical atoms: the supposed ultimate particles in which matter actually exists (without reference to their divisibility or the contrary), aggregates of which held in their places by molecular forces, constitute all material bodies.
3. chemical atoms:
a. The smallest particles in which the elements combine either with themselves, or with each other, and thus the smallest quantity of matter known to possess the properties of a particular element.
b. The smallest quantity in which a group of elements, called a radical, forms a compound corresponding to one formed by a simple element, or behaves like an element; thus the smallest known quantity of a chemical compound.
II. In popular use.
4. From sense 1, as the nearest popular conception to the atoms of the philosophers: One of the particles of dust which are rendered visible by light; a mote in the sunbeam. arch. or Obs.
1784 COWPER Task I. 361 The rustling straw sends up a frequent mist of atoms.
1821 BYRON Two Foscari III. i, Moted rays of light Peopled with dusty atoms.
5. The smallest conceivable portion or fragment of anything; a very minute portion or quantity, a particle, a jot:
a. of matter.
c 1630 DRUMMOND OF HAWTHORNDEN Poems (1633) 166 Like tinder when flints atoms on it fall.
1644 DIGBY Nat. Bodies vi. (1658) 54 Little attoms of oyl.. ascend apace up the week of a burning candle.
1835 SIR J. ROSS N.-W. Pass. xxxiv. 477 There was not an atom of water.
b. of things immaterial.
logical atom: one of the essential and indivisible elements into which some philosophers hold that statements can be analysed.
1873 C. S. PEIRCE in Mem. Amer. Acad. Arts & Sci. IX. II. 343 The logical atom, or term not capable of logical division, must be one of which every predicate may be universally affirmed or denied...
1918 [see ATOMISM 1 b].
1958 G. J. WARNOCK Eng. Philos. since 1900 v. 54 Russell's world of indefinitely numerous, independent logical atoms is the metaphysical opposite of Bradley's Absolute.
Entry printed from Oxford English Dictionary Online © Oxford University Press 2001
LANGUAGE & CHANGE DEVELOPMENT OF CONCEPTS: NUCLEUS Pl. nuclei and nucleuses. [a. L. nucleus (nuculeus) kernel, inner part, f. nucula or nuc-, nux nut. So F. nucleus, It., Sp., and Pg. nucleo.] I. 1. Astr. a. The more condensed portion of the head of a comet. b. A more condensed, usu. brighter, central part of a galaxy or nebula. 2. A supposed interior crust of the earth. Obs. 3. A central part or thing around which other parts or things are grouped, collected, or compacted; that which forms the centre or kernel of some aggregate or mass. a. Of material (esp. more or less solid) things. b. Of communities or groups of persons. c. Of immaterial things. d. Of places, buildings, etc. e. Of collections of things. 4. Archæol. A block of flint or other stone from which early implements have been made. Entry printed from Oxford English Dictionary Online © Oxford University Press 2001
LANGUAGE & CHANGE DEVELOPMENT OF CONCEPTS: NUCLEUS II. 5. Botany a. The kernel of a nut. Now rare or Obs. b. The kernel of a seed (see quots.). c. The central part of an ovule. d. In Lichens: (see quot. 1832). e. In Fungi: (see quots.). f. The hilum of a starch-granule.
LANGUAGE & CHANGE DEVELOPMENT OF CONCEPTS: NUCLEUS II. 6. a. The rudiments of the shell in certain molluscs. b. Any discrete mass of grey matter in the central nervous system. The term is used in numerous English and mod. L. combs. distinguishing the various different nuclei. 7. Biol. A cell organelle present in most of the cells of all organisms except the most primitive, usu. as a single subspherical structure, and consisting (except when undergoing division) of a membrane enclosing a ground substance (the nuclear sap) in which lie the chromosomes, one or more nucleoli, etc., and functioning as the repository of genetic information and as the director of metabolic and synthetic activity of the cell.
LANGUAGE & CHANGE DEVELOPMENT OF CONCEPTS: NUCLEUS II. 8. Chem. An arrangement of atoms, esp. a ring structure, characteristic of a number of organic compounds. 9. A particle on which crystals, droplets, or bubbles can form in a fluid. 10. A small group of bees, including a queen, used as the foundation of a new colony.
KNOWLEDGE & CHANGE DEVELOPMENT OF CONCEPTS: NUCLEUS II. 11. Physics. The positively charged central constituent of the atom, comprising nearly all its mass but occupying only a very small part of its volume and now known to be composed of protons and neutrons. In Rutherford's 1911 paper called merely a ‘central charge’. In the examples in the first paragraph nucleus is used for various speculative notions concerning the atom.
KNOWLEDGE & CHANGE DEVELOPMENT OF CONCEPTS: NUCLEUS II. 12. a. Phonetics. The syllable of a word (spoken in isolation) that bears the primary accent; in an utterance, the syllable or syllables given particular emphasis. b. Linguistics. The main word or words in a combination, phrase, or sentence; also = KERNEL n.1 8 b. Hence nucleus v. trans., to make into a nucleus, to concentrate. 1899 KIPLING Stalky 252 They'd withdrawn all the troops they could, but I nucleused about forty Pathans.
LEXICAL SIGNATURE?
• The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
• The same concept may be referred to by different names.
• The frequency of words in a text carries a signature: if the text is specialist, then a select few terms are used repeatedly.
• Everyday, general language texts seldom carry a signature.
TEXT, TEXTURE, TEXTUALITY Etymologically, text comes from a metaphorical use of the Latin verb texere (to weave), suggesting a sequence of sentences or utterances ‘interwoven’ structurally and semantically. A text can be regarded as a sequential collection of sentences or utterances which form a UNITY by reason of their linguistic COHESION and semantic COHERENCE. However, it is possible for a text to comprise a single sentence, e.g. a road sign.
TEXT, TEXTURE, TEXTUALITY New definitions appended from the OED 1993 text, n. 1 Add: [1. ] e. Short for TEXT-BOOK n. 2. [2. ] d. Linguistics. (A unit of) connected discourse whose function is communicative and which forms the object of analysis and description. Cf. text-frequency, linguistics
TEXT, TEXTURE, TEXTUALITY COHESION IN TEXT Cohesion refers to the means by which sentences in a text are linked to each other to form a paragraph. This linkage leads to larger units as well: paragraphs in a chapter, chapters in a book. The sentences are made to stick together. The sentences in themselves are words stuck together through the use of and, but, not and so on. Sometimes we use he, she, it, they, ... to stick sentences together. At other times we repeat words: the same word, words related to the word, sounds related to the word, substitutes for a word.
TEXT, TEXTURE, TEXTUALITY: COHESION IN TEXT
Two kinds of words which glue a text:
GRAMMATICAL WORDS
• Conjunctions: and, but, if, then
• Pronouns: he, she, it, they, them
• Prepositions: on, in, of
• Modal verbs: verbs to be
LEXICAL WORDS (Repetition)
• Nouns: names of persons, places and things
• Adjectives: words that qualify a noun
• Adverbs: modifiers of a verb
INFLECTIONS & DERIVATIONS
• Markers for plurals (car → cars)
• Nominalisation: nouns from verbs (react → reaction)
TEXT, TEXTURE, TEXTUALITY: COHESION IN TEXT
LEXICAL WORDS (Repetition)
• Simple repetition (the word form + plurals): Inflection: reactions
• Complex repetition: Derivation: reaction; reactant; motoring; crime → criminal
• Paraphrase: Genus/Species/Instance: Electrons/Protons → Particles → Building blocks of the Universe
• Compounding: {stable, unstable, trans-Uranic, halo, compound, ...} + nucleus; forensic + {analysis, laboratory, technician, science, ...}
TEXT, TEXTURE, TEXTUALITY : COHERENCE IN TEXT If a text makes sense, then we can identify it as such. Sentences may be connected together because they refer to the same person, place, event or thing. The connectivity, or sticking together, is provided by the content or the meaning. Coherence can be understood more if one looks at literary texts: often terms like plot, narrative, and narration are used to describe the unity of a given literary text.
TEXT, TEXTURE, TEXTUALITY Textuality is a term used to denote the various standards that a text – a collection of linguistic units - should have in order to be regarded as a text. There are many features: Cohesion and coherence being the most prominent. The authors and the readers typically have a plan or purpose when they respectively write and read a text: This is called intentionality. Acceptability is a standard which refers to the possible use a text may have for its readers. A text is generally expected to comprise new information (informativity). A text is typically related to other texts and the readers of a text usually expect it to be the case (intertextuality). A text is expected to have relevance to the context (situationality).
KNOWLEDGE & COMMUNICATION Communication is, broadly, the process of exchanging information or messages; human language, in speech and writing, is the most significant and most complex communication system. • A human language-based communications system is comparable to a machine-based (e.g. computer-based) communications system: in 1949, Shannon and Weaver introduced an elegant theory of communication. Messages in Shannon and Weaver’s system are transmitted as signals from transmitter or sender to receiver via the medium of speech, for example, along the channel of sound waves. • The human transmitter, however, is (usually) also the creator of the message; and what may be communicated may not only be factual, or even verbal, but also attitudinal, social or cultural information. Indeed, humans can communicate with each other when they are silent.
KNOWLEDGE & (Co-OPERATIVE) COMMUNICATION • Language can be viewed as 'a communicative process based on knowledge'. Generally, when humans use language, the producer and comprehender are processing information, making use of their knowledge of the language and of the topics of conversation. Language is a process of communication between intelligent active processors, in which both the producer and the comprehender(s) perform complex cognitive tasks.
KNOWLEDGE & (Co-OPERATIVE) COMMUNICATION • The producer has communicative goals, including effects to be achieved, information to be conveyed, and attitudes to be expressed. • The comprehender attempts to understand (the meaning of the producer's communicative goals): by reacting (verbally or non-verbally), by inferring new information, by updating existing data about processes or devices, by focusing attention on something or some of its properties, or by preparing for subsequent utterances of the producer.
KNOWLEDGE & (Co-OPERATIVE) COMMUNICATION A model of co-operative communication
An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION
A. Input corpus_GL /* a general language corpus comprising N_GL individual words */
B. Input corpus_SL /* a corpus of specialist texts comprising N_SL individual words */
C. Conduct a uni-variate analysis of the contrastive distribution of linguistic tokens in the two corpora: extract terminology, ontology
D. Conduct a multi-variate analysis of the tokens within specialist texts to find keywords by the extent to which each keyword accounts for the variance in texts
An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION
I. UNIVARIATE ANALYSIS
• Input corpus_GL /* a general language corpus comprising N_GL individual words */
• Input corpus_SL /* a corpus of specialist texts comprising N_SL individual words */
• Contrast the distribution of words in corpus_GL and corpus_SL
• Select single words based on z-scores for relative frequency and weirdness (equivalent to tf-idf)
• Find collocation patterns for selected single words
• Find hyponymic patterns using textual markers
• Construct a local grammar using collocation and hyponymic links
• Generate a recursive transition network based on local grammars
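The single-word selection step of the univariate analysis can be sketched in Python. The slides do not define weirdness, so the sketch assumes it is the ratio of a word's relative frequency in the specialist corpus to its relative frequency in the general corpus, with add-one smoothing so that words absent from the general corpus do not cause a division by zero. The function names, the smoothing and the threshold are illustrative assumptions.

```python
from collections import Counter
from statistics import mean, pstdev

def z_scores(values):
    """Standardise a {word: value} mapping to z-scores (population std dev)."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        return {word: 0.0 for word in values}
    return {word: (value - mu) / sigma for word, value in values.items()}

def candidate_terms(special_tokens, general_tokens, threshold=0.0):
    """Return words whose relative-frequency and weirdness z-scores both exceed threshold."""
    sl = Counter(special_tokens)
    gl = Counter(general_tokens)
    n_sl = len(special_tokens)
    n_gl = len(general_tokens)
    rel_freq = {w: c / n_sl for w, c in sl.items()}
    # Assumed definition: specialist relative frequency over smoothed
    # general-language relative frequency.
    weirdness = {w: rel_freq[w] / ((gl[w] + 1) / (n_gl + 1)) for w in sl}
    z_freq = z_scores(rel_freq)
    z_weird = z_scores(weirdness)
    return sorted(w for w in sl if z_freq[w] > threshold and z_weird[w] > threshold)

if __name__ == "__main__":
    special = "the nucleus of the atom the nucleus nucleus".split()
    general = "the cat sat on the mat of the house".split()
    print(candidate_terms(special, general))
```

Requiring both z-scores to clear the threshold keeps frequent grammatical words (high frequency, low weirdness) and one-off oddities (high weirdness, low frequency) out of the candidate list.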
SPECIAL LANGUAGE • The special language of focussed, single-minded pursuits: science, technology, sports, politics, philosophy, ... • A natural language privileges persons; in contrast, the “splinter of ordinary language” that we call [specialised] scientific discourse privileges a world of objects, processes, happenings, events.
SPECIAL LANGUAGE • The ‘identificatory force’ of subject position in grammar [of Indo-European languages] is reserved for speakers and their fellow creatures. • The ‘identificatory force’ of subject position in grammar of specialist discourse is reserved for objects, processes, happenings, events
GENERAL LANGUAGE • The ‘identificatory force’ of subject position in grammar [of Indo-European languages] is reserved for speakers and their fellow creatures.
The Trial, Franz Kafka (1916). Chapter One: Arrest - Conversation with Mrs. Grubach - Then Miss Bürstner
Someone must have been telling lies about Josef K., he knew he had done nothing wrong but, one morning, he was arrested. Every day at eight in the morning he was brought his breakfast by Mrs. Grubach's cook - Mrs. Grubach was his landlady - but today she didn't come. That had never happened before. K. waited a little while, looked from his pillow at the old woman who lived opposite and who was watching him with an inquisitiveness quite unusual for her, and finally, both hungry and disconcerted, rang the bell. There was immediately a knock at the door and a man entered. He had never seen the man in this house before. He was slim but firmly built, his clothes were black and close-fitting, with many folds and pockets, buckles and buttons and a belt, all of which gave the impression of being very practical but without making it very clear what they were actually for. "Who are you?" asked K., sitting half upright in his bed. The man, however, ignored the question as if his arrival simply had to be accepted, and merely replied, "You rang?" "Anna should have brought me my breakfast," said K.
http://www.gutenberg.org/dirs/etext05/ktria11.txt Translation Copyright (C) by David Wyllie. Translator contact email: dandelion@post.cz
GENERAL LANGUAGE • The ‘identificatory force’ of subject position in grammar [of Indo-European languages] is reserved for speakers and their fellow creatures.
Total word count: 718
Number of different words: 429
Complexity factor (Lexical Density): 59.7%
Readability (Gunning-Fog Index) (6 = easy, 20 = hard): 8.2
Total number of characters: 7901
Number of characters without spaces: 4048
Average Syllables per Word: 1.44
Sentence count: 82
Average sentence length (words): 17.83
Max sentence length (words): 106
http://www.gutenberg.org/dirs/etext05/ktria11.txt Translation Copyright (C) by David Wyllie. Translator contact email: dandelion@post.cz
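The "Complexity factor (Lexical Density)" reported here matches the ratio of different words to total words (429 / 718 ≈ 59.7%). A rough Python sketch of how such figures could be computed follows; the tokenisation and sentence-splitting rules are simplifying assumptions and will not reproduce the exact counts of whichever tool produced the slide.

```python
import re

def text_profile(text):
    """Return basic lexical statistics for a text (hypothetical helper).

    Words are runs of letters and apostrophes, lower-cased; sentences are
    split on ., ! and ?, so abbreviations such as "K." inflate the count.
    """
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = len(words)
    types = len(set(words))
    return {
        "tokens": tokens,
        "types": types,
        # Share of distinct word forms among all running words.
        "type_token_ratio": types / tokens if tokens else 0.0,
        "avg_sentence_length": tokens / len(sentences) if sentences else 0.0,
    }

if __name__ == "__main__":
    profile = text_profile("The man looked. The man said nothing!")
    print(profile)
```

Because the ratio falls as texts get longer, it is only comparable across samples of similar size, as with the two roughly 700-word extracts used in these slides.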
GENERAL LANGUAGE • The ‘identificatory force’ of subject position in grammar [of Indo-European languages] is reserved for speakers and their fellow creatures.
Word    Occurrences    Frequency    Rank
you    20    2.80%    1
said    14    1.90%    2
him    12    1.70%    3
what    10    1.40%    4
them    9    1.30%    5
man    9    1.30%    5
looked    7    1.00%    6
time    7    1.00%    6
room    7    1.00%    6
much    6    0.80%    7
http://onlinebooks.library.upenn.edu/webbin/gutbook/lookup?num=7849
SPECIAL LANGUAGE The ‘identificatory force’ of subject position in grammar of specialist discourse is reserved for objects, processes, happenings, events.
Total word count: 681
Number of different words: 324
Complexity factor (Lexical Density): 47.60%
Readability (Gunning-Fog Index) (6 = easy, 20 = hard): 8.4
Total number of characters: 8276
Number of characters without spaces: 4690
Average Syllables per Word: 1.73
Sentence count: 99
Average sentence length (words): 14.57
Max sentence length (words): 55
Readability (Alternative) beta (100 = easy, 20 = hard, optimal 60-70): 45.8
SPECIAL LANGUAGE The ‘identificatory force’ of subject position in grammar of specialist discourse is reserved for objects, processes, happenings, events.
Word    Occurrences    Frequency    Rank
nucleus    26    3.8%    1
number    20    2.9%    2
charge    18    2.6%    3
atom    17    2.5%    4
atomic    12    1.8%    5
scattering    12    1.8%    5
electrons    10    1.5%    6
nuclear    9    1.3%    7
elements    8    1.2%    8
particle    8    1.2%    8
Nuclear Constitution of Atoms. Bakerian Lecture by SIR E. RUTHERFORD, F.R.S. (Cavendish Professor of Experimental Physics, University of Cambridge). The Proceedings of the Royal Society, A, 97, 1920, pp. 374-400.
Terminology, Ontology and Semantics: Theories and Things Ontology and Metaphysics Mäki, Uskali (Ed.) (2001). The Economic World View: Studies in the Ontology of Economics. Cambridge: Cambridge University Press.
Terminology, Ontology and Semantics: Lexical Signature from Prof Amazon?
Font Size    Frequency
16    >1000 AND <1500
14    >250 AND <500
12    >100 AND <250
Mäki, Uskali (Ed.) (2001). The Economic World View: Studies in the Ontology of Economics. Cambridge: Cambridge University Press.
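The font-size bands above amount to a simple lookup from word frequency to display size, as in a tag cloud. A minimal sketch, assuming the strict inequalities on the slide are meant literally and that frequencies outside every band get no size (the slide does not say what happens to them); `font_size` and `BANDS` are illustrative names.

```python
# (lower bound, upper bound, font size); bounds are exclusive, as on the slide.
BANDS = [
    (1000, 1500, 16),
    (250, 500, 14),
    (100, 250, 12),
]

def font_size(frequency):
    """Return the font size for a word frequency, or None if no band applies."""
    for low, high, size in BANDS:
        if low < frequency < high:
            return size
    return None

if __name__ == "__main__":
    for freq in (1200, 300, 150, 700):
        print(freq, font_size(freq))
```

Note that the bands as given leave gaps (for example 500 to 1000, and the boundary value 250 itself), so a real display would need a rule for those cases.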
SPECIAL LANGUAGES Special language is a language used in a subject field and characterized by the use of specific linguistic means of expression. http://stats.oecd.org/glossary/detail.asp?ID=6151
SPECIAL LANGUAGES Functionality and special language
Many researchers have purported to demonstrate that certain languages display particular features which are distinctly suited to serving a purpose required of that language or that have been melded by its use.
Type    Exemplar    Reference
natural languages    Sacapultec Maya    John du Bois (1987)
specialist languages    language of science    Sager, Dungworth and McDonald (1981)
specialist languages    legal language    Swales and Bhatia (1983)
specialist languages    language of commerce    Hoffman (1984)
operational languages    Seaspeak    Strevens (1984)
operational languages    language of military; police    Wikipedia
SPECIAL LANGUAGES: TEXT TYPES Imaginative Texts; Informative Texts; {Hortatory Texts;} {Instructive Texts;}
SPECIAL LANGUAGES: WORD CLASSES Open Class; Closed Class; Additional Class: Numerals & Interjections
SPECIAL LANGUAGE
    Theoretical Linguistics    Nuclear Physics    Automotive Engineering
Text Size
Total No. of Tokens    688733    472108    326621
Total Number of Texts    68    158    132
Language Variety
British English Texts    29    59    95
American English Texts    37    37    36
Other English Texts    2    62    1
Register
Imaginative: Adverts    0    0    38
Imaginative: Popular Science    1    31    5
Imaginative: News    0    8    24
Imaginative: Letters    3    8    0
Informative: Journal Papers    30    73    53
Informative: Book - Theses    23    0    0
Informative: Book - Monographs    8    22    4
Informative: Official    0    9    1
Instructive: Manuals    0    0    7
Instructive: Reports    3    7    0
SPECIAL LANGUAGES: CHARACTERISTICS Characteristics of special languages: Interlocking definitions; Technical Taxonomies; Special Expressions; Lexical Density; Syntactic Ambiguity; Grammatical Metaphor; Semantic Discontinuity. Halliday and Martin (1993: 71-84)
SPECIAL LANGUAGES: CHARACTERISTICS Frequency distribution of the first 100 most frequent words in a linguistics corpus. Open class words (OCWs) are indicated through the use of bold type face.
Token    Cumulative Rel. Frequency    No. of OCW
the, of, in, a, and, is, to, that, as, for    24.47%    0
are, be, this, we, it, which, with, by, not, i    6.92%    0
on, gender, an, from, have, s, can, or, there, nouns    3.81%    2
one, but, agreement, noun, at, has, o, other, these, n    2.67%    2
will, form, case, such, if, no, language, two, all, structure    2.19%    4
some, they, may, between, more, example, e, c, only, mor    1.99%    2
would, singular, also, semantic, theory, languages, p, its, forms, see    1.74%    5
b, morphology, plural, same, t, number, where, class, stem, so    1.51%    5
rules, masculine, then, word, syntactic, given, x, feature, binding, feminine    1.37%    7
first, second, than, lexical, hierarchy, subject, when, like, however, different    1.28%    3
TOTAL    47.95%    30
SPECIAL LANGUAGES: CHARACTERISTICS Frequency distribution of the first 100 most frequent words in a nuclear physics corpus. Open class words (OCWs) are indicated through the use of bold type face.
Token    Cumulative Relative Frequency    No. of OCW
the, of, in, and, a, to, is, for, that, be    27.44%    0
with, are, by, this, I, as, from, which, we, on    6.25%    0
at, energy, it, an, nuclei, nucleus, have, r, these, nuclear    3.65%    4
s, c, can, neutron, p, electrons, not, will, has, was    2.64%    2
b, two, scattering, been, phys, one, particles, e, between    2.19%    4
number, n, such, target, atom, j, cross, m, or, potential    1.98%    3
h, k, state, where, only, atoms, also, model, very, but    1.76%    3
electron, mev, states, calculations, structure, than, nucleon, data, q, mass    1.61%    8
density, more, core, f, d, section, if, theory, order, may    1.45%    4
t, elements, range, about, first, system, other, however, interaction, matter    1.31%    5
TOTAL    50.98%    33
SPECIAL LANGUAGES: CHARACTERISTICS Frequency distribution of the first 100 most frequent words in an automotive engineering corpus. Open class words (OCWs) are indicated through the use of bold type face.
Token    Cumulative Relative Frequency    No. of OCW
the, of, and, to, in, a, is, for, with, on    24.11%    0
as, be, are, by, that, this, at, it, which, engine    5.93%    1
from, fuel, was, system, emissions, catalyst, control, an, exhaust, or    3.70%    6
have, not, can, vehicle, cars, s, has, air, emission, test    2.79%    5
vehicles, speed, will, wheel, car, pressure, were, all, brake, been    2.36%    6
these, than, fig, more, no, but, catalytic, also, only, when    1.91%    2
high, unleaded, co, braking, new, temperature, european, g, conditions, one    1.72%    6
abs, use, standards, gas, up, nox, its, hc, if, time    1.55%    4
sensor, valve, road, systems, engines, during, diesel, they, would, used    1.38%    5
so, however, two, into, driving, converter, three, low, other, between    1.28%    2
SPECIAL LANGUAGE TERMINOLOGY Terminology: 1. refers to the usage and study of terms, that is to say words and compound words generally used in specific contexts. 2. refers to a more formal discipline which systematically studies the labelling or designating of concepts particular to one or more subject fields or domains of human activity, through research and analysis of terms in context, for the purpose of documenting and promoting correct usage. This study can be limited to one language or can cover more than one language at the same time (multilingual terminology, bilingual terminology, and so forth). http://en.wikipedia.org/wiki/Terminology
SPECIAL LANGUAGE TERMINOLOGY Terminology is a subject in its own right with its theoretical formalism, methods, techniques and tools.
Terminologists:
• analyze the concepts and concept structures used in a field or domain of activity
• identify the terms assigned to the concepts
• establish correspondences between terms in the various languages
• compile the terminology, on paper or in databases
• manage terminology databases
• create new terms
http://en.wikipedia.org/wiki/Terminology
Simple Methodology: Suggested Upper Merged Ontology (SUMO)
• Extract nouns and verbs from a source text
• Find classes in SUMO for the nouns and verbs
• Record a mapping as being either equal, subsuming or instance.
– type a single word that relates to the UBL term in the "SUMO term" or "English Word" text areas in the SUMO browser
• Create a subclass of SUMO if it's a subsuming mapping
• Add properties to the subclass
– reusing SUMO properties
– extending SUMO properties by creating a &%subrelation of an existing property
• Add English definition to the class
– define constraints that express how the subclass is more specific than the superclass
• Express the classes and properties in KIF and begin creating axioms, based on the English definitions created previously
Permission to reuse granted so long as this notice is not altered – Author: Adam Pease adampease@earthlink.net, 2003
Simple Methodology
An ontology is a data model that represents a domain and is used to reason about the objects in that domain and the relations between them.
• Individuals: the basic or "ground level" objects
• Classes: sets, collections, or types of objects
• Attributes: properties, features, characteristics, or parameters that objects can have and share
• Relations: ways that objects can be related to one another
http://en.wikipedia.org/wiki/Ontology_%28computer_science%29#Domain_ontologies_and_upper_ontologies
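The four components can be pictured as a small data model. The sketch below is only an illustration under stated assumptions: the type names (OntologyClass, Individual, Relation) and the automotive examples are invented here and do not come from SUMO or any ontology toolkit.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class OntologyClass:
    """A class: a set, collection, or type of objects."""
    name: str
    parent: Optional["OntologyClass"] = None

    def is_subclass_of(self, other: "OntologyClass") -> bool:
        # Walk up the parent chain; a class counts as a subclass of itself.
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False

@dataclass
class Individual:
    """A ground-level object, with the attributes it carries."""
    name: str
    cls: OntologyClass
    attributes: dict = field(default_factory=dict)

@dataclass
class Relation:
    """A named link between two individuals."""
    name: str
    subject: Individual
    obj: Individual

# Hypothetical automotive example
vehicle = OntologyClass("Vehicle")
car = OntologyClass("Car", parent=vehicle)
engine_cls = OntologyClass("Engine")
my_car = Individual("car-001", car, {"fuel": "unleaded"})
my_engine = Individual("engine-001", engine_cls, {"cylinders": 4})
has_part = Relation("hasPart", my_car, my_engine)
print(car.is_subclass_of(vehicle))  # a Car is a Vehicle
```

Reasoning in a real ontology language such as KIF goes well beyond this subclass check; the sketch only shows how the four kinds of component relate to one another.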
PREAMBLE: Special Language
A note on creativity or terminicide
‘That science has become more difficult for nonspecialists to understand is a truth universally acknowledged’. The choice of words in a journal paper is very different from that in a quality newspaper, obscuring the work of the scientists.
Source: Lexical Difficulty
Nature: 55.5
Science: 44.8
Cell: 31.6
Physics Today: 13.3
New Scientist: 4.0
Quality Newspaper: 0.0
Donald Hayes (1992) ‘The growing inaccessibility of science’. Nature. Vol 356, pp 739-740
Lexicogenesis: Diachronic Semantic Change The establishment of the unstable nucleus (1990’s)
LEXICAL SIGNATURE?
– The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
– The same concept may be referred to by different names.
– The frequency of words in a text carries a signature: if the text is specialist then a select few terms are used repeatedly.
– Everyday, general language texts seldom carry a signature.
LEXICAL SIGNATURE?
Texts in forensic science can be identified by the signature:
SINGLE TERMS: evidence, crime, scene, forensic, police, identification, case, court, analysis, time, information, blood
COMPOUND TERMS: crime scene, forensic evidence, court case, blood analysis, earprint, fingerprint, crime scenes
LEXICAL SIGNATURE?
In texts from all specialist domains, a few repeatedly used terms form the SIGNATURE. These terms are used PRODUCTIVELY: in plural form, as (heads of) compounds, and in derivative forms.
nucleus: nuclei (pl.), nuclear (adjective); stable/unstable nuclei; halo/closed shell nuclei; nuclear force/reaction; nuclear matter
crime: crime, criminal, crimes, criminals, criminalistics, criminology, criminalist(s), criminological, criminality; crime scene; crime of passion; property crime
BUILDING A THESAURUS
British National Corpus (BNC) = 100 million words; Surrey Forensic Science Corpus (SFSC) = 0.58 million words
Words: the, of, and, to, a
SFSC relative frequency: 6.8%, 3.7%, 2.5%, 2.4%
BNC relative frequency: 6.2%, 2.9%, 2.7%, 2.6%, 2.1%
SFSC/BNC weirdness: 1.1, 1.2, 1.0, 1.1
Weirdness is the ratio of a word's relative frequency in the SFSC to its relative frequency in the BNC. The five words have about the same distribution in the two corpora. These are the so-called closed class words, or grammatical words, and one would expect them to occur with similar frequency because both corpora contain English-language texts. The ratio stays close to 1: there is no weirdness in the use of these words in the Forensic Science corpus.
BUILDING A THESAURUS
British National Corpus (BNC) = 100 million words; Surrey Forensic Science Corpus (SFSC) = 0.58 million words
Words: evidence, crime, scene, forensic, police
SFSC relative frequency: 0.47%, 0.40%, 0.27%, 0.25%
BNC relative frequency: 0.021%, 0.007%, 0.001%, 0.028%
SFSC/BNC weirdness: 22, 57, 40, 473, 9
The five words do not have the same distribution in the two corpora. These are the so-called open class words, or lexical words. Relative to corpus size, evidence occurs 22 times in the Surrey corpus for every one occurrence in the BNC. Forensic is the most weird: 473 occurrences in the Surrey corpus for every one in the BNC.
BUILDING A THESAURUS
British National Corpus (BNC) = 100 million words; Surrey Forensic Science Corpus (SFSC) = 0.58 million words
Word: SFSC relative frequency, BNC relative frequency
bitemark: 0.0187%, 0%
earprint: 0.0137%, 0%
accelerant: 0.0115%, 0%
pyrolysis: 0.0139%, 0.00001%
ballistics: 0.0146%, 0.00002%
SFSC/BNC weirdness: 634, 1263
The first three words DO NOT EXIST in the BNC: these are the so-called neologisms, or new words. Pyrolysis and ballistics are both rarely used in the BNC.
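The weirdness ratio used on these slides (relative frequency in the specialist corpus divided by relative frequency in the general corpus) can be sketched as follows. The function name, the toy counts, and the choice to return infinity for words missing from the general corpus are assumptions made for this illustration; the slides do not say how zero frequencies were treated.

```python
import math
from collections import Counter

def weirdness(word, special_counts, general_counts):
    """Ratio of a word's relative frequency in a specialist corpus
    to its relative frequency in a general-language corpus."""
    rf_special = special_counts[word] / sum(special_counts.values())
    rf_general = general_counts[word] / sum(general_counts.values())
    if rf_general == 0:
        # Assumption: a word absent from the general corpus (a candidate
        # neologism) is reported as infinitely weird if it occurs at all.
        return math.inf if rf_special > 0 else 0.0
    return rf_special / rf_general

# Toy counts, not the SFSC or BNC figures
special = Counter({"the": 6, "forensic": 2, "earprint": 2})
general = Counter({"the": 99, "forensic": 1})

for w in ("the", "forensic", "earprint"):
    print(w, weirdness(w, special, general))
```

A closed class word such as "the" comes out near 1, a domain term comes out far above 1, and a word unseen in the general corpus is flagged separately. A smoothed variant (adding a small constant to the general-corpus count) would avoid the infinite value, at the cost of an arbitrary parameter.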
BUILDING A THESAURUS Collocation patterns – semantic prosody in Surrey Forensic Science Corpus
An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION
Ahmad, Khurshid, and Rogers, Margaret A. (2001). ‘Corpus Linguistics and Terminology Extraction’. In (Eds.) Sue-Ellen Wright and Gerhard Budin. Handbook of Terminology Management (Volume 2). Amsterdam & Philadelphia: John Benjamins Publishing Company. pp 725-760.
Smadja, Frank. (1994). ‘Retrieving Collocations from Text: Xtract’. In (Ed.) Susan Armstrong (Warwick). Using Large Corpora. Cambridge, Massachusetts & London, England: MIT Press. pp 143-177.
BUILDING A THESAURUS
British National Corpus (BNC) = 100 million words; Surrey Nanotube Corpus = 1.09 million words
Collocations with carbon (frequency of 1506) in the Surrey Nanoscale science corpus.
BUILDING A THESAURUS
British National Corpus (BNC) = 100 million words; Surrey Nanotube Corpus = 1.09 million words
Collocations with carbon nanotubes (frequency of 647) in the Surrey Nanoscale science corpus.
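Collocation tables of this kind start from positional co-occurrence counts around a node word. The sketch below is a minimal illustration, not the Xtract algorithm itself: the function name, the default window of five positions on either side, and the toy sentence are assumptions, and no statistical filtering of the counts is attempted.

```python
from collections import Counter, defaultdict

def collocates(tokens, node, window=5):
    """For each word near `node`, count how often it occurs at each
    relative position (-window .. +window, excluding 0)."""
    by_word = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-window, window + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            by_word[tokens[j]][offset] += 1
    return by_word

# Toy text; a real run would use the tokenised corpus.
tokens = "single wall carbon nanotubes and multiwall carbon nanotubes".split()
table = collocates(tokens, "carbon", window=2)
for word, positions in table.items():
    print(word, dict(positions), "total:", sum(positions.values()))
```

A word that piles up at one position (here "nanotubes" at +1) is a candidate for a compound term such as carbon nanotubes, whereas a word spread evenly across positions is more likely a chance neighbour.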
An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION
No. Potential ‘Hyponymic’ Patterns
1. NP0 such as {NP1, NP2, ..., (and|or) NPn}
2. such NP0 as {NP1, NP2, ..., (and|or) NPn}
3. {NP1, NP2, ..., NPn} (and|or) other NP0
4. NP0 (including|especially) {NP1, NP2, ..., (and|or) NPn}
Examples: injury including broken bone; the bow lute, such as the Bambara ndang
An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION
• This method has been successfully applied in recent years in the synthesis of various metal nanostructures such as nanowires, nanorods, and nanoparticles.
• Occasional multiwall carbon nanotubes and other carbon nanostructures were also found following annealing at higher (> °C) temperatures.
• The present method will be extended to find and fix nanoparticles including polymers, colloids, micelles, and hopefully biological molecules/tissues in solution.
• This technique is promising because many different types of nanowires, like nanotubes or semiconductor nanowires, are now synthetically available.
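The "such as" pattern (pattern 1) can be approximated with a regular expression. This is a rough sketch under strong assumptions: the hypernym is taken to be the single word before "such as" and the hyponym list is taken to run to the end of the clause, whereas a proper implementation would identify noun phrases with a chunker or parser. The function and variable names are invented for the example.

```python
import re

# Hypernym = one word before "such as"; list = everything up to . ; or :
SUCH_AS = re.compile(r"(\w+),? such as ([^.;:]+)")
# Split a list on commas (with optional and/or) or on a bare and/or.
LIST_SPLIT = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")

def such_as_pairs(sentence):
    """Return (hyponym, hypernym) candidates found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        for item in LIST_SPLIT.split(match.group(2).strip()):
            if item:
                pairs.append((item.strip(), hypernym))
    return pairs

sentence = ("This method has been successfully applied in recent years in the "
            "synthesis of various metal nanostructures such as nanowires, "
            "nanorods, and nanoparticles.")
for hyponym, hypernym in such_as_pairs(sentence):
    print(hyponym, "IS-A", hypernym)
```

Because the list is assumed to end at the clause boundary, a sentence that continues after the list (as in the bow lute example) would yield spurious candidates; the output is best treated as a set of suggestions for a terminologist to confirm.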
BUILDING A THESAURUS & ONTOLOGY
British National Corpus (BNC) = 100 million words; Surrey Philosophy of Science Corpus = 1.042 million words; 164 texts, 1990-2000; journal papers, letters, conference announcements, courses
PREAMBLE: Special Language
A note on creativity or terminicide
Source: Lexical Difficulty
Quality Newspaper: 0.0
Popular Science (Discover): -4.7
Fiction: -19.3
Nat. History magazine (Ranger Rick): -22.6
Children’s fiction: -27.4
Farm-workers talking to cows: -63.8
Lexicogenesis: Diachronic Semantic Change
The establishment of the nuclear atom (1890-1930): Bohr


