- Number of slides: 35
Analysis of non-stochastic time varying data - FINGRID
Lee Gillam
Department of Computing, University of Surrey
Financial Decision Making
Challenge: analysis of streaming financial (time-series) data and financial and political news
At the interface of quantitative and qualitative?
ASW on Quant. Methods in e-Social Science, 6 April 2005. Fingrid (RES-149-25-0028)
FINGRID Project
• Aimed at the information management/processing challenge in the social sciences: analysis and fusion of distributed quantitative and qualitative data and programs.
• 12-month eSS PDP involving econometrics (Essex) and computing academics, particularly in grid computing and artificial intelligence, at Surrey (social anthropologists & criminologists).
• Third project at Surrey that deals with qualitative data (news and reports) and quantitative data (time series). EU projects: ACE (1996-99), GIDA (2001-03).
Motivation
• Market sentiment - quantifying effects of news in the Efficient Market Hypothesis?
• Technicalists (chart patterns, stats) and fundamentalists (intrinsic, i.e. book, value) locked away from the outside world - no CNN?
Challenge of treating multiple data sources
• Bounded rationality (Simon 1972, Kahneman 2002)?
• Self-deception of investors rejecting new evidence in favour of prior (incorrect) information (Lakonishok, Lee & Poteshman 2003, Kindleberger 2001) - e.g. the “.com” bubble
Buy/sell - human (re-)action is documented in the dataset
FINGRID methods/techniques
• Sentiment analysis: automatic terminology extraction; ontology learning; local grammars.
• Learning the rules for Information Extraction (IE).
• Patterns derived from a corpus (MB to GB) of texts (arbitrary domain)
• Time series analysis (bootstrapping, wavelet analysis)
• Visualization of large volume time series and texts
• Grid - Globus, Condor, OGSA-DAI, SRB
FINGRID Technologies
24 computers (dual-processor, hyperthreaded) provide:
1. Globus Toolkit 3.0.2 (GT3),
2. Java and FORTRAN compilers,
3. Java Commodity Grid Kit (CoG Kit), and
4. Local security certification.
FINGRID uses the Java CoG Kit to integrate:
(i) the MATLAB wavelet toolbox via JMatLink;
(ii) Reuters data via the Reuters SSL SDK;
(iii) bootstrap simulation written in FORTRAN; and
(iv) System Quirk components via the Quirk Java SDK.
Condor (management of distributed processing, 76 processors in pool) and Storage Resource Broker (Data Grids) are also configured: expansion and testing in progress.
Streaming Data (Reuters)
FOREX (GBP/USD) tick data
Datasets

Numerical data
• Time series: price/value movement of financial instruments
• Volume: c. 5 MB/day per instrument (XML), including sources of quote (>1 GB/year/instrument)
• HFDF data (O&A): e.g. 5-minute compression of GBP/USD, 1992 to 2003 inclusive; 1.25 M datapoints (12*24*365*12), approximately 4 MB

Textual data
• Text streams: news items; financial reports; company brochures; government documents…
• Volume: c. 40 MB/day (>10 GB/year)
• Text corpora: RCV1 (over 800000 news stories in 12 months of 1996-7); RCV2 (13 languages)

Copyrights/contracts?
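The volume figures on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's own numbers; the variable names are illustrative, and a 365-day year is assumed, as in the slide's formula.

```python
# Back-of-envelope check of the dataset sizes quoted on the slide.
samples_per_hour = 12          # one sample every 5 minutes
hours_per_day = 24
days_per_year = 365
years = 12                     # 1992 to 2003 inclusive

datapoints = samples_per_hour * hours_per_day * days_per_year * years

# Yearly volumes implied by the quoted daily rates (in MB).
tick_mb_per_year = 5 * days_per_year    # c. 5 MB/day per instrument
text_mb_per_year = 40 * days_per_year   # c. 40 MB/day of text

print(datapoints, tick_mb_per_year, text_mb_per_year)
```

The product comes to 1,261,440 datapoints, consistent with the quoted "1.25 M", and the daily rates are consistent with the ">1 GB/year" and ">10 GB/year" bounds.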
Non-stochastic?
Encyclopedia of Chart Patterns
Japanese Candlestick Charting Techniques
If price increases, demand decreases?
Methods - Bootstrap
With many financial series, it may be difficult to select and fit an appropriate model. Block bootstrap is a procedure for generating bootstrap samples from time series when a parametric model is not available. The blocking procedure consists of dividing data into blocks and sampling blocks randomly with replacement.
Bootstrap techniques are inherently computationally demanding, even using efficient computational algorithms (Nankervis 2002).
The bootstrap can be iterated so that a further layer of resampling is performed (a double bootstrap), which results in improved properties of estimators and test statistics.
To make realistic statistical inferences from data using bootstrapping, a large number of replications (c. 10000) should be used (Lobato, Nankervis & Savin 2001).
Other bootstrap-based procedures applicable to financial data include estimating the distribution of returns for Value at Risk (VaR) models (Ruiz and Pascual, 2002).
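As an illustration of the blocking procedure described on this slide, here is a minimal Python sketch of a non-overlapping block bootstrap. It is not the FORTRAN implementation used in FINGRID; the function names, the toy `returns` series, the block length and the choice of the mean as the statistic are all assumptions made for the example.

```python
import random

def block_bootstrap(series, block_len, rng):
    """Resample a series by drawing consecutive blocks with replacement."""
    # Divide the data into consecutive, non-overlapping blocks.
    blocks = [series[i:i + block_len] for i in range(0, len(series), block_len)]
    sample = []
    # Draw blocks at random (with replacement) until the sample is long enough.
    while len(sample) < len(series):
        sample.extend(rng.choice(blocks))
    return sample[:len(series)]

def bootstrap_distribution(series, statistic, block_len, replications, seed=0):
    """Evaluate `statistic` on many block-bootstrap resamples of `series`."""
    rng = random.Random(seed)
    return [statistic(block_bootstrap(series, block_len, rng))
            for _ in range(replications)]

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical toy series of returns (illustrative values only).
returns = [0.4, -0.2, 0.1, 0.3, -0.5, 0.2, 0.0, -0.1, 0.6, -0.3, 0.2, 0.1]
dist = sorted(bootstrap_distribution(returns, mean, block_len=3, replications=1000))
lower, upper = dist[25], dist[974]   # rough 95% percentile interval
print(lower, upper)
```

Each replication is independent of the others once its random seed is fixed, which is what makes the procedure a natural candidate for distribution over Grid nodes.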
Methods - Bootstrap
Bespoke FORTRAN implementation of bootstrapping [Nankervis] algorithm (Globus, Java CoG Kit: Grid service)
1000 bootstrap replications:
• 2 nodes: 1050 seconds (17.5 mins)
• 8 nodes: 404 seconds (6.73 mins)
10000+ replications? Linear speedup?
Hypothesis testing: dismiss bad ideas more quickly?
Distributed bootstrapping
Bootstrap is partially parallelizable:
• Amdahl’s law: the fraction of code f which cannot be parallelised limits the speedup factor. Here the serial parts are the replication seeds and the collection of results.
Condor and Condor DAGs (compose metalevel description): seed -> calculate (four jobs in parallel) -> results

DAG description:
    Job A seed.cmd
    Job B calculate1.cmd
    Job C calculate2.cmd
    Job D calculate3.cmd
    Job E calculate4.cmd
    Job F results.cmd
    PARENT A CHILD B C D E
    PARENT B C D E CHILD F

Submit description for one calculate job:
    executable = calculate.exe
    input =
    output = calculate.1.out
    error = calculate.1.err
    transfer_input_files = outs_aa
    transfer_files = ALWAYS
    log = calculate.1.log
    arguments = outs_aa 250
    queue
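Amdahl's law can be made concrete with the two timings reported for 1000 replications (1050 seconds on 2 nodes, 404 seconds on 8 nodes). The Python sketch below assumes the run time follows T(n) = T1 * (f + (1 - f) / n) exactly at both measurements, which ignores scheduling and transfer overheads, so the estimate of f is only indicative. The function names are illustrative.

```python
def amdahl_speedup(f, n):
    """Speedup on n processors when a fraction f of the work is serial."""
    return 1.0 / (f + (1.0 - f) / n)

def serial_fraction_from_timings(n1, t1, n2, t2):
    """Estimate f from two (nodes, seconds) measurements under Amdahl's model.

    With T(n) = T1 * (f + (1 - f) / n), the ratio t1 / t2 depends only on f:
        t1 / t2 = (f + (1 - f) / n1) / (f + (1 - f) / n2)
    which is linear in f and can be solved directly.
    """
    r = t1 / t2
    a1, a2 = 1.0 / n1, 1.0 / n2
    return (r * a2 - a1) / ((1.0 - a1) - r * (1.0 - a2))

# Timings quoted for 1000 bootstrap replications.
f = serial_fraction_from_timings(2, 1050.0, 8, 404.0)
print("estimated serial fraction:", f)
print("speedup bound as n grows:", 1.0 / f)
print("predicted speedup on 76 processors:", amdahl_speedup(f, 76))
```

Under these assumptions the serial fraction comes out at roughly 0.1, which would cap the achievable speedup at about 10 however many processors are added. That is one reading of the "Linear speedup?" question, not a measured result.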
Wavelet analysis
Conventional Signal Processing:
• Variation in time-domain OR variation in frequency domain; applicable to stationary series
Wavelet-based Analysis:
• Variation in time-domain AND variation in frequency domain; applicable to non-stationary series
Aussem & Murtagh (1997) use wavelet analysis combined with neural networks to provide time series forecasts
Wavelet Multiscale Analysis
Fourier power spectra can be computed for each scale, to discover cyclicals
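To show what a multiscale decomposition produces, here is a minimal Python sketch using the Haar wavelet, the simplest possible choice. FINGRID used the MATLAB wavelet toolbox, and the slides do not say which wavelet family was applied, so the Haar filter, the function names and the synthetic signal are all assumptions for illustration. The sketch reports the energy at each scale and does not compute the Fourier power spectra mentioned on the slide.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_multiscale(signal, levels):
    """Return ([detail_1, ..., detail_levels], final_approximation)."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)      # detail_1 holds the fastest variation
    return details, approx

def energy(xs):
    return sum(x * x for x in xs)

# Synthetic series: a slow cycle plus a fast alternating component.
signal = [math.sin(2 * math.pi * i / 16) + 0.5 * (-1) ** i for i in range(32)]
details, approx = haar_multiscale(signal, levels=3)
for level, d in enumerate(details, start=1):
    print("scale", level, "energy", energy(d))
print("residual low-frequency energy", energy(approx))
```

Because the transform is orthonormal, the per-scale energies sum to the energy of the original series, so each scale can be read as a share of the total variation.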
Methods - Wavelet analysis
[Figure: Visualization]
Textual Summary:
• Most dominant cycle (brown rectified sine wave) has a period of 85.3333 and starts at 57.3333
• Next dominant cycle (green rectified sine wave) has a period of 42.6667 and starts at 41.3333
• Other cycles in order of their importance are 23.27, 11.6364, 5.68 and 3.12
• UPTREND from 1 to 260 with a slope of 12.57 and a y-intercept of 2626.37
• DOWNTREND from 261 to 358 with a slope of -8.6956 and a y-intercept of 8166.4091
• The series loses its stationarity (variance change occurs) at 141 (black vertical line)
• Possible turning points (black circles): 68, 144, 148, 152, 154, 165, 212, 220, 228, 260, 298, 299, 300, 348, 358, and 358
Methods - Wavelet analysis
Matlab toolboxes for Wavelet and Signal processing analysis
Matlab -> JMatLink (Java) -> Java CoG Kit: Grid service
Parallel/performance evaluation?

Calling Matlab through JMatLink:
    JMatLink engine = new JMatLink();
    engine.engOpen();
    engine.engEvalString("array=randn(500)");
    array = engine.engGetArray("array");
    engine.engClose();

Grid service locator class:
    public class TSAanalysisServiceGridLocator
        extends org.globus.ogsa.impl.core.service.ServiceLocator
        implements org.globus.ogsa.GridLocator { …
Workflow
• Select instrument tick data
• Use sampling rule (OHLC) to create a time series [4 series; C at equally-spaced intervals]
• Use close series for n-scale Wavelet transform [n series]
• Identify trends in low-frequency scale; apply Fourier analysis to each of the n series to discover cycles
• Apply bootstrap to modelling individual series?
• Combination of model and trends = prediction?
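The OHLC sampling step of the workflow can be sketched as follows. This is a minimal Python illustration, assuming ticks arrive as (timestamp in seconds, price) pairs sorted by time; the function name, the data layout and the example prices are hypothetical and do not reflect the Reuters feed format.

```python
def ohlc_bars(ticks, interval):
    """Compress (timestamp_seconds, price) ticks into open/high/low/close bars.

    Ticks are assumed to be sorted by time. Intervals with no ticks produce
    no bar, so gaps would need filling before a wavelet transform that
    expects equally spaced samples.
    """
    bars = {}
    for t, price in ticks:
        start = int(t // interval) * interval      # start of the tick's interval
        bar = bars.get(start)
        if bar is None:
            bars[start] = {"start": start, "open": price, "high": price,
                           "low": price, "close": price}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price                   # last tick seen so far
    return [bars[k] for k in sorted(bars)]

# Hypothetical GBP/USD ticks, compressed to 5-minute (300 s) bars.
ticks = [(0, 1.80), (60, 1.82), (200, 1.79), (310, 1.81), (590, 1.85), (600, 1.84)]
bars = ohlc_bars(ticks, interval=300)
close_series = [b["close"] for b in bars]
print(close_series)
```

The four series named in the workflow are the `open`, `high`, `low` and `close` fields taken across bars; `close_series` is the one passed on to the wavelet transform.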
Methods - Textual time series
• Streaming news text
• Named entity identification (e.g. company name)
• Sentiment discovery (local grammars)
• Up/down series for market / company (qual -> quant?)
System Quirk JDK + Java CoG Kit = Grid Service -> time series analysis -> covariance analysis
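The qual -> quant step, turning news text into an up/down series, can be illustrated with a deliberately crude Python sketch. It replaces the local grammars and named entity identification used in FINGRID with plain word lists, so it shows only the shape of the output series; the word lists, function name and example headlines are all hypothetical.

```python
from collections import defaultdict

# Hypothetical word lists standing in for learned local-grammar patterns.
UP_TERMS = {"up", "rose", "rise", "gain", "gained"}
DOWN_TERMS = {"down", "fell", "fall", "loss", "dropped"}

def sentiment_series(news_items):
    """Map (date, text) news items to {date: up_count - down_count}."""
    scores = defaultdict(int)
    for date, text in news_items:
        scores[date] += 0                      # keep dates with no matches
        for token in text.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word in UP_TERMS:
                scores[date] += 1
            elif word in DOWN_TERMS:
                scores[date] -= 1
    return dict(sorted(scores.items()))

news = [
    ("2005-04-05", "Sterling rose against the dollar; shares up."),
    ("2005-04-06", "Profits fell and the index dropped."),
    ("2005-04-06", "Bank gained."),
]
series = sentiment_series(news)
print(series)
```

The resulting per-day scores form a numeric series that can be set alongside an instrument's price series for the covariance analysis mentioned on the slide.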
Methods - Textual time series
Patterns identified for Chinese also: “up” (上升) in Chinese

上半年/NTN 地產/NN 投資/NN 收入/NN 上升/NN 約/FPM 百分之八/MM ﹐ 至/I 十九 億/MM 元/U ﹔
first half of this year, estate investment up about 8 percent, to 19 billion dollars

月/NTN 期/NN 指/VT 全/PA 日/NTN 收/VT 報/NN 一萬一千三百/MM 點/U ﹐ 上升/VI 二十/MM 點/U ﹐ 低/A 水/NN 四十五/MM 點/U ﹐ 成交/VT 合約/NN
day-close value of the monthly index was 11300 points, up 20 points, 45 points below average
Methods - Textual time series
• Text Analysis
• Throughput tested with various sizes of corpora, against a benchmark (wordlists; Hughes et al 2004)
Time required to process one month’s news.
RCV1 takes about 95 minutes on 16 machines.
Further experiments in progress
Qual. meets Quant.
Decision Matrix / probability of direction
Qual. meets Quant.
• FINGRID’s Sentiment and Time Series: Financial analysis system (SATISFI), for visualising and correlating the sentiment and instrument time series
• Composition of Grid services
FINGRID -> qual.
• System Quirk
• text + terminology + ontology + local grammars + …
• Neural network classifiers (Hebbian networks, Websom)
• Case-based and fuzzy reasoning
• Automatic Text Summarisation
• Text alignment
• Metadata
• ISO-standardized (ISO 11179-3 conformant data registries - LIRICS project); application to text management (Virtual Corpora); Text Categorisation (+ Terminology lookup)
• ISO 639 (codes for the names of languages); ISO 16642 (Terminology Markup Framework); LMF; MAF and other TLAs
Recap
• Sentiment analysis: automatic terminology extraction; ontology learning; local grammars.
• Learning the rules for Information Extraction (IE).
• Patterns derived from a corpus (MB to GB) of texts (arbitrary domain)
• Time series analysis (bootstrapping, wavelet analysis)
• Visualization of large volume time series and texts
• Grid - Globus, Condor, OGSA-DAI, SRB
Acknowledgements
• Saif Ahmad, Research Student, Wavelet Analysis
• David Cheng, Research Officer, Text Analysis
• Gary Dear, Computing Officer, Grid Implementation
• Pensiri Manomaisupat, Research Student, Text Categorisation
• Ademola Popoula, Research Student, Fuzzy Logic Analysis
• Hayssam Trablousi, Research Student, Named Entity Extraction
• Tuğba Taşkaya-Temizel, Tutor, Grid Computing, Grid Architect
• Khurshid Ahmad, Principal Investigator
• Jon Nankervis, Co-Investigator (Essex)
Outlook
• Lessons from: Value at Risk computation (RiskGrid - BeSC); aircraft vibration time series (DAME - York), …
• Proposed activity on qual analysis (content analysis meets code-based); qual-quant integration/fusion?
• Integration with Sheffield’s GATE system
• Complement and draw upon the work of eSS PDPs and the existing nodes: text analysis (Nottingham), modelling & simulation (Leeds), mixed media (Bristol), and quantitative analysis (Lancaster).
• Additional activities (Surrey): EPSRC: REVEAL (auto-annotation of crime-related CCTV); EU eContent: LIRICS
Further information
http://www.computing.surrey.ac.uk/grid/fingrid
l.gillam@surrey.ac.uk