1ad0f665c16750ccb0f45daeba44c214.ppt
- Количество слайдов: 28
MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING AND PREDICTIVE REGRESSION METHOD By Wuri Wedyawati
Agenda n Introduction n Data Warehousing § Master. DW Data Warehouse n Predictive Regression § Real Estate Price Prediction n Conclusion
Introduction The object of this project is to develop a knowledge discovery system for prospective real estate sellers and buyers to determine their properties price, based on available sold listings in their areas. n The prediction of continuous values, such as properties selling price, is modeled by a statistical technique called predictive regression. n The suitable data warehouse design for this project is a star schema with one large fact table surrounded by many dimension tables. n This project uses Visual Basic. NET to create user interface. The communication between Oracle and. NET framework is established by adding Oracle Provider for OLE DB (Ora. OLEDB) component as reference in the project. n
Data Warehouse n Data warehouse is a collection of data gathered and organized so that it can easily by analyzed, extracted, synthesized, and otherwise be used for the purposes of further understanding the data. n Building a data warehouse is the first step to do as the preparation for data mining.
Data Warehouse (cont. ) n Several processes in building a data warehouse: 1. Extraction 2. Transformation and cleansing 3. Modeling 4. Transport
1. Extraction n Document the sources of data § Identify the databases and files containing the data of interest § Analyze and document the business meaning of the data, data relationships and business rules n Determine how much data needs to be extracted § Extract all of subset of the data in the source system – Use unload utility – Use data manipulation language statement § Extract the changes made to the source data – Use a recovery log – Use a database trigger
2. Transformation and Cleansing n n Check the integrity of the source data to verify that it conforms to the business rules and relationships identified in extraction process. Check the accuracy of the source data. Identify the tasks and products required for data cleansing. Document the rules and identify the products required for transforming and integrating the data into the format required by the target system.
3. Modeling n Fact is a numeric (or other type of) data element by which an organization measures aspects of its business. Examples: dollar amounts (budget, expenditure, encumbrance, revenue, value of order), counts (headcount, credit hours, number of items) n Dimension is a set of attributes, usually hierarchical, that is used to describe an organization’s business by constraining and grouping facts. Examples: time, students, faculty, organization, funds, product.
Star Schema
4. Transport n Identify the products and techniques to be used for loading the data into the target system § SQL*Loader utility (for flat file data) § Transportablespaces (for Oracle database) n Evaluate the need for data compression and encryption if captured or transformed data is to be transported across a network
Master. DW Data Warehouse RESI. TXT (Data Source) Transformation and Cleansing RESSOLDLOG. TXT (Log File) Update RES. TXT Transformation and Cleansing 2 OFCSRC. TXT AGTSRC. TXT Duplicate Detection OFFICE. TXT AGENT. TXT Load OFFICES TABLE RESIDENTIAL. TXT Load AGENTS TABLE AREA. TXT Load RESIDENTIAL TABLE AREA TABLE
Master. DW Extraction The operational data source is extracted from Sacramento, El Dorado, Placer, and Yolo Counties Multiple Listings Services (MLS) database. n The data source is from the residential MLS database backup that is scheduled in January 9, 2004. It captures all the residential data in the source system since January 1, 1998 until January 9, 2004. n The source data is in the “|” delimited flat file and contains of 191 fields and 295787 rows (“RESI. TXT”). n
Master. DW Transformation and Cleansing There are four steps of transformation and cleansing in this project: 1. Transformation and cleansing 1 2. Update process for the result of transformation and cleansing 1 3. Transformation and cleansing 2 4. Duplication detection for office and agent records
1. Transformation and Cleansing 1 n Listing Price Check If int. LP <= 0 Or int. LP > 9999 Then LPCheck = str. MLSNo & " : INVALID LP = " & Str(int. LP) Else. If int. LP < 10000 Or int. LP > 50000000 Then LPCheck = str. MLSNo & " : LP EXCEEDS LIMIT = " & Str(int. LP) End If n Square Footage Check If int. SQFT = 0 Then SQFTCheck = str. MLSNo & " : SQFT IS NULL = " & Str(int. SQFT) Else. If int. SQFT > 10000 And int. LP < 1000000 Then SQFTCheck = str. MLSNo & " : SQFT EXCEEDS LIMIT = " & Str(int. SQFT) End If
1. Transformation and Cleansing 1 (cont. ) n Listing Date Check If str. LD = "0000 -00 -00" Or Len(str. LD) < 8 Then LDCheck = str. MLSNo & " : INVALID LD = " & str. LD Else. If Date. Value(str. LD) < "1900 -01 -01" Then LDCheck = str. MLSNo & " : LD EXCEEDS LIMIT = " & str. LD End If n Number Check of Full Bathroom and Half Bathroom If int. Full <= 0 Then Bath. Check = str. MLSNo & " : NO FULL BATHROOM = " & Str(int. Full) & " AND " & Str(int. Half) End If
1. Transformation and Cleansing 1 (cont. ) n Number of Bedroom Check If int. Bed <= 0 Then Bed. Check = str. MLSNo & " : NO BEDROOM = " & Str(int. Bed) End If n Year Built Check If Len(str. Year. Blt) = 0 Then Year. Blt. Check = str. MLSNo & " : NO YEAR BUILT = " & str. Year. Blt Else. If Val(str. Year. Blt) <= 1900 Then Year. Blt. Check = str. MLSNo & " : INVALID YEAR BUILT = " & str. Year. Blt End If
1. Transformation and Cleansing 1 (cont. ) n n Pending Date Check If str. SD = "0000 -00 -00" Or Len(str. SD) < 8 Then SDCheck = str. MLSNo & " : INVALID SD = " & str. SD Else. If Len(LDCheck(str. MLSNo, str. LD)) = 0 And Date. Value(str. SD) < Date. Value(str. LD) Then SDCheck = str. MLSNo & " : SD / LD = " & str. SD & " / " & str. LD Else. If Len(PDCheck(str. MLSNo, str. PD, str. LD)) = 0 And Date. Value(str. SD) < Date. Value(str. PD) Then SDCheck = str. MLSNo & " : SD / PD = " & str. SD & " / " & str. PD Else. If Date. Value(str. SD) < "1990 -01 -01" Then SDCheck = str. MLSNo & " : SD EXCEEDS LIMIT = " & str. SD End If Days on Market Check If int. SP <= 0 Or int. SP > 9999 Then SPCheck = str. MLSNo & " : INVALID SP = " & Str(int. SP) Else. If int. SP < 10000 Or int. SP > 50000000 Then SPCheck = str. MLSNo & " : SP EXCEEDS LIMIT = " & Str(int. SP) Else. If Len(LPCheck(str. MLSNo, int. LP)) = 0 And (int. SP < int. LP / 1. 5 Or int. SP > int. LP * 1. 5) Then SPCheck = str. MLSNo & " : LP / SP EXCEEDS NORM = " & int. LP & " / " & int. SP End If
1. Transformation and Cleansing 1 (cont. ) n Sold Date Check If str. PD = "0000 -00 -00" Or Len(str. PD) < 8 Then PDCheck = str. MLSNo & " : INVALID PD = " & str. PD Else. If Len(LDCheck(str. MLSNo, str. LD)) = 0 And Date. Value(str. PD) < Date. Value(str. LD) Then PDCheck = str. MLSNo & " : PD IS LESS THAN LD => PD = " & str. PD & " & LD = " & str. LD Else. If Date. Value(str. PD) < "1990 -01 -01" Then PDCheck = str. MLSNo & " : PD EXCEEDS LIMIT = " & str. PD End If n Sold Price Check If (LDCheck(str. MLSNo, str. LD)) = 0 And Len(PDCheck(str. MLSNo, str. PD, str. LD)) = 0 And Date. Diff(Date. Interval. Day, Date. Value(str. PD), Date. Value(str. LD)) > 730 Then DOMCheck = str. MLSNo & " : DOM TOO LARGE = " & Date. Diff(Date. Interval. Day, Date. Value(str. PD), Date. Value(str. LD)) End If
2. Update Process for the Result of Transformation and Cleansing 1 n n n 132110169 : LP EXCEEDS LIMIT = 132 (132000) 30015346 : SQFT EXCEEDS LIMIT = 12700 (1270) 30015611 : LD EXCEEDS LIMIT = 1920 -05 -07 (2000 -05 -07) 30015755 : NO FULL BATHROOM = 0 AND 3 (3 AND 0) 102100090 : INVALID YEAR BUILT = 96 (1996) 30028591 : INVALID YEAR BUILT = 1056 (1956) 102000035 : PD IS LESS THAN LD => PD = 2000 -03 -30 & LD = 2020 -01 -26 (2000 -01 -26) 102000035 : SD / LD = 2000 -05 -31 / 2020 -01 -26 (2000 -01 -26) 122003643 : SD / PD = 2000 -11 -01 / 2000 -11 -05 (2000 -11 -05 / 2000 -11 -01) 132001727 : SP EXCEEDS LIMIT = 124 (124000) 30016715 : LP / SP EXCEEDS NORM = 226000 / 22600000 (226000)
3. Transformation and Cleansing 2 n The agents table (“AGTSRC. TXT”) fields: Listing agent id, Listing agent name, Listing agent phone 1, Listing agent phone 2, Listing agent phone 3, Listing agent phone type 1, Listing agent phone type 2, Listing agent phone type 3, Listing office id, Listing co-agent name, Listing co-agent phone, Listing co-office id, Selling agent name, Selling agent phone, Selling office id, Selling co-agent name, Selling co-office id. Example: “SREIDMAR|Marjorie Reid|916 -485 -5124|916 -4855124||1|2||LYON 01”
3. Transformation and Cleansing 2 (cont. ) n The offices table (“OFCSRC. TXT”) fields: Listing office id, Listing office name, Listing office phone, Listing office address, Listing office zip, Listing co-office id, Listing cooffice name, Listing co-office phone, Selling office id, Selling office name, Selling office phone, Selling co-office id, Selling cooffice name. Example: “LYON 01|Lyon Real Estate|916 -481 -3840|2580 Fair Oaks Blvd. #20 Sacramento, CA 95825|Sacramento”
3. Transformation and Cleansing 2 (cont. ) n The areas table (“AREA. TXT”) fields: Area number, area name, county. Example: “ 10819|East Sacramento & Vicinity|Sacramento County”
3. Transformation and Cleansing 2 (cont. ) n The residential table (“RESIDENTIAL. TXT”) fields: Example: “ 15501835|2367|Glen Ellen|95822|Sacramento|10822 ||Thomas Bros. (PL, SA)|317 D-5|035 -0132 -0120000|Residential|Sold|22 -Jan-95|28 -Jan-95|20 -Apr 99|01 -Jan-00|20 -May-99|01 -Jan-00|1549|1700 |764705. 9|17. 56|2|1|4|130000|159500|SGREENCA ||GCNA||130000|SSTANWIL||CLBA 20||1959|3|FHA |Sacramento Unified|Sacramento Unified”
4. Duplication and Detection for Agent and Office Records n “AGTSRC. TXT” contains duplicate records. An agent can be a selling agent, a buyer agent, or both in a listing. An agent can have more than one listing in “RES. TXT”. Example: “SAKBARIR|Rouhi N. Akbari|916 -484 -5456|916 -2237647||1|C||LYON 01” n “OFCSRC. TXT” contains duplicate records. An office can be a selling office, a buyer office, or both in a listing. An office can have more than one listing in “RES. TXT”. Example: “LYON 01|Lyon Real Estate|916 -481 -3840|2580 Fair Oaks Blvd. #20 Sacramento, CA 95825|Sacramento”
Master. DW Modeling AGENTS Dimension Table OFFICES Dimension Table RESIDENTIAL Fact Table AREAS Dimension Table
Master. DW Transport n Load “AREA. TXT” to AREAS dimension table c: >sqlldr masterdw/masterdw control=area. ctl log=area. log n Load “OFFICE. TXT” to OFFICE dimension table c: >sqlldr masterdw/masterdw control=office. ctl log=office. log n Load “AGENT. TXT” to AGENTS dimension table c: >sqlldr masterdw/masterdw control=agent. ctl log=agent. log n Load “RESIDENTIAL. TXT” to RESIDENTIAL dimension table c: >sqlldr masterdw/masterdw control=residential. ctl log=residential. log
Predictive Regression Predictive regression is regression that uses continuous values in the data set to predict unknown or future values of other variables of interest. The objective of regression analysis is to determine the best model that can relate the output variable to various input variables. n n i=1 β = [ ∑ (xi – meanx). (yi – meany) ] / [ ∑ (xi – meanx)2 ] α = meany - β. meanx y = α + β. x
Conclusion There are several things that can be learned from this project: § Real estate terms and transaction process § Building a data warehouse using Oracle data warehousing tools § Statistical data analysis (predictive regression method) § Visual Basic. NET programming § Oracle Provider for OLE DB (ORAOLEDB) n Future Work: § Enhance this project by making it online. § Determine what kind of house improvement that a real estate seller can do to increase his/her house price on the market. n


