Скачать презентацию MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING Скачать презентацию MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING

8899a4a3b81cfd600933df7ba9e3780f.ppt

  • Количество слайдов: 33

MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING AND PREDICTIVE REGRESSION Wuri Wedyawati, Meiliu MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING AND PREDICTIVE REGRESSION Wuri Wedyawati, Meiliu Lu Department of Computer Science California State University Sacramento, CA 95819 -6021 mlu@csus. edu

Outline n n n Introduction Data Warehousing § Building a data warehouse § Master. Outline n n n Introduction Data Warehousing § Building a data warehouse § Master. DW: the data warehouse Predictive Regression § Real Estate Price Prediction Conclusion Future work

Introduction n The objective is to develop a knowledge discovery system for prospective real Introduction n The objective is to develop a knowledge discovery system for prospective real estate sellers and buyers to determine their properties price based on local sold listings. n The prediction of properties selling price, is modeled by predictive regression. n Building a data warehouse is a prerequisite for efficient mining of large and operational data like Multiple Listings Services (MLS) – data source for this system.

Data Warehouse A decision support database that is maintained separately from the organization’s operational Data Warehouse A decision support database that is maintained separately from the organization’s operational database. n Support decision-making by providing a platform of consolidated, historical data for analysis. n Our data warehouse is based on a multidimensional data model called star schema with one large fact table surrounded by a set of dimension tables. n

Data Warehousing n Process of building a data warehouse: 1. Extraction 2. Transformation and Data Warehousing n Process of building a data warehouse: 1. Extraction 2. Transformation and cleansing 3. Modeling 4. Transport

1. Extraction n Document the sources of data § Identify the databases and files 1. Extraction n Document the sources of data § Identify the databases and files containing the data of interest § Analyze and document the business meaning of the data, data relationships and business rules n Determine data that need to be extracted § Extract all of subset of the data from the source – Use unload utility – Use data manipulation language statement § Extract the changes made to the source data – Use a recovery log – Use a database trigger

2. Transformation and Cleansing n n Check the integrity of the source data to 2. Transformation and Cleansing n n Check the integrity of the source data to verify that it conforms to the business rules and relationships identified in extraction step. Check the accuracy of the source data. Identify the tasks required for data cleansing. Transform and integrate the cleaned data into the format required by the target system – data warehouse.

3. Modeling n Star Schema shows data as a collection of two types: facts 3. Modeling n Star Schema shows data as a collection of two types: facts and dimensions. n A Fact table is the primary table in a dimensional model and it contains the names of the facts or numerical measures, as well as keys to each of the related dimension tables. Examples of facts: sales, credit cards accounts, residential records. n A Dimension table is used to describe a specific dimension with a set of attributes. Examples of dimensions: time, students, areas.

An Example Star Schema An Example Star Schema

Master. DW Modeling AGENTS Dimension Table OFFICES Dimension Table RESIDENTIAL Fact Table AREAS Dimension Master. DW Modeling AGENTS Dimension Table OFFICES Dimension Table RESIDENTIAL Fact Table AREAS Dimension Table

4. Transport n Identify the tools and techniques to be used for loading the 4. Transport n Identify the tools and techniques to be used for loading the data into the target system § SQL*Loader utility (for flat file data) § Transportablespaces (for Oracle database) n Evaluate the need for data compression and encryption if captured or transformed data is to be transported across a network

Master. DW Data Warehousing RESI. TXT (Data Source) Transformation and Cleansing RESSOLDLOG. TXT (Log Master. DW Data Warehousing RESI. TXT (Data Source) Transformation and Cleansing RESSOLDLOG. TXT (Log File) Update RES. TXT Transformation and Cleansing 2 OFCSRC. TXT AGTSRC. TXT Duplicate Detection OFFICE. TXT AGENT. TXT Load OFFICES TABLE RESIDENTIAL. TXT Load AGENTS TABLE AREA. TXT Load RESIDENTIAL TABLE AREA TABLE

Master. DW Extraction The operational data source is extracted from Sacramento, El Dorado, Placer, Master. DW Extraction The operational data source is extracted from Sacramento, El Dorado, Placer, and Yolo Counties Multiple Listings Services (MLS) database. n It captures all the residential data in the source system since January 1, 1998 until January 9, 2004. n The source data is in the “|” delimited flat file and contains of 191 fields and 295787 rows (“RESI. TXT”). n

Master. DW Transformation and Cleansing There are four steps : 1. Transformation and cleansing Master. DW Transformation and Cleansing There are four steps : 1. Transformation and cleansing 1 2. Update process for the result of transformation and cleansing 1 3. Transformation and cleansing 2 4. Duplication detection for office and agent records

1. Transformation and Cleansing 1 n Listing Price Check If int. LP <= 0 1. Transformation and Cleansing 1 n Listing Price Check If int. LP <= 0 Or int. LP > 9999 Then LPCheck = str. MLSNo & " : INVALID LP = " & Str(int. LP) Else. If int. LP < 10000 Or int. LP > 50000000 Then LPCheck = str. MLSNo & " : LP EXCEEDS LIMIT = " & Str(int. LP) End If n Square Footage Check If int. SQFT = 0 Then SQFTCheck = str. MLSNo & " : SQFT IS NULL = " & Str(int. SQFT) Else. If int. SQFT > 10000 And int. LP < 1000000 Then SQFTCheck = str. MLSNo & " : SQFT EXCEEDS LIMIT = " & Str(int. SQFT) End If

1. Transformation and Cleansing 1 (cont. ) n Listing Date Check If str. LD 1. Transformation and Cleansing 1 (cont. ) n Listing Date Check If str. LD = "0000 -00 -00" Or Len(str. LD) < 8 Then LDCheck = str. MLSNo & " : INVALID LD = " & str. LD Else. If Date. Value(str. LD) < "1900 -01 -01" Then LDCheck = str. MLSNo & " : LD EXCEEDS LIMIT = " & str. LD End If n Number Check of Full Bathroom and Half Bathroom If int. Full <= 0 Then Bath. Check = str. MLSNo & " : NO FULL BATHROOM = " & Str(int. Full) & " AND " & Str(int. Half) End If

1. Transformation and Cleansing 1 (cont. ) n Number of Bedroom Check If int. 1. Transformation and Cleansing 1 (cont. ) n Number of Bedroom Check If int. Bed <= 0 Then Bed. Check = str. MLSNo & " : NO BEDROOM = " & Str(int. Bed) End If n Year Built Check If Len(str. Year. Blt) = 0 Then Year. Blt. Check = str. MLSNo & " : NO YEAR BUILT = " & str. Year. Blt Else. If Val(str. Year. Blt) <= 1900 Then Year. Blt. Check = str. MLSNo & " : INVALID YEAR BUILT = " & str. Year. Blt End If

1. Transformation and Cleansing 1 (cont. ) n n Pending Date Check If str. 1. Transformation and Cleansing 1 (cont. ) n n Pending Date Check If str. SD = "0000 -00 -00" Or Len(str. SD) < 8 Then SDCheck = str. MLSNo & " : INVALID SD = " & str. SD Else. If Len(LDCheck(str. MLSNo, str. LD)) = 0 And Date. Value(str. SD) < Date. Value(str. LD) Then SDCheck = str. MLSNo & " : SD / LD = " & str. SD & " / " & str. LD Else. If Len(PDCheck(str. MLSNo, str. PD, str. LD)) = 0 And Date. Value(str. SD) < Date. Value(str. PD) Then SDCheck = str. MLSNo & " : SD / PD = " & str. SD & " / " & str. PD Else. If Date. Value(str. SD) < "1990 -01 -01" Then SDCheck = str. MLSNo & " : SD EXCEEDS LIMIT = " & str. SD End If Days on Market Check If int. SP <= 0 Or int. SP > 9999 Then SPCheck = str. MLSNo & " : INVALID SP = " & Str(int. SP) Else. If int. SP < 10000 Or int. SP > 50000000 Then SPCheck = str. MLSNo & " : SP EXCEEDS LIMIT = " & Str(int. SP) Else. If Len(LPCheck(str. MLSNo, int. LP)) = 0 And (int. SP < int. LP / 1. 5 Or int. SP > int. LP * 1. 5) Then SPCheck = str. MLSNo & " : LP / SP EXCEEDS NORM = " & int. LP & " / " & int. SP End If

1. Transformation and Cleansing 1 (cont. ) n Sold Date Check If str. PD 1. Transformation and Cleansing 1 (cont. ) n Sold Date Check If str. PD = "0000 -00 -00" Or Len(str. PD) < 8 Then PDCheck = str. MLSNo & " : INVALID PD = " & str. PD Else. If Len(LDCheck(str. MLSNo, str. LD)) = 0 And Date. Value(str. PD) < Date. Value(str. LD) Then PDCheck = str. MLSNo & " : PD IS LESS THAN LD => PD = " & str. PD & " & LD = " & str. LD Else. If Date. Value(str. PD) < "1990 -01 -01" Then PDCheck = str. MLSNo & " : PD EXCEEDS LIMIT = " & str. PD End If n Sold Price Check If (LDCheck(str. MLSNo, str. LD)) = 0 And Len(PDCheck(str. MLSNo, str. PD, str. LD)) = 0 And Date. Diff(Date. Interval. Day, Date. Value(str. PD), Date. Value(str. LD)) > 730 Then DOMCheck = str. MLSNo & " : DOM TOO LARGE = " & Date. Diff(Date. Interval. Day, Date. Value(str. PD), Date. Value(str. LD)) End If

2. Update Process for the Result of Transformation and Cleansing 1 n n n 2. Update Process for the Result of Transformation and Cleansing 1 n n n 132110169 : LP EXCEEDS LIMIT = 132 (132000) 30015346 : SQFT EXCEEDS LIMIT = 12700 (1270) 30015611 : LD EXCEEDS LIMIT = 1920 -05 -07 (2000 -05 -07) 30015755 : NO FULL BATHROOM = 0 AND 3 (3 AND 0) 102100090 : INVALID YEAR BUILT = 96 (1996) 30028591 : INVALID YEAR BUILT = 1056 (1956) 102000035 : PD IS LESS THAN LD => PD = 2000 -03 -30 & LD = 2020 -01 -26 (2000 -01 -26) 102000035 : SD / LD = 2000 -05 -31 / 2020 -01 -26 (2000 -01 -26) 122003643 : SD / PD = 2000 -11 -01 / 2000 -11 -05 (2000 -11 -05 / 2000 -11 -01) 132001727 : SP EXCEEDS LIMIT = 124 (124000) 30016715 : LP / SP EXCEEDS NORM = 226000 / 22600000 (226000)

3. Transformation and Cleansing 2 n The agents table (“AGTSRC. TXT”) fields: Listing agent 3. Transformation and Cleansing 2 n The agents table (“AGTSRC. TXT”) fields: Listing agent id, Listing agent name, Listing agent phone 1, Listing agent phone 2, Listing agent phone 3, Listing agent phone type 1, Listing agent phone type 2, Listing agent phone type 3, Listing office id, Listing co-agent name, Listing co-agent phone, Listing co-office id, Selling agent name, Selling agent phone, Selling office id, Selling co-agent name, Selling co-office id. Example: “SREIDMAR|Marjorie Reid|916 -485 -5124|916 -4855124||1|2||LYON 01”

3. Transformation and Cleansing 2 (cont. ) n The offices table (“OFCSRC. TXT”) fields: 3. Transformation and Cleansing 2 (cont. ) n The offices table (“OFCSRC. TXT”) fields: Listing office id, Listing office name, Listing office phone, Listing office address, Listing office zip, Listing co-office id, Listing cooffice name, Listing co-office phone, Selling office id, Selling office name, Selling office phone, Selling co-office id, Selling cooffice name. Example: “LYON 01|Lyon Real Estate|916 -481 -3840|2580 Fair Oaks Blvd. #20 Sacramento, CA 95825|Sacramento”

3. Transformation and Cleansing 2 (cont. ) n The areas table (“AREA. TXT”) fields: 3. Transformation and Cleansing 2 (cont. ) n The areas table (“AREA. TXT”) fields: Area number, area name, county. Example: “ 10819|East Sacramento & Vicinity|Sacramento County”

3. Transformation and Cleansing 2 (cont. ) n The residential table (“RESIDENTIAL. TXT”) fields: 3. Transformation and Cleansing 2 (cont. ) n The residential table (“RESIDENTIAL. TXT”) fields: Example: “ 15501835|2367|Glen Ellen|95822|Sacramento|10822 ||Thomas Bros. (PL, SA)|317 D-5|035 -0132 -0120000|Residential|Sold|22 -Jan-95|28 -Jan-95|20 -Apr 99|01 -Jan-00|20 -May-99|01 -Jan-00|1549|1700 |764705. 9|17. 56|2|1|4|130000|159500|SGREENCA ||GCNA||130000|SSTANWIL||CLBA 20||1959|3|FHA |Sacramento Unified|Sacramento Unified”

4. Duplication and Detection for Agent and Office Records n “AGTSRC. TXT” contains duplicate 4. Duplication and Detection for Agent and Office Records n “AGTSRC. TXT” contains duplicate records. An agent can be a selling agent, a buyer agent, or both in a listing. An agent can have more than one listing in “RES. TXT”. Example: “SAKBARIR|Rouhi N. Akbari|916 -484 -5456|916 -2237647||1|C||LYON 01” n “OFCSRC. TXT” contains duplicate records. An office can be a selling office, a buyer office, or both in a listing. An office can have more than one listing in “RES. TXT”. Example: “LYON 01|Lyon Real Estate|916 -481 -3840|2580 Fair Oaks Blvd. #20 Sacramento, CA 95825|Sacramento”

Master. DW Modeling: Ready to load the clean data into the 4 tables AGENTS Master. DW Modeling: Ready to load the clean data into the 4 tables AGENTS Dimension Table OFFICES Dimension Table RESIDENTIAL Fact Table AREAS Dimension Table

Master. DW Transport n Load “AREA. TXT” to AREAS dimension table c: >sqlldr masterdw/masterdw Master. DW Transport n Load “AREA. TXT” to AREAS dimension table c: >sqlldr masterdw/masterdw control=area. ctl log=area. log n Load “OFFICE. TXT” to OFFICE dimension table c: >sqlldr masterdw/masterdw control=office. ctl log=office. log n Load “AGENT. TXT” to AGENTS dimension table c: >sqlldr masterdw/masterdw control=agent. ctl log=agent. log n Load “RESIDENTIAL. TXT” to RESIDENTIAL dimension table c: >sqlldr masterdw/masterdw control=residential. ctl log=residential. log

Predictive Regression Predictive regression is regression that uses continuous values in the data set Predictive Regression Predictive regression is regression that uses continuous values in the data set to predict unknown or future values of other variables of interest. The objective of regression analysis is to determine the best model that can relate the output variable to various input variables. n n i=1 β = [ ∑ (xi – meanx). (yi – meany) ] / [ ∑ (xi – meanx)2 ] α = meany - β. meanx y = α + β. x

Regression: input and output n Input Data: X, β, α to be determined by Regression: input and output n Input Data: X, β, α to be determined by query selection result from Master. DW based on user request parameters § example: § “Select * from Residential where (Status = ‘Sold’) and (Area_Number) = ‘ 10835’ and (Square_Footage between ‘ 2000’ and ‘ 3000’) and (Bedrooms = ‘ 4’) and (Bathrooms_Full = ‘ 2’) and (Bathrooms_Half = ‘ 0’) and (Year_Built = ‘ 2001’)” Assumption: Bull housing market n Output result: Y predicted house price today n

Interface n Interface: (user => Master. DW => Predictive Regression => user) n Visual Interface n Interface: (user => Master. DW => Predictive Regression => user) n Visual Basic. NET is used to create user interface. n The communication between Oracle and. NET framework is established by adding Oracle Provider for OLE DB (Ora. OLEDB) component as reference.

Conclusion understand the knowledge domain: Real estate terms and transaction process n Technology used: Conclusion understand the knowledge domain: Real estate terms and transaction process n Technology used: § Building a data warehouse using Oracle data warehousing tools § Statistical data analysis (predictive regression method) § Visual Basic. NET programming § Oracle Provider for OLE DB (ORAOLEDB) n

Future Work § Towards tightly coupling data mining architecture. § Enhance this project by Future Work § Towards tightly coupling data mining architecture. § Enhance this project by making it an online service for public. § Integrate current market trend factor § Determine what kind of house improvement that a real estate seller can do to increase property value on the market.