Big Data
Minder Chen, Ph.D.
Professor of MIS, CSU Channel Islands
minder.chen@csuci.edu

Benefits of Big Data
http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
© Minder Chen, 2017

Data Management
https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
https://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/

With Big Data, We’ve Moved into a New Era of Analytics
• Volume: 12+ terabytes of Tweets created daily.
• Velocity: 5+ million trade events per second.
• Variety: 100’s of different types of data.
• Veracity: Only 1 in 3 decision makers trust their information.
Ömer Sever (omers@tr.ibm.com), IBM SWG TR, Enterprise Content Management

Four Characteristics of Big Data
• Volume: Cost-efficiently processing the growing volume. 50x growth from 2010 to 2020 (35 ZB in 2020).
• Velocity: Responding to the increasing velocity. 30 billion RFID sensors and counting.
• Variety: Collectively analyzing the broadening variety. 80% of the world’s data is unstructured.
• Veracity: Establishing the veracity of big data sources. 1 in 3 business leaders don’t trust the information they use to make decisions.
Ömer Sever (omers@tr.ibm.com), IBM SWG TR, Enterprise Content Management

Volume
http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Metric prefixes in everyday use

Text    Symbol   Factor                       Power
exa     E        1,000,000,000,000,000,000    10^18
peta    P        1,000,000,000,000,000        10^15
tera    T        1,000,000,000,000            10^12
giga    G        1,000,000,000                10^9
mega    M        1,000,000                    10^6
kilo    k        1,000                        10^3
micro   μ        0.000001                     10^-6
nano    n        0.000000001                  10^-9

https://en.wikipedia.org/wiki/Unit_prefix
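
As a rough illustration of how these prefixes apply to data sizes, the sketch below formats a byte count using the largest prefix from the table that fits. The function name `humanize_bytes` is hypothetical, and the sketch assumes decimal units (powers of 1000) rather than binary units (powers of 1024).

```python
# Decimal (SI) prefixes from the table, largest first.
PREFIXES = [
    ("E", 10**18),
    ("P", 10**15),
    ("T", 10**12),
    ("G", 10**9),
    ("M", 10**6),
    ("k", 10**3),
]


def humanize_bytes(n: int) -> str:
    """Format a byte count with the largest decimal prefix that fits."""
    for symbol, factor in PREFIXES:
        if n >= factor:
            return f"{n / factor:.1f} {symbol}B"
    return f"{n} B"


# 12 terabytes, as in the "12+ terabytes of Tweets" figure.
print(humanize_bytes(12 * 10**12))
print(humanize_bytes(999))
```

Values smaller than one kilobyte fall through to plain bytes, so the function never needs the fractional prefixes (micro, nano), which do not apply to byte counts.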

Variety
http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Velocity

3 Vs of Big Data

The 4th V: Veracity

The 5th V: Value
http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data

Relational Database
Here are a few reasons you might choose an SQL database:
• You need to ensure ACID compliance (Atomicity, Consistency, Isolation, Durability). ACID compliance reduces anomalies and protects the integrity of your database by prescribing exactly how transactions interact with the database. Generally, NoSQL databases sacrifice ACID compliance for flexibility and processing speed, but for many e-commerce and financial applications, an ACID-compliant database remains the preferred option.
• Your data is structured and unchanging. If your business is not experiencing massive growth that would require more servers and you’re only working with data that’s consistent, then there may be no reason to use a system designed to support a variety of data types and high traffic volume.
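
To make the atomicity part of ACID concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the names in it, and the `transfer` helper are hypothetical and exist only for illustration. The second transfer fails halfway through, and the database rolls back the part that had already been applied.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move funds atomically: both updates commit, or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)  # succeeds

try:
    # The credit to bob is applied first, then the debit violates the
    # CHECK constraint, so the whole transaction is rolled back.
    transfer(conn, "alice", "bob", 500)
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The failed transfer leaves no trace: bob's temporary credit is undone together with the rejected debit, which is the kind of guarantee that matters for e-commerce and financial applications.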

NoSQL
Non SQL, Non Relational, Not only SQL
NoSQL databases disrupted the database market by offering a more flexible, scalable, and less expensive alternative to relational databases. They were also built to better handle the requirements of Big Data applications.
Examples:
• MongoDB
https://www.mongodb.com/nosql-explained
https://www.kidscodecs.com/database-design/
https://www.analyticsvidhya.com/blog/2015/06/beginners-guide-mongodb/

NoSQL
• Storing large volumes of data that often have little to no structure. A NoSQL database sets no limits on the types of data you can store together, and allows you to add new types as your needs change. With document-based databases, you can store data in one place without having to define what “types” of data those are in advance.
• Making the most of cloud computing and storage. Cloud-based storage is an excellent cost-saving solution, but requires data to be easily spread across multiple servers to scale up. Using commodity (affordable, smaller) hardware on-site or in the cloud saves you the hassle of additional software, and NoSQL databases like Cassandra are designed to be scaled across multiple data centers out of the box without a lot of headaches.
• Rapid development. If you’re developing within two-week Agile sprints, cranking out quick iterations, or needing to make frequent updates to the data structure without a lot of downtime between versions, a relational database can slow you down. NoSQL data doesn’t need to be prepped ahead of time.
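
The first point, storing differently shaped records together without declaring their types in advance, can be sketched in a few lines. This is not a real database: the `collection` list and the `insert` and `find` helpers are hypothetical stand-ins for a document store.

```python
# A stand-in for a document collection: a plain list of dicts.
# Documents in the same collection need not share the same fields.
collection = []


def insert(doc: dict) -> None:
    """Store a document as-is, with no schema check."""
    collection.append(doc)


def find(**criteria) -> list:
    """Return documents whose fields match every given key/value pair."""
    return [
        doc
        for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


insert({"type": "tweet", "user": "ann", "text": "hello"})
insert({"type": "sensor", "device_id": 17, "reading": 21.5})
# A document with a brand-new field needs no schema change or migration.
insert({"type": "tweet", "user": "bob", "text": "hi", "hashtags": ["bigdata"]})

tweets = find(type="tweet")
print(len(tweets))
```

The trade-off is that nothing stops a malformed document from being stored, so validation moves from the database into the application.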

Relational Database vs. NoSQL
NoSQL databases differ from relational DBs in 4 main areas:
• Data models: A NoSQL database lets you build an application without having to define the schema first, unlike relational databases, which make you define your schema before you can add any data to the system. Having no predefined schema makes NoSQL databases much easier to update as your data and requirements change.
• Data structure: Relational databases were built in an era when data was fairly structured and clearly defined by its relationships. NoSQL databases are designed to handle unstructured data (e.g., texts, social media posts, video, email), which makes up much of the data that exists today.
• Scaling: It’s much cheaper to scale a NoSQL database than a relational database because you can add capacity by scaling out over cheap, commodity servers. Relational databases, on the other hand, have traditionally required a single server to host your entire database; to scale, you need to buy a bigger, more expensive server.
• Development model: Many NoSQL databases are open source, whereas many commercial relational databases are closed source with licensing fees baked into the use of their software. With NoSQL, you can get started on a project without any heavy investments in software fees upfront.
https://www.mongodb.com/scale/nosql-vs-relational-databases
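
The scaling point can be illustrated with a toy partitioning scheme: each record's key is hashed to pick one of several servers, so capacity grows by adding servers. This is a minimal sketch under stated assumptions, not how any particular database does it. The `shard_for` and `distribute` helpers are hypothetical, and production systems often use more elaborate schemes such as consistent hashing to limit how much data moves when a server is added.

```python
import hashlib


def shard_for(key: str, num_servers: int) -> int:
    """Map a key to a server index with a stable hash (simple modulo scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def distribute(keys, num_servers: int) -> list:
    """Group keys by the server each one is assigned to."""
    servers = [[] for _ in range(num_servers)]
    for key in keys:
        servers[shard_for(key, num_servers)].append(key)
    return servers


keys = [f"user:{i}" for i in range(1000)]
three = distribute(keys, 3)
four = distribute(keys, 4)  # "scaling out" by adding one server

print([len(server) for server in three])
print([len(server) for server in four])
```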

https://www.upwork.com/hiring/data/sql-vs-nosql-databases-whats-the-difference/
https://www.analyticsvidhya.com/blog/2015/06/beginners-guide-mongodb/

Data Model
A user has friends who may themselves be users. People who have liked or commented (or both) may also be users. This kind of duplication makes it much harder to denormalize an activity stream into a single document.
https://www.analyticsvidhya.com/blog/2015/06/beginners-guide-mongodb/
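
A small sketch of the duplication problem, using made-up records: when user data is embedded in an activity document, the same user appears in several places, and a change such as a rename must touch every copy. Storing references instead avoids the copies at the cost of a lookup on read. All names and helpers here are hypothetical.

```python
# Hypothetical user records, keyed by user ID.
users = {
    1: {"name": "Ann"},
    2: {"name": "Bob"},
}

# Denormalized: the activity document embeds copies of user data.
activity_embedded = {
    "author": {"id": 1, "name": "Ann"},
    "likes": [{"id": 2, "name": "Bob"}],
    "comments": [{"user": {"id": 2, "name": "Bob"}, "text": "Nice!"}],
}

# Referenced: the activity stores only user IDs and resolves names on read.
activity_referenced = {
    "author": 1,
    "likes": [2],
    "comments": [{"user": 2, "text": "Nice!"}],
}


def count_embedded_copies(activity: dict, user_id: int) -> int:
    """Count embedded copies of a user that a rename would have to update."""
    copies = [
        activity["author"],
        *activity["likes"],
        *(comment["user"] for comment in activity["comments"]),
    ]
    return sum(1 for user in copies if user["id"] == user_id)


def liker_names(activity: dict) -> list:
    """Resolve liker IDs to names through the users lookup."""
    return [users[user_id]["name"] for user_id in activity["likes"]]


print(count_embedded_copies(activity_embedded, 2))
print(liker_names(activity_referenced))
```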

Types of NoSQL
• Key-value model: the least complex NoSQL option, which stores data in a schema-less way that consists of indexed keys and values. Examples: Cassandra, Azure, LevelDB, and Riak.
• Column store (or wide-column store): stores data tables as columns rather than rows. It’s more than just an inverted table: sectioning out columns allows for excellent scalability and high performance. Examples: HBase, BigTable, HyperTable.
• Document database: takes the key-value concept and adds more complexity. Each document in this type of database has its own data and its own unique key, which is used to retrieve it. It’s a great option for storing, retrieving, and managing data that’s document-oriented but still somewhat structured. Examples: MongoDB, CouchDB.
• Graph database: for data that’s interconnected and best represented as a graph. This model can represent a great deal of complexity. Examples: Polyglot, Neo4j.
https://www.upwork.com/hiring/data/sql-vs-nosql-databases-whats-the-difference/
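
To compare the four models side by side, the sketch below expresses one tiny, made-up data set in each shape using plain Python structures. These are illustrations of the data shapes only, not of how any listed product stores data internally.

```python
# Key-value: an opaque value looked up by its key.
key_value = {"user:1": '{"name": "Ann"}', "user:2": '{"name": "Bob"}'}

# Column store: values grouped by column rather than by row.
columns = {"id": [1, 2], "name": ["Ann", "Bob"]}

# Document: each key maps to a structured document with its own fields.
documents = {
    1: {"name": "Ann", "follows": [2]},
    2: {"name": "Bob", "follows": []},
}

# Graph: nodes plus explicit edges between them (1 follows 2).
edges = {1: {2}, 2: set()}


def followers_of(node: int) -> set:
    """Walk the edges in reverse to find who follows `node`."""
    return {src for src, dsts in edges.items() if node in dsts}


def name_from_columns(user_id: int) -> str:
    """Look up a name in the columnar layout by position."""
    return columns["name"][columns["id"].index(user_id)]


print(followers_of(2))
print(name_from_columns(2))
```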

https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/

Column Store
• In a Column Store database, data is stored in columns, in contrast to being stored in rows as is done in most relational database management systems.
• A Column Store is composed of one or more Column Families that logically group specific columns of the database. A key is used to identify and point to a number of columns, with a keyspace attribute that defines the scope of this key. Each column contains name-value tuples, ordered and comma-separated.
• Column Stores have fast read/write access to the information. Rows that correspond to a single column are stored as a single disk entry. This means faster access during read/write operations.
• The most popular databases that use the column store include Google’s BigTable, HBase, and Cassandra.
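
The logical structure described above (a row key pointing to column families, each holding name-value pairs) can be sketched with nested dictionaries. The data and the `read_column` helper are hypothetical, and this shows only the logical model, not the on-disk layout of any real column store.

```python
# Row key -> column family -> {column name: value}.
# Rows may carry different columns within the same family.
table = {
    "user:1": {
        "profile": {"name": "Ann", "city": "Oxnard"},
        "stats": {"logins": 12},
    },
    "user:2": {
        "profile": {"name": "Bob"},
        "stats": {"logins": 3, "posts": 7},
    },
}


def read_column(family: str, column: str) -> dict:
    """Collect one column across all rows, skipping rows that lack it."""
    return {
        row_key: families[family][column]
        for row_key, families in table.items()
        if column in families.get(family, {})
    }


logins = read_column("stats", "logins")
total_logins = sum(logins.values())
print(logins, total_logins)
```

Reading one column touches only that column's values, which is why aggregate queries over a few columns tend to suit this model.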

RDBMS vs. NoSQL

JSON
• JSON (JavaScript Object Notation)
  http://json.org/example.html
  https://www.w3schools.com/js/js_json_objects.asp
• Retrieving and Updating JSON Objects in SQL Server 2016 (link)

Cars as an array:

    {
      "name": "John",
      "age": 30,
      "cars": [ "Ford", "BMW", "Fiat" ]
    }

Cars as a nested object:

    {
      "name": "John",
      "age": 30,
      "cars": {
        "car1": "Ford",
        "car2": "BMW",
        "car3": "Fiat"
      }
    }
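
As a brief illustration, the two JSON examples can be parsed with Python's standard `json` module. The sketch assumes the nested keys are `car1`, `car2`, and `car3`; the variable names are chosen here for illustration. An array is read by position, while a nested object is read by key.

```python
import json

# The same person, with cars stored as an array and as a nested object.
as_array = json.loads(
    '{"name": "John", "age": 30, "cars": ["Ford", "BMW", "Fiat"]}'
)
as_object = json.loads(
    '{"name": "John", "age": 30, '
    '"cars": {"car1": "Ford", "car2": "BMW", "car3": "Fiat"}}'
)

# Arrays are indexed by position; nested objects are indexed by key.
second_from_array = as_array["cars"][1]
second_from_object = as_object["cars"]["car2"]

print(second_from_array, second_from_object)
```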