RDBMS Fundamentals Indexing Relational data structures Indexes examples

Objectives Understand the benefits and costs of indexes. Understand how indexes are implemented. Identify the different index structures: – Understand the B+ tree. Understand features and guidelines to use: – Implicit versus Explicit indexes. – Unique versus Duplicate indexes. – Simple versus Composite indexes. – Cluster indexes. – Functional indexes. Understand decision criteria for Index storage: – Attached versus Detached indexes. – Index fragmentation. Select an appropriate fill factor for an index. Know how to alter, drop, rename and maintain an index. Understand how Indexing works on IBM Informix. 2 © 2011 IBM Corporation

Sequential Scans A Sequential Scan or Table Scan: – Reads all the pages that belong to the table, returns all the rows of the table. – Starts with the first page of the table, and orderly travels across all the devices that contain pages of the table until retrieve the last page. – Sequential scan of a large table is an expensive operation in OLTP systems. • I/O is expensive (mechanic) operation. • Also fills up the memory buffers with unnecessary pages, affecting the buffer profiling and the performance of the database server. – Small sequential scans (seq scans in small tables) are acceptable in OLTP. 3 © 2011 IBM Corporation

When do Sequential Scans work well? In any case of access (random or sequential) on very small tables: – The table fits in just a few pages and can be accessed with minimum I/O. Reporting-type of queries, where all the rows of the table are needed – Random-access is not an option because the whole table is needed to process the results: • Example: Queries using aggregate functions (max, min, avg, sum) and no filters. – Result set is large enough or non-selective enough, so scanning the whole table is cheaper than randomly access every row: • Non-restrictive filters using low cardinality columns or low selectivity filters. 4 © 2011 IBM Corporation

Sequential Scans – The need for indexes (1) Imagine this scenario: Filter rows in a table based on a condition: – Lookup case, search for rows meeting a restrictive condition (filter). – Just a few rows meet the condition (e. g. <= 5% of total data). – The filter will retrieve scattered rows, not placed together in disk. We should NOT use a sequential scan to do a random or nonsequential access with high-selectivity filters: – The bad performance will be more evident as we use a more restrictive / selective filter (for instance, on columns with high-cardinality). 5 © 2011 IBM Corporation

Sequential Scans – The need for indexes (2) Imagine this scenario: Join between two tables, at least one is large – The tables are related in just a few rows (equi-join). The iterations in a nested loop join scenario will severely decrease the performance – The # of access reads will be multiplied by the rows sequentially scanned. – The performance depends on the order taken for the tables in the join and whether or not an index will be used in any of the tables. 6 © 2011 IBM Corporation

Indexes – The Basics (1) An index is a structure or object in the database. Database indexes are similar to indexes in books or file cabinets. Indexes provide fast access to rows in tables meeting certain condition in a query. Minimize I/O, improving performance, specially in random access. It is a dynamic structure: Changes as the data in the table changes. A table can have several indexes, to satisfy several queries. 7 © 2011 IBM Corporation

Indexes –The Basics (2) Unique indexes are necessary on column(s) that must be unique. Presence of an index can allow the optimizer to speed up a query. The optimizer decides whether to use an index or not – We can force the optimizer to use or avoid an index. The optimizer can use an index in the following ways: – To replace sequential table scans with nonsequential/random access. – To avoid reading row data when processing expressions that name only indexed columns. – To avoid a sort (including building a temporary table) when executing the GROUP BY and ORDER BY clauses. An index on the appropriate column can save thousands, tens of thousands, or in extreme cases, even millions of disk operations during a query. Indexes entail costs (in space, processing and maintenance). 8 © 2011 IBM Corporation

Query Speed Comparison: Seq scan vs Index scan Example: Suppose you have two tables: – tab 1: with 200 rows (small table), and – tab 2: with 500, 000 rows (large table) • Both with a unique index on the joining column. – Assume 1 row per page, therefore 1 data read (I/O) needed per row: • For tab 1: Assume it takes 2 index reads (+ 1 data read per row) => 3 I/Os For tab 2: Assume it takes 3 index reads (+ 1 data read per row) => 4 I/Os. – A simple select to find related rows from the two tables is: – SELECT * FROM tab 1, tab 2 where tab 1. col 1=tab 2. col 2 Results: Depending on the database optimizer’s decision – How many I/O (disk) accesses (index + data reads) needed? • If select from the tab 1 first and then joins to tab 2 using the index: – 1, 000 disk reads = 200 for tab 1 + 200 x 4 for tab 2. • If selects from tab 2 first, and then joins to tab 1 using the index: – 2 million disk reads! = 500, 000 for tab 2 + 500, 000 x 3 for tab 1. • If no indexes used at all: Sequential scan in both tables: – around 1 billion disk reads! = 200 for tab 1 x 500, 000 for tab 2. 9 © 2011 IBM Corporation

Typical Index Structures 10 Binary Search Tree (BST) B-Tree B+ Tree: DB World’s favorite

B-Tree (1) – Analogy with a File Cabinet: Organized storage 11 B-Tree index ©

B-Tree (2) Allows several keys in one index page / node – Keeps similar valued records together on a disk page. – This takes advantage of locality of reference over binary search trees (BST). Allows nodes to be incompletely filled – Every node in the tree will be full at least to a certain minimum percentage. – Some space is wasted (nodes are not entirely full). – Requires less re-balancing than binary search trees (BST). – Reduces the number of disk fetches necessary during a search. 12 © 2011 IBM Corporation

B Tree (3) They are always in perfect balance, as all leaf nodes are at the same depth. Keeps data sorted and allows searches, insertions, and deletions in logarithmic amortized time. Unlike self-balancing BST, it is optimized for systems that read and write large blocks of data. Grows from the bottom: When a node is over-full, it is split and the added node is put one level up. Deletions are the reverse of additions. 13 © 2011 IBM Corporation

B-Tree (4) In B-trees: – The root node points down to branch nodes. – Branch nodes point down to leaf nodes. – Leaf nodes address the actual data rows. But: B-tree has a problem – It is inefficient to search for a range of data values. – Not able to be used to move laterally through the tree, as well as up and down: • If all adjacent sibling nodes were connected, you could scan an entire tree level with minimum effort. This is done by B+ trees, used by some of the major database vendors. 14 © 2011 IBM Corporation

B-plus Tree (B+ tree) (1) It is an improved B-tree. The preferred index structure: – Widely used in databases. Used in Informix. Levels: – The topmost level of the index structure hierarchy contains a single root page. – Each branch page has entries that point to pages in the next level of the index and to the immediate peer nodes (next nodes at the same level). – Each leaf page contains a list of index entries that point to rows in the table. Index entries in leaf pages/nodes are sorted in key-value order: – An Index entry consists of a key and one or more row pointers. – The key is a copy of the indexed columns from one row of data. – A row pointer provides an address used to locate a row that contains the key (rowid). A sample b+ tree where only forward scan is available 15 © 2011 IBM Corporation

B-plus Tree (B+ tree) (2) – Informix In Informix: – B+ tree index structures. – An index leaf is formed of: • A key value (the index key). • Row ID’s to find the data row(s). • A delete flag. – The Index root and branch nodes: • Similar to the leaves, but instead of rowids, they have the IDs of other index nodes (pointers to other nodes). – Each Index node/page is doubledlinked with its previous and next index peer node, to optimize: • Key-only and Range searches. • Both Forward and Reverse Order searches. An index key contains the value of the column we want to search for: – e. g. : first_name, SSN, zip_code. 16 © 2011 IBM Corporation

B+Tree (3) – Informix In B+ trees: – The root node points down to branch nodes. – Branch nodes point down, left and right. – Leaf nodes point left, right, and down to data via rowids. The ability to move right or left from a node to its adjacent node puts the plus in B+ tree: – All Informix indexes for non-multidimensional data are B+ trees. 17 © 2011 IBM Corporation

Forest of Trees Index A larger B-Tree index divided into smaller subtrees called “buckets. ” You define which columns are used to hash to a “bucket. ” You define the number of “buckets. ” Each “bucket” acts as sort of a mini index. Traditional Btree Index: – create unique index security_idx on l SECURITY( S_SYMB, S_CO_ID ) in dbs; Forest of Trees Index: – create unique index security_idx on SECURITY( S_SYMB, S_CO_ID ) in dbs hash on ( S_SYMB ) with 1000 buckets; 18 © 2011 IBM Corporation

Traditional B-Tree Index Key Organization 19 © 2011 IBM Corporation

Forest of Trees Index Key Organization 20 © 2011 IBM Corporation

Why Use a Forest of Trees Index? 21 Reduces rootnode contention on small to medium size tables. Reduces the B-Tree Index depth on larger tables: – Reducing a 4 -level to a 3 -level B-Tree reduces buffer reads 25 percent. onstat -g spi output shows spin locks with large wait numbers and loops: Num Waits Num Loops Avg Loop/Wait Name 1427723 60715268 42. 53 fast mutex, 7: bf[299] 0 x 3400006 security_s_symb_s_co_id 879324 40682456 46. 27 fast mutex, 7: bf[24] 0 x 3400003 477038 19177763 40. 20 fast mutex, 3: bf[391] 0 x 320000 a 436564 17239046 39. 49 fast mutex, 7: bf[26] 0 x 3400004 security_s_symb_s_num_out 330244 14563521 44. 10 fast mutex, 7: bf[30] 0 x 3400008 security_s_symb_s_ex_id_s 176874 5183331 29. 31 fast mutex, 7: bf[38] 0 x 3400008 security_s_symb_s_ex_id_ 27050 3636325 134. 43 mutex lock, name = log 16555 1301066 78. 59 fast mutex, lockhash[3104] 18101 885933 48. 94 fast mutex, AIOSHCB lock 17274 879611 50. 92 fast mutex, 3: bf[319] 0 x 1800002 22507 744135 33. 06 fast mutex, 3: bf[655] 0 x 320000 a 18791 650429 34. 61 fast mutex, 3: bf[694] 0 x 320000 a 17600 488946 27. 78 fast mutex, 7: bf[54] 0 x 3400008 security_s_symb_s_ex_id_ 0 x 108 b 7908000 0 x 108 b 74 bc 000 pk_fk_last_trade 0 x 10177 fe 2000 daily_market_idx 0 x 108 b 74 c 4000 0 x 108 b 74 d 4000 0 x 108 b 74 f 4000 0 x 10177 f 52000 trade_history_idx 0 x 101781 f 2000 daily_market_idx 0 x 10178240000 daily_market_idx 0 x 108 b 7534000 Large numbers of waits and loops on indexes as monitored by onstat -g spi | sort +1 -nr may indicate a need for a FOT index. © 2011 IBM Corporation

B-Tree Functions Supported in FOT B-Tree functionality supported: – Primary key – Foreign key

B-Tree Functions Not Supported in FOT B-Tree functionality that not is supported: – Aggregates such as max() and min() – Clustering – FILLFACTOR – Range scans on columns listed in the HASH ON column list. – Range scan example: • create index …( col 1, col 2 ) … hash on ( col 1) with 100 buckets; • select * …. where col 1 > 0; /* not allowed */ • select * …. where col 1 = 100; /* allowed */ • select * …. where col 1 = 100 and col 2 > 300 /* allowed */ 23 © 2011 IBM Corporation

B-plus Tree (B+ tree) (4) – Informix In Informix: – Forward (Ascending) and Reversed-order (Descending) Index scan available using one single index. Examples (second one is a Key-Only search): • Select fname, lname from customer order by customer_num; • Select customer_num from customer order by customer_num desc; – Range searches get improved with this structure too. Ex: • Select customer_num, fname, lname from customer where customer_num between 50 and 300; • Select customer_num from customer where customer_num >= 50 and customer_num <= 300; 24 © 2011 IBM Corporation

B+ Tree Splits 25 © 2011 IBM Corporation

R-Tree (1) An R-tree index is a secondary data structure (or access method) that organizes data access similar to a B-tree index. Used in Informix, for multi-dimensional and spatial data. R-tree is specifically designed to index table columns that contain the following types of data: – Multidimensional data: • Spatial data in two or three dimensions – An extra dimension that represents time could also be included. • Combinations of numerical values treated as multidimensional values: – Such as a configuration for a house that includes the number of stories, the number of bedrooms, the number of baths, the age of the house, and the sqf – Range values: • Such as the time of a television program (9: 00 P. M. to 9: 30 P. M. ) • (X, Y) coordinates of geographical data. A common real-world usage for an R-tree might be: "Find all museums within 2 miles (3. 2 km) of my current location". 26 © 2011 IBM Corporation

R-Tree (2) Simple example of an R-tree for 2 D rectangles: 27 © 2011

Reading through a B+ tree Index (Index scan) When you access a row through an index: – You read the B+ tree starting at the root node and follow the nodes down to the lowest level, which contains the pointer to the data. In the example below, 3 index read operations are needed to find the pointer to the data. Keep key size to a minimum for two reasons – To allow a single index page in memory to hold more key values: • Reduces the number of read operations necessary to look up several rows. – To have fewer B+ tree levels in the index, very important for performance: • An index with a 4 -level tree needs 1 more read per row than an index with a 3 -level tree • If 100, 000 rows read in 1 hour -> 100, 000 fewer reads needed to obtain the same data. For Informix (IDS), the size of a node is the size of one page. 3 level b+ tree. # of page reads needed to find a row: If key-only scan (no need to get the row data page): 3 (index-only) • If we need to get the data pages: 4 (3 index + 1 data) 28 © 2011 IBM Corporation

Placement of an Index – Informix Storage Review Physical Storage Units: Chunk – Largest unit of physical disk dedicated to Informix DB server data storage. – Provide DBAs with a significantly large unit for allocating disk space. – Max size of individual chunk is 4 TB. – Number of allowable chunks is 32, 766. – It is a disk unit: • Raw device (disk partition), or • Regular file system file (cooked file). – Contains a certain number of pages. – Pages from different tables/indexes can be allocated in the same chunk. Page – Minimum I/O unit. – Commonly 2 KB or 4 KB, configurable up to 16 K. 29 © 2011 IBM Corporation

Placement of an Index – Review of Informix storage Physical Storage Units (cont): Extent – Group of contiguous pages within a chunk that store data for a given table, index, table fragment or index fragment. – When creating a table, you specify its initial extent size and next extent size. – As the table grows, it can have multiple extents. – Important to manage appropriate extent sizes and number of extents. – Default size configurable in Informix configuration file (ONCONFIG): • If not set, default is 16 k. – An extent contains pages of a single table, index or fragment. 30 © 2011 IBM Corporation

Placement of an Index – Informix Storage Review Logical Storage Units Tblspace – Logical collection of all the extents (not necessarily contiguous) allocated to a specific table, index or fragment. – It can include extents stored on a single chunk or on multiple chunks. – A tblspace, however, is always contained within a single dbspace. Dbspace – Logical collection of chunks. – Form a pool of disk space that is used to store DBs, tables, indexes or fragments. – A single dbspace can contain pages of different tables/indexes. – Special purpose dbspaces: • Root dbspace: stores system’s catalog information and databases. • Blobspace: stores simple binary large objects. • Sbspace: stores smart binary large objects. • Temporary dbspaces: for non-logged temp data. 31 © 2011 IBM Corporation

Placement of an Index – Informix Storage Review CREATE DATABASE stores_demo WITH LOG IN dbs 1; CREATE TABLE customer (customer_num integer, …) IN dbs 2; CREATE INDEX ix_cust ON customer (customer_num) IN dbs 3; You can place different database objects in different disk spaces: – To balance I/O work across the disk devices and controllers available Example: – The database stores_demo in a data dbspace dbs 1 • If no dbspace is specified, the database is created in rootdbs (the Root dbspace). – The table customer in a data dbspace dbs 2 • If no dbspace is specified, the table is created in a new tblspace where its database resides. – The index ix_cust in a data dbspace dbs 3 • If no dbspace is specified, the index is created in a new tblspace in the same dbspace as its table. • When an index is created on an empty table, just the root node of the B-tree is created. 32 © 2011 IBM Corporation

Attached vs Detached Indexes (1) In the past, indexes used to be attached to the table – meaning they did not have a separate tblspace, and the index pages were interleaving with the data pages of the table in the table’s tblspace. Now, on Informix, all indexes are detached – meaning index extents are stored separately (in a new tblspace) from table extents, even if they are placed within the same dbspace as the table. 33 © 2011 IBM Corporation

Attached vs Detached Indexes (2) An index can be placed in a separate dbspace. Example, this index is stored in a separate dbspace called cust_ix_dbs: CREATE INDEX customer_ix – ON customer (zipcode) – IN cust_ix_dbs; By default, index extents are created in the dbspace that holds the data (table) extents. A detached index can have a fragmentation strategy different from the one used by its table, that you set up explicitly with CREATE INDEX. 34 © 2011 IBM Corporation

Bidirectional Traversal of Indexes (1) ASC and DESC keywords specify the order to maintain the index. When creating an index on a column, if you omit or specify the ASC keyword, Informix stores the key values in ascending order: – Default column order. – From the smallest to the largest key. – Example on customer’s last name: Albertson, Beatty, Currie. § create index ix_cust on customer(lname asc); § or § create index ix_cust on customer(lname); Use DESC keyword for Informix to store the key values in descending order: – From the largest to the smallest key. – Example on customer’s last names: Currie, Beatty, Albertson. § create index ix_cust on customer(lname desc); 35 © 2011 IBM Corporation

Bidirectional Traversal of Indexes (2) Informix’s bidirectional traversal capability of the database server lets you create just one index on a column and use that index for queries that specify sorting of results in either ascending or descending order of the sort column. 36 © 2011 IBM Corporation

Implicit vs Explicit Index Implicit indexes are created when a constraint (primary key, foreign key, unique constraint) is defined that cannot use an existing index – You cannot specify a dbspace location, fragmentation strategy or fill factor for the index. – Implicit indexes are created in the same dbspace as the database: • CREATE TABLE tab 1 ( • col 1 INTEGER, • col 2 INTEGER, • col 3 CHAR(25), • PRIMARY KEY (col 1)) • IN table 1 dbs; Explicit indexes are created using CREATE INDEX statement – It is recommended to explicitly create indexes that exactly match the referential constraint and then use ALTER TABLE to add the constraint: • The constraint will use the existing index instead of implicitly creating one • CREATE TABLE tab 1 (col 1 INTEGER, col 2 INTEGER, col 3 CHAR (25)) • IN table 1 dbs; • CREATE INDEX index 1 ON TABLE table_name(col 1) IN idx 1 dbs FILLFACTOR 70; • ALTER TABLE tab 1 ADD CONSTRAINT PRIMARY KEY (col 1); 37 © 2011 IBM Corporation

Unique vs Duplicate Index Unique indexes allow only one occurrence of a value in the indexed column – Created for columns whose values cannot be repeated within the table. – Used to enforce primary key (PK)’s uniqueness or unique constraints. – An index entry is created for each row in the table, prevents duplicates. – Ex: customer_num, SSN, employee_id A non-unique, duplicate or secondary index is an index based in a non-key attribute, and allows identical values for multiple rows in an indexed column – Avoid having highly duplicated indexes • Indexes become less effective as they are less unique. – Ex: city, first_name, last_name, order_date, zipcode. CREATE UNIQUE INDEX cust_num_ix ON customer(customer_num); or CREATE DISTINCT INDEX cust_num_ix ON customer(customer_num); CREATE INDEX cust_lname_ix ON customer(lname); 38 © 2011 IBM Corporation

Simple vs Composite Index A simple index lists only one column (or for IDS, only one column or function) in its index key specification. Example: – CREATE INDEX cust_lname_ix – ON customer(lname); Any index listing two or more columns is a composite index: – List the columns in the order from most frequently used to least frequently used – Facilitates multiple column joins – Increases uniqueness of indexed values – Example: – CREATE INDEX ix_items ON – items(manu_code, stock_num); 39 © 2011 IBM Corporation

Taking advantage of a Composite Index On Informix, the optimizer can use a composite index (one that covers more than one column) in several ways: – You can use an index on columns a, b, and c (in that order) in these ways: – CREATE INDEX ix_sample ON sample_table (a, b, c); – To locate a particular row using partial-key search: – – WHERE a=1 a>=12 AND a<15 a=1 AND b < 5 a=1 AND b = 17 AND c >= 40 – The following examples of filters cannot use that composite index: – WHERE b=10 – WHERE c=221 – WHERE a>=12 AND b=15 – To replace a table scan by a key-only search: • when all of the desired columns are contained within the index. – To join column a, columns ab, or columns abc to another table. – To implement ORDER BY or GROUP BY on columns a, ab or abc • but not on b, c, ac, or bc. 40 © 2011 IBM Corporation

Cluster Indexes (1) Used to physically order records of the table according to the index, remove extents interleaving – making pages in the table contiguous, and avoid sorts. Use a cluster index: – In static tables, tables subject to few or no modifications, and – Where the table is frequently read. CLUSTER physically reorders the rows of the table: – Informix rewrites the table data rows to match the index order. – Therefore, each table can have only ONE cluster index. – Ex: Create an index on a customer table and physically order the rows according to their last name values, in (by default) ascending order: • CREATE CLUSTER INDEX ix_cust ON customer(lname); 41 © 2011 IBM Corporation

Cluster Indexes (2) Over time, Informix does not maintain clustering of the data rows as new rows are inserted or as existing key values are updated: – If the table is modified, the benefit of an earlier cluster will disappear as rows are added in space-available, not sequential, order. – Cluster indexes are most effective on relatively static tables but less effective on very dynamic tables. – You can recluster the table to regain performance by issuing another ALTER INDEX TO CLUSTER statement on the clustered index: • ALTER INDEX ix_cust TO CLUSTER; The TO NOT CLUSTER option drops the cluster attribute on the index name without affecting the physical table. 42 © 2011 IBM Corporation

Functional Indexes (1) A functional index is one in which all keys derive from the results of a function. The functional index can be a B-tree index, an R-tree index, or a user-defined secondary-access method. R-Tree example: If you have a column of pictures, for example, and a function to identify the predominant color, you can create an index on the result of the function: – Such an index would enable you to quickly retrieve all pictures having the same predominant color without re-executing the function. 43 © 2011 IBM Corporation

Functional Indexes (2) The function must be a user-defined function (UDF) – You cannot create a functional index on any built-in function of SQL. – But you can create a UDF that calls a built in function and use this UDF as the index key of a functional index. B-Tree examples: – invalid (upper is an Informix built-in function): – CREATE INDEX ix 1 on state (UPPER(sname)); – valid (define UDF myupper as not variant): – – – – 44 CREATE FUNCTION myupper (v_value char(15)) RETURNING char(15) with (not variant); define r_value char(15); execute function upper(v_value) into r_value; return r_value; END FUNCTION; create index ix 1 on state (myupper (sname)); © 2011 IBM Corporation

Index Fill Factor (1) The DBA can specify the percentage of each page/node that the index will fill during index creation. This is the index fill factor. The index fill factor is not maintained over the life of the index. Works only for index build. 45 © 2011 IBM Corporation

Index Fill Factor (2) The Informix onconfig FILLFACTOR parameter sets the system default and is used by all indexes created in the system: – If the FILLFACTOR is not specified the default is 90 Unless all table indexes receive the same type of activity it is recommended to use the FILLFACTOR option in the CREATE INDEX statement. Ex: § CREATE INDEX state_code_idx ON state(code) § FILLFACTOR 80; 46 © 2011 IBM Corporation

Index Fill Factor (3) A high fill factor will produce an initially compact/dense index, providing more efficient caching and reducing the number of pages to read when retrieving rows: – Use a FILLFACTOR of 100 for tables that are receive selects (read only) or deletes to minimize the merging and shuffling pages as keys are removed. Creating an index with a lower fill factor will produce a sparse index (with more pages and probably levels to read), which can delay the need for node (page) splitting and the accompanying performance impact. 47 © 2011 IBM Corporation

Creating indexes Examples: § CREATE UNIQUE INDEX ix_orders ON § orders(orders_num) § IN idx_dbs; § CREATE INDEX ix_items ON § items(manu_code, stock_num); § CREATE UNIQUE CLUSTER INDEX § ix_manufact ON manufact(manu_code) § FILLFACTOR 80; § CREATE INDEX ix_man_stk ON § items(manu_code desc, stock_num); § CREATE INDEX order_ix 1 ON orders § (order_num, order_date desc); 48 © 2011 IBM Corporation

Altering, Dropping, and Renaming Indexes Examples: § ALTER INDEX ix_man_cd TO CLUSTER; § RENAME

Index Partitioning (Fragmentation) Fragmentation is the distribution of data or index from one table across separate dbspaces (logical groups of disk storage devices). In Informix, you can create several partitions of a table/index in the same and/or different dbspaces. Each fragment is stored in its own tablespace (group of extents). 50 © 2011 IBM Corporation

Advantages of Fragmentation Parallel scans and other parallel operations: – Several disk devices can be read/processed in parallel. Balanced I/O: – Fragment discrimination: We only do I/O in the portion/fragment that contains the data we are asking for. Higher availability: – If a dbspace containing a fragment is unavailable, we may still read the data in the other fragments that are up and available. Scalability (ability to grow in size): – There is a limit in the maximum amount of data per fragment/dbspace, by fragmenting the table/index, we are giving the chance to grow more. 51 © 2011 IBM Corporation

Types of Distribution Schemes Round Robin: Makes sense in tables, not allowed in index: § § § CREATE TABLE table 1( col_1 SERIAL, col_2 CHAR(20)) FRAGMENT BY ROUND ROBIN IN dbspace 1, dbspace 2; Expression-based: Makes sense in tables and indexes: CREATE TABLE t 1( col_1 SERIAL, col_2 CHAR(20)) FRAGMENT BY EXPRESSION col 1 <= 100 IN dbspace 1, col 1 > 100 AND col 1 < 500 IN dbspace 2 REMAINDER IN dbspace 3; 52 © 2011 IBM Corporation

Fragment by Expression – Using Hash Functions Distribute the data of table 1 into 3 fragments using hash function mod: § CREATE TABLE table 1( § customer_num SERIAL § lname CHAR(20). . . ) § FRAGMENT BY EXPRESSION § MOD(customer_num, 3) = 0 IN dbspace 1, § MOD(customer_num, 3) = 1 IN dbspace 2, § MOD(customer_num, 3) = 2 IN dbspace 3; 53 © 2011 IBM Corporation

Fragmented/Partitioned Indexes 54 © 2011 IBM Corporation

Fragmentation – CREATE INDEX Statement By expression: § CREATE INDEX idx 1 ON table 1(col_1) § FRAGMENT BY EXPRESSION § col_1 < 10000 IN dbspace 1, § col_1 >= 10000 IN dbspace 2; No fragmentation scheme is specified: § CREATE INDEX idx 1 ON table 1(col_1) § IN dbspace 1; Partitioned: § CREATE INDEX idx 1 ON table 1(col_1) § PARTITION BY EXPRESSION § PARTITION part 1 col_1 < 10000 IN dbspace 1, § PARTITION part 2 col_1 >= 10000 IN dbspace 2; 55 © 2011 IBM Corporation

Fragmenting an Index Discussion: Is this statement valid? § CREATE UNIQUE INDEX ia ON tabl(col 1) § FRAGMENT BY EXPRESSION § col 2 <= 10 IN dbsp 1, § col 2 > 10 AND col 2 <= 100 IN dbsp 2, § col 2 > 100 IN dbsp 3; 56 © 2011 IBM Corporation

Parallel Index Build (Informix) IDS implements a sophisticated design to enable extremely fast index builds. This design divides the index build process into three (3) subtasks, for vertical parallelism: – First, scan threads read the data from disk. – Next, the data is passed to the sort threads. – Finally, the sorted sets are appended into a single index tree. The indexes are built in parallel even if Parallel Data Query (PDQ) PRIORITY is set to 0, but you can tune this environment variable for performance gain. 57 © 2011 IBM Corporation

Informix SYSINDEXES System Catalog The sysindexes table describes each index in the database: § § SELECT sysindexes. * FROM sysindexes, systables WHERE tabname = "items" AND systables. tabid = sysindexes. tabid; Results: § § § § idxname owner tabid idxtype Clustered part 1 part 2 part 3 part 4 … part 16 levels leaves nunique clust § § … 3 row(s) retrieved. <- items table has 3 indexes defined on it, although we are showing just partial results 58 104_10 <- implicit system-generated index, ‘ 104_10’ admin 104 <- table id for table ‘items’ U <- unique index, so the index was created to implement a unique constraint <- not clustered 1 2 0 0 0 1 <- 1 level 1. 000000 <- 1 leaf node / page 6. 000000 1. 000000 © 2011 IBM Corporation

Steps to Indexing (1) Determine if the index will be attached, detached or fragmented – If fragmented, determine the expression. Determine the FILLFACTOR for each index. Determine other characteristics of the index, like: – Simple or composite, and default order of the columns. – Cluster or not clustered. – Unique or duplicate. 59 © 2011 IBM Corporation

Steps to Indexing (2) Calculate the space requirements for the index and temporary dbspace to build the index: – Determine the optimal locations of the index and temporary dbspaces (defined in DBSPACETEMP) on disk. Set the PDQPRIORITY and PSORT_NPROCS environment variables: – To optimize the parallel index parallel builds. Check and optimize configuration parameters if needed: – B-tree cleaners (threads that automatically clean up deleted keys). – Buffers (to cache I/O of data and index pages). – Temporary dbspaces (for sorts in index builds). – Parallelism (to speed up index builds and other engine operations). 60 © 2011 IBM Corporation

Estimating Extent Size of an Index It is possible to estimate the index size before creation. For an attached index, the database server uses the ratio of the index key size to the row size: § Index extent size = § (index_key_size/table_row_size) * table_extent_size For a detached index, the database server uses the ratio of the index key size (plus some overhead in bytes) to the rowsize: § Detached Index extent size = § ((index_key_size + 13)/table_row_size) * table_extent_size After created, run a query on the system catalog tables or run oncheck –pe or oncheck -p. T mydb: mytab to get the index size. 61 © 2011 IBM Corporation

Benefits of Indexing Use filtering to reduce the number of pages read (I/O). Eliminate sorts implementations (as temporary tables). Ensure uniqueness of key values (constraints). Reduce the number of pages read (key-only reads). 62 © 2011 IBM Corporation

Costs of Indexing Disk Space Costs Processing and Maintenance Costs 63 © 2011 IBM

Summary – Benefits and Costs of Indexing Benefits – Faster access: • Minimizes I/O, avoids implementing sorts. – Enforce constraints and business rules: • Primary key, foreign keys, unique values. Costs (Overhead) – Space cost: Require disk space: • There are ways to estimate the size an index will have. – Time cost for maintenance: Insert, update, delete operations: • Both the data and indexes need to be updated. 64 © 2011 IBM Corporation

Indexing Guidelines and Best Practices (1) Create an index on: – Join columns (used in WHERE clause for multi-table joins). – Highly-selective filter columns (columns used in WHERE clauses). – Columns frequently used for ordering and grouping (sorting, columns used in ORDER BY and GROUP BY clauses). – Also columns used in UNIQUE / DISTINCT clauses. – And columns used in UNION statements (where we combine queries). Avoid highly duplicate indexes: – Prefer indexes on high-selectivity columns. – You can use composite indexes to increase index selectivity. 65 © 2011 IBM Corporation

Indexing Guidelines and Best Practices (2) Limit index in highly-updated table/columns and volatile/temp tables: – Remember overhead cost of index maintenance (btree cleaner). Limit number of indexes on tables. Keep key size small: – Long character strings are not good candidates for indexes. – Try to use small columns, like smallint, integer, date, char(<small n>). Keep the number of indexes small: – Create only the indexes you really need. 66 © 2011 IBM Corporation

Indexing Guidelines and Best Practices (3) Index only if you need to access <= 4 or 5 % of the data in a table: – The alternative to using an index to access row data in a table is to read the entire table sequentially from top to bottom (sequential scan or full table scan). – Sequential scans are better for queries that require a high percentage of the data in a table. – Remember: Using indexes to retrieve rows requires two reads: an index read followed by a table/data read. 67 © 2011 IBM Corporation

Indexing Guidelines and Best Practices (4) Avoid indexes on relatively small tables: – Sequential table scans are just fine for small tables. – There is no need to store both table and index data for small tables. Create primary keys (or, even better, explicit unique index followed by a primary key constraint) for all tables, as possible: – Remember: Even if you don’t explicitly create an index for a primary, foreign key and unique constraint, Informix will implicitly create an index for you. Need a functional index? 68 © 2011 IBM Corporation

Indexing Guidelines and Best Practices (5) Use composite indexes to increase uniqueness: – Composite indexes may need to be used where single-column values may not be unique by themselves. – Remember: In composite indexes, the driving is the first column, and this should be the most selective column in the index. Use clustered indexes to speed up retrieval: – Remember: Only a single cluster index is allowed per table, and will reorder the data rows according to the index. See guidelines for cluster. 69 © 2011 IBM Corporation

Indexing Guidelines and Best Practices (6) Disable indexes before large DML operations: – Massive loads, inserts, deletes, updates. – You can re-enable the indexes and constraints after the large DML operation is finished. Remember: – The index on a table should be based on the types of queries you expect to occur against the table's columns. – Always give priority to the most frequently executed queries. – More indexes than the ones you need will produce the cost-based query Optimizer to do additional work to decide which index to use. 70 © 2011 IBM Corporation

Troubleshooting (1) When an Index should be used but it is not: – To see if an index is used in a query plan: • On Informix: use SET EXPLAIN or visual explain, analyze query plan and cost. – Check and repair possible Index Corruption: • On Informix: oncheck –c. I utility (-c: check, -I: indexes). – Update Optimizer’s statistics: • On Informix: update statistics command. – Check if the index is enabled or disabled: • If disabled, enable it: set constraints/indexes … enabled 71 © 2011 IBM Corporation

Troubleshooting (2) When an Index should be used but it is not – Check if this is a redundant index with another: (cont’d): • Remove unnecessary indexes, re-run statistics as needed. – Check the same query on a newer version of the DBMS engine: • If the index is used and the query runs faster, this might mean a defect in the previous release of the product that is fixed in the newest one. – Try forcing the index using an Optimizer Directive: • If times and costs are better using the index that was forced, this might be a defect on the DBMS. Collect all the information and call Tech Support. 72 © 2011 IBM Corporation

References IBM Informix 11. 70 Information Center (Manuals online) http: //publib. boulder. ibm. com/infocenter/idshelp/v 117/index. jsp IBM IDS manuals available in PDF format https: //www-304. ibm. com/support/docview. wss? uid=swg 27019520&wv=1 All IBM Informix technical publications and link to Support site http: //www-01. ibm. com/software/data/informix/pubs/ https: //www-304. ibm. com/support/docview. wss? uid=swg 27020950&wv=1 Look for: Getting Started Guide, Design and Implementation Guide, SQL Tutorial, Performance Guide and Administrator’s Guide Old but good article on fragmentation: Divide and conquer: Using fragmentation for smart data access http: //www. ibm. com/developerworks/db 2/zones/informix/library/techarticle/0 206 metzger/0206 metzger 1. html 73 © 2011 IBM Corporation