Cutting DB2 Tables Down to Size: Part II

By David Halterman

David Halterman, Senior Systems Software Engineer at Sterling Software, Systems Software Laboratories Division, has 16 years of mainframe software experience in data center operations, technical support, and software development.

In most organizations, the implementation of DB2 has been accompanied by an almost "out of control" growth in the DASD required to handle the extremely large size of DB2 tables. Saving DASD has become a critical issue in most DB2 sites. Many information systems (IS) managers are investing time and money in compression utilities, hoping to contain soaring DASD costs. These utilities are designed to achieve compression ratios of up to 80 percent. But unless you compare the before and after DASD utilization, you will never know if you are getting the best return on your compression efforts.

This is the second of three articles examining DB2 compression. The first article, "Cutting DB2 Tables Down to Size With Data Compression" (Technical Support Magazine, March 1992), discussed the impact of data compression on DB2 objects, processing and overhead, and explained how you can balance requirements for space savings against increased overhead. This second article describes methods of evaluating compression in DB2. It provides the technical information you need to evaluate DB2 data compression products for your specific requirements. The third article will present the experiences of several organizations that have implemented DB2 compression using Sterling Software's SHRINK for DB2.

The information contained in this article has been collected over a period of two years from development and testing activities at Sterling Software's Software Labs Division (San Bernardino, California) and from evaluation activities at many companies located around the world. Additional information has been obtained from discussions with IBM developers and support personnel, and from various IBM publications. Substantial effort has been made to ensure that the techniques and suggestions are not product specific, making the information applicable to a broad range of situations.

Methods of Evaluating Compression

There are two phases to evaluating compression: the compression analysis process and the actual implementation and testing. Because the implementation of compression in DB2 is often a time-consuming process, the ability to perform compression analysis is important. Compression analysis enables you to determine whether the DASD savings achieved through compression are sufficient to offset any additional overhead incurred. As a result, you can concentrate on implementation activities only for those tables that offer the highest return.

Most vendors provide a method for determining the amount of compression that will be achieved if compression is implemented using their routines. The accuracy of this analysis, however, depends on the method used to perform it. Trial compression is the most effective method. In trial compression, rows from selected tables are read and the compression algorithm is applied to the data. The analysis routine maintains statistics on the average amount of compression a routine can achieve and reports the results to the user.

The least accurate method, on the other hand, examines the DB2 catalog for information regarding the types and lengths of data that a table contains. The analysis applies average amounts of expected compression to these statistics, then reports the anticipated compression results. Because this type of analysis does not read actual data, it ignores the most significant indicator of potential compression. Without reading the data, it can only guess at how effectively the compression routine will function.
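As an illustration of the kind of information a catalog-only analysis works from, a query along the lines of the following sketch (the creator and table names are placeholders) lists each column's data type and defined length:

  -- Column definitions for one table, from the DB2 catalog
  SELECT NAME, COLTYPE, LENGTH
    FROM SYSIBM.SYSCOLUMNS
    WHERE TBCREATOR = 'creator'
      AND TBNAME    = 'tablename';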
Some vendors provide the ability to choose from several compression routines, enabling you to achieve the balance of compression and overhead that meets your business objectives. In this case, the analysis facility should include the ability to assess each routine and its relative performance against the other routines.

Controlling Compression Analysis

With limited time available, you must be able to analyze a variety of tables quickly to determine which tables to compress. As a result, the ability to control the number of rows to be analyzed and the maximum amount of time to spend on analysis becomes important. This control over compression analysis provides flexibility and enables you to obtain the accurate information you require. In addition, because the granularity of the data affects compression, the ability to start the analysis at a certain row of the table, or to skip a certain number of rows, is important in obtaining accurate compression statistics.

Vendors of DB2 compression products typically provide one of two facilities for performing compression analysis: batch-oriented analysis and ISPF-oriented analysis. Some vendors provide both. Batch analysis involves running a batch job to report compression analysis information. It is useful for extensive analysis of large tables because you are free to perform other tasks while the analysis runs. However, the turnaround time for running batch jobs and the need to specify the control parameters correctly are drawbacks. Consolidated information is also difficult to view, making it harder to reach decisions about which tables to consider for compression implementation.

In ISPF analysis, on the other hand, panels guide you through compression analysis and present the results of that analysis. For example, from a list of candidate tables you select one for compression analysis. After analysis, you can review the results and compare them with those of other tables. With this information, you can more easily determine the benefits of implementing compression on one table over another. The specification of control parameters is also simplified within ISPF. The ability to save the results of the analysis eliminates the need to re-run the analysis to review the results. If the results can be saved in a DB2 table, you also have the advantage of executing SQL statements against them to determine which tables are good candidates for compression.

Comparing Expected Results with Actual Results

Compression analysis indicates the anticipated results of compression. There is only one way to verify the accuracy of the analysis: compress the table. To implement compression on a table, you modify the CREATE TABLE DDL to include the EDITPROC parameter, which specifies the name of the compression routine. You must first, however, determine the amount of space the uncompressed table uses to provide a basis of comparison with the compressed table. The quickest way to do this is to run the RUNSTATS utility to update values in the DB2 catalog.
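As a minimal sketch, assuming the table resides in a tablespace named TSCOMP in database DBCOMP (both names are hypothetical), the utility control statement might look like the following; the exact options available vary by DB2 release and installation standards:

  RUNSTATS TABLESPACE DBCOMP.TSCOMP INDEX(ALL)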
The current number of pages may then be obtained by running the following SQL statement, using SPUFI or QMF:

  SELECT CREATOR, NAME, CARD, NPAGES
    FROM SYSIBM.SYSTABLES
    WHERE CREATOR = 'creator'
      AND NAME    = 'tablename';

You may also determine the amount of DASD space the tablespace is currently using by running VSAM utilities. Once the table is loaded with compression implemented, you must run RUNSTATS again to update the values in the DB2 catalog--then execute the same SQL statement specified above to determine the number of pages used by the compressed tablespace. Using the uncompressed and compressed NPAGES values in the following formula gives the percent of tablespace compression achieved for the compressed table:

  % of compression = ((# pages without compression - # pages with compression) / # pages without compression) * 100

The percent of compression should equal the results calculated by the compression analysis, although other factors can affect the results. The factors listed below may cause a discrepancy between the amount of compression reported by analysis and the actual compression achieved.

o Reorg needed--If the uncompressed table contains a large number of rows that use free space, NPAGES will not reflect the true value of a freshly loaded, or reorged, table. You may need to run REORG prior to running RUNSTATS to obtain a more accurate value for NPAGES for the uncompressed table.

o Values changed--If the values for PCTFREE or FREEPAGE are changed for the compressed table, the results will not be based on consistent numbers. You should ensure that the values are consistent by using SQL to query the DB2 catalog.

o Shared tablespace--If the table being compressed shares a tablespace with another table, the rows of the other table will affect the number of pages that the compressed table requires. Most compression analysis figures assume one table per tablespace when calculating space requirements.

o Overhead pages--The percent of compression in the formula above is based on the number of pages in the table. DB2 requires two overhead pages, which are included in the NPAGES value. If the total number of pages in the table is very small, these overhead pages cause the percent of compression to be lower than expected. If this is the case, subtract two from the before and after NPAGES figures and repeat the calculation.

o Vendor data--There are two values that vendors should provide regarding compression of DB2 tables. The first is the percent of row compression. This figure (provided by most vendors) is the average amount of compression of each row in the table. The second is the percent of tablespace compression, a figure provided by only a few vendors. It is the more accurate figure to use in determining how much DASD will be saved. The reasons for this are fully explained in the section on realizing space savings in the previous article.

The more information provided by compression analysis, the more accurate you can be when comparing anticipated and actual results. For example, before and after figures for the number of pages and the number of megabytes can make discrepancies more apparent.

Implementation

Compression is implemented by specifying the name of the compression routine in the EDITPROC parameter of the CREATE TABLE DDL.
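For example, a CREATE TABLE statement with an EDITPROC clause might look like the following sketch. The table, tablespace and routine names (COMPRTN1 stands in for a vendor-supplied routine) are placeholders, and the clause ordering shown follows common DDL conventions--check your own standards:

  CREATE TABLE PAYROLL.EMPLOYEE
        (EMPNO    CHAR(6)     NOT NULL,
         LASTNAME VARCHAR(20) NOT NULL,
         SALARY   DECIMAL(9,2))
    EDITPROC COMPRTN1
    IN PAYROLLDB.PAYROLLTS;

Note that an edit procedure is specified when the table is created; adding one to an existing table typically means unloading the data, dropping and recreating the table with the EDITPROC clause, and reloading.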
You may need to create this routine if you choose a custom or unique routine. You must carefully follow the steps defined by the vendor to create it. Make sure the new routine is available to DB2 by placing it in a load library concatenated to the STEPLIB DD statement in the DB2 startup procedure. IBM recommends using the DSNEXIT library for this purpose.

Prior to implementing compression, you may wish to change the allocations for the tablespace that contains the table to be compressed--you may not realize space savings if the original allocations never exceeded the primary allocation amount. The amount you choose to allocate to the compressed table may be determined from figures provided by the compression analysis product. If the product reported the number of megabytes or number of pages for the compressed table, use this as a starting point for determining the correct value.

Choosing an Appropriate Compression Routine

Your choice of a DB2 data compression product will depend, to a large extent, on the performance of the compression routine or routines the product offers. During the evaluation process you may find that you can achieve your current objectives with a product that supplies only a single routine. Keep in mind, however, that once you purchase a product, you are limited to the compression routines that product supports. As a result, you should consider future requirements in making your purchase decision. The ability to select from several routines ensures greater control over the process today--and accommodates new requirements stemming from future growth. For these reasons, you must carefully consider your choice of a product (and its routines). There are several factors to include in the selection process.

The amount of compression a routine provides is the primary factor in determining whether or not to use that routine. This information is easily obtained by performing whatever compression analysis process the vendor provides for this purpose.

The amount of overhead a compression routine adds to DB2 processing is a difficult figure to obtain. Due to the nature of DB2, overhead is a critical issue. Many companies want to use a DB2 compression product but have been forced to accept high overhead for a fixed amount of compression from a single routine. In some cases they have not been able to justify the cost of compression. The ability to choose from a variety of routines, selecting one that delivers adequate compression with acceptable overhead, may be critical to your justification of a compression product. Some vendors who provide a selection of routines also provide relative performance information for each routine. This information may be used to select a routine for implementation.

To select a compression product and routine, you must first determine your enterprise's objectives for implementing compression in DB2. In other words, what is your primary objective?

o Maximum compression at any cost?
o Adequate compression with acceptable overhead?
o Any compression with minimal overhead?

There are several considerations other than the amount of compression and overhead to include in your evaluation of compression products and routines. These include data integrity, the applicability of the routine and the vendor profile. Guaranteeing the integrity of your data is essential. This capability is implemented in different ways, but a routine that does not perform data validation can result in corrupt data being returned to your application.
The applicability of a routine, as discussed in the section "Types of Compression Routines" in the previous article, may be a concern. Restricting yourself to standard routines can simplify procedures. However, the compression and overhead benefits of using custom or unique routines may outweigh the benefit of simplicity. You may also want to consider the installed base, the length of time the product has been available and the strength of the vendor when selecting a compression product.

Compression Overhead Evaluation Factors

You may have specific expectations regarding what constitutes acceptable overhead from a compression product. Whether or not those expectations are realized depends on many factors. Different applications, DB2 processes and compression routines all result in different amounts of overhead. Other system-related factors, some very technical and complex, may also affect overhead. The following discussion describes the potential negative impact these factors may have on a performance evaluation. They are not presented in any particular order; their relative importance depends on the environment in which the evaluation is conducted.

The level of activity within an MVS system during the compression performance evaluation has a significant impact on the perceived level of overhead for compression. This is because compression is a CPU-intensive task, and any activity that increases the CPU time charged to the task can distort the overhead measurement. When a test is run on a system with a high level of activity, MVS is constantly processing interrupts. The time required to perform the entry logic of an interrupt routine is charged to the task that was interrupted, and when a task is restarted, the processing required to restart it is charged to the task being restarted. Therefore, compression is charged with both the interrupt and the restart time. Although this may seem insignificant, our laboratory tests have shown overhead figures to vary by as much as 50 percent, depending on the level of activity within the MVS system. The higher the level of activity, the less accurate your performance figures will be.

Increased activity within an MVS system also affects cache memory. Most machines today use instruction and data caching, and the higher the level of activity, the less likely instructions and data are to remain in cache. Laboratory tests have shown that the availability of instructions and data in cache may result in as much as a 50 percent decrease in processing time.

The discussion regarding other activity within MVS also applies to performance tests run in a PR/SM environment--with a significant difference. In this environment, CPU times are affected by both MVS and PR/SM processing. Within PR/SM, shared LPARs result in additional CPU cycles to handle the timer interrupts and multiple users of the high-speed cache buffers. The CPU time due to PR/SM affects the job's TCB, SRB and uncaptured times. The more LPARs, and the more activity in each LPAR, the higher the associated CPU time. Studies indicate that TCB and SRB times will increase from 3 to 23 percent, which affects performance timing significantly. The elapsed time that a process takes also suffers in this environment. The same considerations apply to performance testing in the VM environment.

The level of activity within DB2 during performance testing influences the amount of CPU time and elapsed time that a task takes.
The buffer pool, queues and paging are all affected by DB2 activity. The priority of tasks within DB2 also affects both elapsed time and CPU time. The effect on elapsed time is obvious; CPU time is affected because of the additional interrupts that higher priority tasks cause for lower priority processes.

IBM has improved performance with each release of DB2. Since the execution paths of DB2 processes have an impact on the measured performance, performance information obtained on one version of DB2 is not necessarily comparable to another release. This is a concern if your test system runs a different version of DB2 than your production system.

The CPU model on which the tests are run affects the results in two ways. First, the CPU model is considered by the DB2 access path selection algorithm, and since the access path is one of the most crucial factors in the performance of a process, this affects the results. Second, different CPU models have different implementations of the microcode that determines the execution speed of individual instructions. Therefore, the results of a performance evaluation on a test machine may vary from those on a production machine.

The I/O configuration that DB2 uses also affects performance. This will be most apparent in elapsed time; CPU time, however, also reflects differences in I/O speed.

Performing Compression Tests

To perform compression tests for the purpose of determining the amount of compression and, more importantly, the amount of overhead, carefully set up the tests and the environment in which they will be performed. A representative sample of processes (SELECT, UPDATE, LOAD, UNLOAD, DELETE, etc.) should be executed to provide a variety of data on which to base your decision about which compression routine to use in production. The nature of compression routines is such that it is not possible to extrapolate the overhead of one application process to another. You should test each process that is critical to your environment to provide the figures necessary for the decision-making process.

Perform the tests for every table on which you are considering compression. Most tables contain different types of data but, more importantly, they contain different data values. Compression routines are sensitive to the values contained in a row. The only way to determine the impact of a compression routine on a particular table is to test it.

When you perform the tests, take particular care to minimize the factors that may influence results, as discussed earlier in this article. To produce the most accurate results when comparing routines, perform the tests with one routine and immediately follow them with the same tests using another routine. This ensures that the environmental conditions that affect performance figures are similar for the two tests.

You can measure the overhead of a particular process in different ways. The most accurate method involves using the DB2 accounting trace facility. Starting an accounting trace with Classes 1, 2 and 3 provides detailed information that shows the precise amount of CPU time required for a process. To capture the necessary information, perform the process with and without compression while the accounting trace is active. You may then use DB2PM to format the results for comparison. The value in the DB2PM report that most accurately depicts the overhead of a compression routine may be found under the column titled "DB2 Times (Class 2)". The most realistic assessment of the effect of compression on the application can be found under the column titled "Application Times (Class 1)".
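As a sketch of what starting such a trace might look like (the classes and destination shown are illustrative; your installation may route accounting data differently), the command could be issued from the console or a DSN session:

  -START TRACE(ACCTG) CLASS(1,2,3) DEST(SMF)

Once the test processes have completed, the trace can be stopped with a corresponding -STOP TRACE command.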
An alternate, but less accurate, method involves using JES statistics. The CPU time reported by JES may be obtained with and without compression and the results compared. The measurement of compression itself may be done when performing compression tests, as described in the section "Comparing Expected Results with Actual Results" earlier in this article.

Reviewing Compression Tests

Once you have the results of the performance tests, review them for obvious problems. A quick look at the figures will often reveal whether other factors have influenced the results of the test. For example, if the CPU time without compression is 1 minute and the elapsed time is 10 minutes, while the CPU time with compression is 1 minute 30 seconds and the elapsed time is 30 minutes, it is easy to see that some other factor influenced the results. An increase in CPU time of 50 percent should never cause elapsed time to triple!

An expensive but dependable method for validating the results of a test is to run it multiple times. In our laboratory, we perform the tests at least three times--and sometimes many more if the resulting figures are not consistent.

Summary

On the surface, implementing compression may seem like a simple process. In fact, you must consider a variety of factors before determining which compression product best meets the needs of your organization. As we have seen, compressing DB2 tables is only part of the process. Because of DB2's unique characteristics, compression may not always result in DASD savings--which, of course, is the goal of data compression. Additionally, the price of compression, in terms of increased CPU utilization, may be too high. It is essential, therefore, that you compare the before and after DASD utilization to determine whether or not you are achieving the levels of compression you expect. Your compression strategy should include a means not only of analyzing anticipated compression ratios, but also of measuring the results after the fact to ensure you are getting the maximum return on your investment.