VSAM TUNING MADE SIMPLE! SORT OF.... PART I

by CRAIG R. McDONOUGH

Mr. McDonough has been in data processing for 18 years, the last seven as a DOS/VSE systems programmer. Prior to becoming a systems programmer, he gained experience as an applications programmer, systems analyst and independent consultant.

Introduction

Of all the DASD access methods that IBM supports, VSAM is perhaps the most responsive to (and dependent upon) "tuning." Tuning is defined as the optimizing of DASD performance and/or utilization through the manipulation of the parameters available when defining or using a file. The main tuning options can be analyzed and optimized while an application is being planned or, to a certain extent, as a "retrofit" to an existing application during a maintenance or design change. The examples chosen here and the specific recommendations given will primarily refer to VSE/VSAM files, but the techniques are, in most cases, just as applicable to OS/VSAM.

Tuning Parameters

The items to consider for VSAM tuning are:

o Logical record size;
o Access pattern;
o Control interval size;
o Control area size;
o Imbedded free space;
o Index options; and
o Bufferspace.

The first four will be covered in this article, Part I of a series to be continued in future months.

Logical Record Length

The logical record length of the application record is very application-design specific, and as such does not lend itself to external manipulation. The exception is to ensure that enough extra space is allocated within the record so that new information fields can be added to the record layout during a retrofit.

Access Pattern

The access pattern, the sequence in which successive records are retrieved from or added to the dataset, is also application-specific, but is more amenable to manipulation for performance. The main categories of access pattern are:

o Truly Random - no relationship between successive requests at all.
A license-number database accessed by a law enforcement agency servicing requests from officers in the field is a good example;
o Random Batch - The records are retrieved in random order within a certain range of keys. An example would be a payroll dataset in sequence by employee number within department, in which a clerk processes transactions by department and randomly by employee within that department;
o Sorted Batch - The records are retrieved by key, but successive requests within a batch are always in key sequence. In the payroll file above, the clerk processes by ascending employee number within a department before processing requests for a new department;
o Sequential - Access is always by entry sequence of the records in the dataset, and no records are skipped. The desired order of access is the order in which the records were originally placed in the file. Typically, the entire dataset is read each time the file is opened;
o Skip-Sequential - In this access mode the records are retrieved in physical key sequence, but the starting point within the file is variable. Once again, in the mythical payroll file, the clerk reviews the records for all of the employees within a given department.

Control Interval Size

The control interval, referred to as a "CI", is the unit in which VSAM transfers data between DASD and main storage. It is analogous to the "block" concept in other access methods. However, unlike block size, CI-size is not limited to an even multiple of the logical record length; the CI-size selected for a file is irrelevant to the application's processing logic. VSAM will present, except in very rare instances, discrete records to the application program without the application having to be concerned with the control interval size. A control interval is always made up of an integral number of physical records, whose physical record size is dependent upon both the effective CI-size and the device type being used.
(For FBA DASD, the physical record size will always be 512 bytes, regardless of CI-size.)

Indeed, CI-size is probably the most sensitive single item available for optimization in a VSAM dataset. For good or ill, it directly affects the processing efficiency, the amount of main storage required to process the application (called the "working set"), and the utilization of DASD space. A file's CI-size is selected under the following broad constraints, which are enforced by VSAM:

o Allowable Range - It must be between 512 bytes and 32,768 bytes. If less than 8K bytes it must be a multiple of 512 bytes, and if greater than 8K bytes it must be a multiple of 2K bytes. If you define a CI-size that violates this rule, VSAM will increase it to the next allowable multiple -- e.g., a CI-size of 800 bytes will be increased to 1024 bytes, and a CI-size of 9,216 will be increased to 10,240. Any extra space in a CI that is allocated this way may not be available for storing records if the RECORDSIZE is larger than the difference between the CI-size selected in the IDCAMS "DEFINE" command and the actual CI-size generated by VSAM;
o Recordsize versus CI-size - The largest RECORDSIZE in a non-spanned file can be no larger than the CI-size less 7 bytes. The 7 bytes are needed for VSAM's control information (however, see below about the actual space consumed in a VSAM control interval);
o Spanned Record - The space available within a SPANNED record's individual control intervals is the CI-size minus ten (10) bytes -- 4 bytes for the CIDF, 3 bytes for the record segment's RDF and 3 bytes for the RDF that holds the CI's level check. A SPANNED record is one whose RECORDSIZE may be larger than a single CI. An RRDS cluster cannot be defined as SPANNED;
o Record Count Within CI - As many logical records as possible will fit in a single control interval ("CI").
In a non-spanned VSAM file, if the remaining space in a CI is less than the maximum record length specified in the file definition, no record will be written to the CI, even if that record would fit. In a spanned VSAM file (a file where a logical record may span control intervals), only the space remaining in the last CI of a record goes unused, even if that space would be sufficient for a smaller record;
o Default CI-Sizes - If you do not specify a CI-size when you define a cluster, VSAM will allocate a CI-size for you (usually not what you would want, though). The rules are simple -- if you specified the "RECORDSIZE" parameter in the DEFINE, and the size of the record permits, VSAM will assign a CI-size of 2048 bytes [2K]. If no RECORDSIZE was specified, VSAM will assign a 4096-byte [4K] CI. Also, if the file is a KSDS dataset, VSAM will default to a 512-byte CI-size for the index component. If the default CI-size is too small for the record size specified, VSAM will allocate a CI-size at the next allowable multiple (multiples of 512 bytes if the data component RECORDSIZE is less than 8K, and multiples of 2K if the record size is greater than or equal to 8K);
o Index CI-Size - The index CI for a KSDS dataset must be no larger than 8K, and must be a multiple of 512 bytes;
o Forced Rounding - If you specify a CI-size that is not a proper multiple of 512 or 2048 bytes, VSAM will round it up to the nearest allowable multiple. VSAM will not issue any message indicating that the CI-size you selected has been overridden.

The primary access pattern of an application is very important when selecting which CI-size to use, as this parameter (CONTROLINTERVALSIZE) will have a very direct impact on the performance of your application. For sequential or skip-sequential access, larger control intervals are beneficial.
Since a larger CI-size allows a greater number of logical records to fit into each CI, fewer control intervals need to be transferred between DASD and main storage to process a given number of records, thus reducing I/O time.

However, for a randomly searched file, unless the access pattern results in a high "hit ratio" within very tight key ranges, a smaller CI-size is preferable. In this case the larger CI would force accesses to auxiliary storage to retrieve records that will not be needed, as each CI would contain only a few records (or just a single record) that will be referenced. The larger CI would also be under VSAM exclusive control, potentially tying up records that another user (in the online environment) wants to access. Also, consider the time wasted reading and writing these control intervals to and from storage when the records actually needed are such a small percentage of their contents.

For a KSDS file, a larger CI-size allows for more efficient distribution of the free space (after the file is loaded, as more records are added, more records will fit into each CI, thus requiring fewer CI-splits and fewer control area ["CA"] splits), and fewer index records are required, as there will be fewer control intervals to point to. A larger control interval size is also beneficial for a randomly accessed file if the retrieval is by pre-sorted input keys, or if the access is random within a "tight" key range, as VSAM will, if possible, reference any CI in its in-core buffers before reading from DASD.
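
As a rough illustration of the allowable-range and forced-rounding rules listed earlier, the following Python sketch rounds a requested CI-size up to the next valid value and applies the 7-byte and 10-byte overhead figures quoted above. The function names are hypothetical, and rejecting out-of-range requests with an error is an assumption of this sketch, not documented VSAM behavior.

```python
# Sketch of the CI-size rules described in the text; all names are hypothetical.
MIN_CI = 512       # smallest allowable CI-size, in bytes
MAX_CI = 32768     # largest allowable CI-size, in bytes
EIGHT_K = 8192     # boundary between 512-byte and 2K-byte multiples


def effective_ci_size(requested: int) -> int:
    """Round a requested CI-size up to the next allowable multiple."""
    if not MIN_CI <= requested <= MAX_CI:
        # Assumption: this sketch simply rejects out-of-range requests.
        raise ValueError(f"CI-size {requested} is outside {MIN_CI}..{MAX_CI}")
    step = 512 if requested <= EIGHT_K else 2048
    return -(-requested // step) * step  # integer ceiling to a multiple of step


def max_nonspanned_recordsize(ci_size: int) -> int:
    """Largest RECORDSIZE for a non-spanned file: CI-size less 7 bytes."""
    return ci_size - 7


def spanned_segment_space(ci_size: int) -> int:
    """Space per CI for a spanned record segment: CI-size less 10 bytes."""
    return ci_size - 10


print(effective_ci_size(800))            # 1024
print(effective_ci_size(9216))           # 10240
print(max_nonspanned_recordsize(2048))   # 2041
```

Because VSAM issues no message when it overrides a CI-size, a check of this kind before running the DEFINE can show whether the value you asked for is the value you will actually get.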
CI-size will affect main and auxiliary storage requirements also:

o As RECORDSIZE increases, you may need larger control intervals to hold the records;
o Poor choices for CI-size affect DASD utilization - e.g., only 13 fixed-length 150-byte records will fit into a 2K CI for a KSDS cluster, thus wasting 88 bytes (2048 - 10 - 1950 = 88), where:

    2048 --> control interval size;
      10 --> control information [1 CIDF and 2 RDFs];
    1950 --> space required for record storage [13 * 150];
      88 --> space remaining within the CI.

This 88 bytes represents 4.3 percent of each CI. Raising the CI-size to 4096 allows 27 records of this same 150-byte length to fit into the control interval, with an excess of only 36 bytes (4096 - 10 - 4050 = 36), where the values are calculated as above for the 2K CI. These 36 bytes represent less than one percent waste within the CI.

For a 50,000-record file, a 2K CI-size (which requires four 512-byte FBA blocks) will require 3847 CIs (50,000/13 = 3846.15), or 15,388 blocks. These same 50,000 records in the 4K CI (which uses 8 FBA blocks) will require 1852 CIs (50,000/27 = 1851.85), or 14,816 FBA blocks. If the 15,388-block allocation for the 2K CI is rounded up to 15,392 (an even multiple of 8 FBA blocks), this CI-size would hold 50,024 records. Used for a 4K CI, these same 15,392 FBA blocks hold 51,948 records. This represents a 4 percent increase (1,924 records) in the same space.
Due to the MAX-CA and MIN-CA rounding that VSAM will perform, these allocations (15,388 and 15,392 blocks) will actually be 15,438 blocks -- an allocation that represents 249 MIN-CAs for a 3370 FBA DASD;
o As control interval size increases, the allocated buffer space has to expand to hold the larger control interval;
o For an indexed cluster, if the data component CI-size is small and there are many potential data CIs in a single control area, the default (or selected) CI-size of the index component may not be large enough, at the lowest level (the sequence set), to address all the potential data CIs in the control area. This forces VSAM to leave some of the control intervals in the control area empty, thus wasting the unused space. If the sequence set index record cannot address all the CIs in the control area it references, the unaddressable CIs cannot be used for records, even though VSAM has already allocated the space for this control area. (See the section "Index Options" in next month's installment of this series.)

Control Area Size

A Control Area ("CA") is the unit of DASD storage that VSAM will allocate and preformat when loading or expanding a file. With the advent of the linear addressing scheme for the FBA DASD devices (3310/3370), the former practice of allocating space by tracks and cylinders to optimize performance is no longer as conceptually clear, even though the techniques and the end result are the same. The terms "MIN-CA" and "MAX-CA" are now used in place of the terms "track" and "cylinder", respectively, when discussing VSAM space allocations. Even though FBA DASD is addressed in terms of 512-byte blocks, these devices are still physically organized in tracks and cylinders. VSAM will always allocate space in multiples of MIN-CA (for FBA a MIN-CA = 62 blocks [31K bytes]) up to the MAX-CA for the device (for FBA MAX-CA = 744 blocks [372K bytes]).
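
The space arithmetic in the 150-byte example above can be checked with a short Python sketch. It assumes fixed-length records, the 10 bytes of control information used in that example, 512-byte FBA blocks and a 62-block MIN-CA. It ignores free space and index requirements, and the function names are illustrative only.

```python
# Sketch of the CI space arithmetic from the text; names are hypothetical.
FBA_BLOCK = 512        # bytes per FBA block
MIN_CA_BLOCKS = 62     # FBA MIN-CA, in blocks
CONTROL_BYTES = 10     # 1 CIDF (4 bytes) + 2 RDFs (3 bytes each)


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def records_per_ci(ci_size: int, record_len: int) -> int:
    """How many fixed-length records fit after control information."""
    return (ci_size - CONTROL_BYTES) // record_len


def wasted_bytes(ci_size: int, record_len: int) -> int:
    """Bytes left over in each CI once it is full of records."""
    used = records_per_ci(ci_size, record_len) * record_len
    return ci_size - CONTROL_BYTES - used


def blocks_needed(n_records: int, ci_size: int, record_len: int) -> int:
    """FBA blocks needed to hold n_records, before MIN-CA rounding."""
    cis = ceil_div(n_records, records_per_ci(ci_size, record_len))
    return cis * (ci_size // FBA_BLOCK)


def round_to_min_ca(blocks: int) -> int:
    """Round a block count up to a whole number of MIN-CAs."""
    return ceil_div(blocks, MIN_CA_BLOCKS) * MIN_CA_BLOCKS


for ci in (2048, 4096):
    print(ci, records_per_ci(ci, 150), wasted_bytes(ci, 150),
          blocks_needed(50_000, ci, 150))
print(round_to_min_ca(15_388))  # 15438
```

Run as written, this gives the figures quoted in the text: 13 records and 88 wasted bytes per 2K CI, 27 records and 36 wasted bytes per 4K CI, 15,388 and 14,816 blocks for 50,000 records, and a rounded allocation of 15,438 blocks.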
A control area is always made up of an integral number of control intervals, but performance is enhanced if a whole number of control areas will fit into a single MAX-CA, due to the command chaining used to read an entire cylinder at a time. VSAM will always allocate storage on MIN-CA and MAX-CA boundaries (allocations may split cylinders but never tracks).

The size of the control area is indirectly significant for sequentially organized files (RRDS/ESDS) due to this command chaining, but it has a direct impact on indexed files. The CA-size selected is one of the primary factors in how VSAM will allocate its indices, and in how free space is used (and thus in the incidence of CI and/or CA splits). For an indexed file, VSAM also views a control area as the space occupied by the number of control intervals that can be addressed by a single SEQUENCE SET index record. The sequence set record is the index record that holds the high-key marker and the physical location needed to address each CI in the CA. For an indexed file, the index record must be large enough to address all the control intervals in a control area. Thus, for sequential/skip-sequential access of an indexed file (or an alternate index [AIX]), the more control intervals there are in a control area, the fewer index records need to be read. Larger control areas will also generally result in better performance and space utilization. Because the VSAM catalog keeps track of its allocations, no control information fields (such as the RDF and CIDF in the control interval) are required in the control area itself.

Control area size cannot be directly specified, but the quantity that VSAM uses can be influenced by the options that you choose when you describe the cluster via IDCAMS "DEFINE." VSAM will choose, for a given file's control area, the smallest of the MAX-CA for the device, the primary space allocation and the secondary space allocation for the cluster.
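
The selection rule just described can be sketched in Python as follows. This is a simplified model for FBA devices only, using the 62-block MIN-CA and 744-block MAX-CA figures given earlier. Rounding each allocation up to a whole MIN-CA before comparing, ignoring a secondary allocation of zero, and dividing the CA evenly into CIs are all assumptions of this sketch, and the function names are hypothetical.

```python
# Simplified model of CA-size selection on FBA DASD; names are hypothetical.
FBA_BLOCK = 512        # bytes per FBA block
MIN_CA_BLOCKS = 62     # FBA MIN-CA, in blocks
MAX_CA_BLOCKS = 744    # FBA MAX-CA, in blocks


def round_to_min_ca(blocks: int) -> int:
    """Round a block count up to a whole number of MIN-CAs."""
    return -(-blocks // MIN_CA_BLOCKS) * MIN_CA_BLOCKS


def control_area_blocks(primary: int, secondary: int = 0) -> int:
    """Smallest of MAX-CA, primary allocation and secondary allocation."""
    candidates = [MAX_CA_BLOCKS, round_to_min_ca(primary)]
    if secondary > 0:
        # Assumption: a secondary allocation of zero is ignored.
        candidates.append(round_to_min_ca(secondary))
    return min(candidates)


def data_cis_per_ca(ca_blocks: int, ci_size: int) -> int:
    """Whole CIs that fit in a CA (assumes the CA is filled with data CIs)."""
    return (ca_blocks * FBA_BLOCK) // ci_size


ca = control_area_blocks(primary=15_438, secondary=620)
print(ca)                         # 620
print(data_cis_per_ca(ca, 4096))  # 77
```

In this model a small secondary allocation drags the control area well below the MAX-CA, which is one way a DEFINE can unintentionally reduce the number of CIs that each sequence set record covers.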
As mentioned above, VSAM will preformat space by control areas when loading or extending a file. This process consists of VSAM writing end-of-file records across each control area it allocates before writing any data or index control intervals into the freshly-allocated control area. Thus, in theory, if the application should fail after starting to load or extend a dataset, a problem program could read to the end-of-file and resume loading/extending the dataset from that point.

This preformatting is costly in terms of I/O time, as the control area is effectively being written twice -- once when preformatting and again on writing actual data records. If the cluster is to be initially loaded by a utility (DITTO, SORT/MERGE, VSAM REPRO), it is simpler, in the event of an ABEND on loading, to delete and redefine the cluster's catalog entry and reload the dataset from the beginning. In this case, the preformatting of the area on disk does not help. Indeed, it can be a serious performance bottleneck.

The process of preformatting can be bypassed by specifying the option "SPEED" in the DEFINE for the cluster. This option applies only to the INITIAL LOAD, not when the file is being extended. "SPEED" is not the default attribute, and must be specified if desired.