Tuesday, September 29, 2015

File Access Techniques

In my last blog my youngest brother commented, “Now you need to do a post on the transition from sequential access (mag tapes) to FAT access (or whatever they called the earliest version of non-sequential access).” While that might seem like a simple request, there are a lot of technical details that will make this blog entry fairly long and complicated.

However, before I get started, I think that it’s interesting to note how this question illustrates the rapid rate at which computing technology has changed. Even though my brother is only 10 years younger than me, he uses the term “FAT access” to describe non-sequential access. The term FAT comes from PC technology and represents the computing age that he grew up in. But my computing memories are from the earlier mainframe era of the late 1960s, when personal computing was not part of our vocabulary – or even of our vision of the future.

I’m going to try to address this topic by writing about three different file access methods, going into a bit of technical detail on each, then finishing up with a couple of true stories from my early computer days that illustrate how one can get into trouble when dealing with files.

Sequential (Tape) Access

The standard magnetic tape when I entered the computer field was ½” wide and came on reels of either 1200’ or 2400’. The recording density in those days was either 200 bpi (bits per inch), 556 bpi, or 800 bpi (later improvements added 1600 bpi and 6250 bpi). There were either 7 or 9 tracks across the tape (depending on the type of computer). At low density, a single 2400’ reel had a capacity of about 5.7MB (200 bpi x 12 inches/foot x 2400 feet) – not very much in terms of today’s media, but the equivalent of 36 boxes of punch cards (2000 cards/box), a considerable compression ratio. It was possible with a magnifying glass to see the individual recorded spots on these low-density tapes and to manually “read” the contents.

However, as with all physical media, it is not possible to match the processing speed of the computer with the speed of the device. Therefore, we must employ some method that allows us to manage the differences in speed. There are two primary methods with magnetic tape.

Blocking – Tape drives cannot start/stop “on a dime”. The tape is light so inertia is pretty low, but even so, to go from reading speed (100+ inches/second) to zero and then back up to full speed means that we need a certain amount of “blank” space between the blocks of data. This IRG (Inter-Record Gap) was .75”. At 800 bpi, the equivalent of a punched-card-worth of data can be stored in .1”. That would mean that we have only .1” of data followed by .75” of gap, reducing the effective capacity of the tape to under 3MB (about 17 boxes of cards) out of the roughly 23MB it could hold if there were no gaps at all. Therefore, we want to store multiple records together before having each IRG. If we increase the size of each data block from 80 characters to 800, we have 1” of data for every .75” of IRG. This also reduces the wear and tear on the tape (and the tape drive) as we are not having to stop/start for every record, but only every 10 records.
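To make the arithmetic concrete, here is a quick back-of-the-envelope calculation (a sketch in Python, using only the figures quoted above: 2400 feet of tape, 800 bpi, 0.75” gaps, and 80-character card images):

    # Effective capacity of a 2,400-foot tape at 800 bpi for various blocking factors.
    TAPE_INCHES = 2400 * 12   # 28,800 inches of tape
    DENSITY = 800             # characters per inch
    IRG = 0.75                # inter-record gap, in inches
    RECORD = 80               # one punched card's worth of characters

    def capacity(records_per_block):
        block_inches = (RECORD * records_per_block) / DENSITY   # data portion of each block
        blocks = int(TAPE_INCHES / (block_inches + IRG))        # block+gap pairs that fit
        return blocks * RECORD * records_per_block              # characters actually stored

    for factor in (1, 10, 100):
        print(factor, round(capacity(factor) / 1_000_000, 1), "MB")
    # prints roughly: 1 -> 2.7 MB, 10 -> 13.2 MB, 100 -> 21.4 MB

The gap overhead shrinks quickly as the blocks grow, which is exactly why blocking mattered so much.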

But one might logically ask, why not just make the blocks really big (say 10” or 8000 characters long)? The answer is that we were limited by the size of main computer memory in those days. Before I finish this answer, let me discuss the other method used to help match computer and tape drive speeds.

Buffering – It’s best if the computer never has to wait for a physical tape movement. One of the ways to do this is to “read ahead”, i.e. for the tape drive to read the next block of data into main memory while the computer is processing the data from the previous block. This was nearly always done, but it doubled the amount of main memory required for buffers.
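For readers more comfortable with modern code, the read-ahead idea can be sketched like this (an illustration only, with a background thread and an ordinary file object standing in for the channel and tape drive; back then the operating system’s access methods and the channel hardware did this for you):

    import threading, queue

    def read_ahead(f, block_size=800):
        """Yield blocks from f, always trying to keep one extra block already in memory."""
        buffers = queue.Queue(maxsize=1)        # the "second buffer": at most one pre-read block
        def reader():
            while True:
                block = f.read(block_size)
                buffers.put(block)
                if not block:                   # an empty read marks the end of the data
                    return
        threading.Thread(target=reader, daemon=True).start()
        while True:
            block = buffers.get()               # already in memory by the time we want it
            if not block:
                break
            yield block                         # caller processes this block while the next is read

So a processing loop such as "for block in read_ahead(open('month1.dat', 'rb')): ..." never (ideally) sits idle waiting for the tape.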

Back to limitations – if one has a simple program that is just reading from one tape, performing some simple procedure, and writing the results to another tape, the program will require four buffers (two for each tape), each of which might be 800 characters long, for a total of 3200 characters of data. Early computers had very limited memory – primarily because it was really expensive. An average computer might only have 32K of main memory. That means that even this simple program is using 10% of the memory just for data buffers, reducing the amount available for both the program processing the data and the operating system. If we made our blocks really big (the 8000 characters mentioned above), our four buffers would consume essentially all of the available main memory, leaving nothing for the program and operating system!

Such were the limitations of even sequential tape processing in the early days.

Direct (Relative) Access

While there were some disk systems prior to the mid-1960s, they were very expensive and not commonly used. For example, the computer that I learned on (a CDC 3600 – one of the early super-computers) had high-speed drum storage. This drum was 4’ long and about 1’ in diameter and had a head per track. However, the amount of storage on it was only 32K words (there were six bits per character and six characters per word, so it was a 192K computer in today’s measures) – the same as the main memory of the computer. It was only used for swapping out programs so that we could run more than one program. Certainly it was too expensive to “waste” on any data storage.

One of the first removable-media disk drives, the IBM 1311, was introduced by IBM in 1962. Two years later, coincident with the introduction of the IBM 360 line of computers, the 2311 was introduced. The 2314 was introduced one year later. The 2311 had a capacity of 7M per storage device; the 2314 had a capacity of 28M. These numbers may seem pretty trivial these days, but that was a LOT of capacity back then.

One of the significant differences between tape and disk is that the disk is constantly spinning, so no IRG is needed. However, the size of a single block of data is limited to the amount of data that can be stored on one track (there are 2000 tracks per unit in a 2311, so that would be about 3.5K). The second difference is that one was no longer limited to reading the tracks in sequential order, but could access them randomly. This gave rise to the first type of direct/relative access.

The term “relative” is used because each block of data was addressed relative to the first block of that file – for example, the 10th block is the same distance from the beginning regardless of whether the file occupies blocks 100-199 or 550-649. The question then is reduced to trying to determine which relative block a given record should be stored in.

Most real-world data is not nicely numbered from 1-N, so we need an algorithm to “map” the key of the data to the range of 1-N (actually 0 to N-1). If the data key is numeric, then a simple way might be to divide by the number of blocks and take the remainder (note that this is the origin of calling something a “customer number” even though many computer systems have customers identified by a key which may contain other than just digits – in the early days we really did use only numbers for ease of accesses like this one). I won’t go into any more detail, but there is a lot involved in doing that mapping in the best way possible.
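As a minimal illustration of that mapping (a sketch only; the 100-block file size and the customer numbers are invented for the example):

    N_BLOCKS = 100                            # pretend the file was allocated with blocks 0-99

    def home_block(customer_number):
        return customer_number % N_BLOCKS     # divide by the number of blocks, keep the remainder

    print(home_block(429))   # 29 -- the same as taking the last two digits
    print(home_block(229))   # 29 -- a collision, which the next paragraph deals with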

But there was another problem – what if two records mapped to the same place? You can’t just overwrite what is there. The usual solution is to start reading sequentially from that point and look for an available opening. So, for example, if our mapping was to take the last two digits of the key, we would initially try to store customer 429 in block 29. But if customer 229 was already there, then we would look at block 30, then 31, etc. until we found an unoccupied block and put customer 429 there (say, for example, in block 32). In doing lookups we would follow the same type of logic, looking for customer 429 in block 29 and, if not finding it there, then looking in subsequent blocks. Again, I won’t go into more detail here. But the complexities of using these kinds of files meant that they were not heavily used.
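The store-and-lookup logic described above amounts to what we would now call linear probing. A simplified in-memory sketch (one record per block, reusing the home_block mapping from the previous example, and ignoring wrap-around and the "file full" case):

    N_BLOCKS = 100
    blocks = [None] * N_BLOCKS                 # None means the relative block is unoccupied

    def home_block(key):
        return key % N_BLOCKS                  # same mapping as in the sketch above

    def store(record):
        b = home_block(record["key"])
        while blocks[b] is not None:           # home block taken: probe the following blocks
            b += 1
        blocks[b] = record

    def lookup(key):
        b = home_block(key)
        while blocks[b] is not None:
            if blocks[b]["key"] == key:
                return blocks[b]
            b += 1                             # not the one we want, keep reading forward
        return None                            # an empty block means the record is not in the file

    store({"key": 229, "name": "First customer"})
    store({"key": 429, "name": "Second customer"})   # block 29 is taken, so this lands in block 30
    print(lookup(429)["name"])                       # Second customer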

Indexed (Indexed-Sequential) Access

Although computing was taking off beginning in the early 1960s, most processing still used sequential files. The COmmon Business Oriented Language (COBOL) was initially made available in 1960 and quickly became the most often used language for solving business problems, but indexed files did not appear in COBOL until the release of COBOL-68. Indexed files were implemented by having three separate areas of the disk for different purposes. The index area had one record for each block of data and listed the first and last key in the block as well as a pointer to it. The data area was where the blocks of information were stored; besides the data, each block held a few other pointers, including one to the overflow area (if that block had overflowed [more on this below]). The overflow area consisted of unblocked records that could not fit in the data-area block to which they would normally be assigned.

If records were deleted, they were simply marked (most generally by putting all binary 1s in the first byte of the record). If records were added and there was no space available in the block to which they were assigned, then a record would be pushed into overflow. For example, if a block contained records with keys of 1, 2, 3, 5, 9 and a record with a key of 7 was added, then the 7 record would be placed in the main block (1, 2, 3, 5, 7) and the record with a key of 9 would be placed into overflow, with a pointer to it placed in the main data block. If another record with a key of 4 were added, the main data block would become (1, 2, 3, 4, 5) and the record with a key of 7 would go into overflow. The main block would have a pointer to the “7” record and the “7” record would have a pointer to the “9” record. Periodically, the indexed file would be “unloaded” into a sequential file and then reloaded so that all records would be in the main data area again.
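That push-into-overflow behavior is easy to model with a toy sketch (in-memory only, with bare keys instead of full records and an assumed prime-block capacity of five; real indexed-sequential files kept all of these pointers on disk):

    class Block:
        CAPACITY = 5

        def __init__(self, keys):
            self.keys = sorted(keys)       # the prime (main) data block, kept in key order
            self.overflow = []             # stands in for the chain of unblocked overflow records

        def add(self, key):
            self.keys.append(key)
            self.keys.sort()
            if len(self.keys) > self.CAPACITY:
                pushed = self.keys.pop()   # the highest key is pushed out of the prime block...
                self.add_overflow(pushed)  # ...and chained into the overflow area

        def add_overflow(self, key):
            i = 0                          # walk the chain so it stays in key order
            while i < len(self.overflow) and self.overflow[i] < key:
                i += 1
            self.overflow.insert(i, key)

    b = Block([1, 2, 3, 5, 9])
    b.add(7)                               # prime block: 1 2 3 5 7   overflow chain: 9
    b.add(4)                               # prime block: 1 2 3 4 5   overflow chain: 7 -> 9
    print(b.keys, b.overflow)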

Horror Stories

A Sequential File Nightmare – In the early 1970s I was working for Uniroyal at their then corporate headquarters in Oxford, CT. We had one tape for each month of the year and wanted them merged so that, if there were records for the same customer in multiple months, the information would be combined into a single record. The person assigned to write this program was a supposedly experienced programmer by the name of Al Love who had come to us from New York City. He attempted to write a program that did a 12-way match, i.e. 12 tapes in and one tape out. Our large mainframe at the time had 16 tape drives, but it normally ran four concurrent programs and had four drives assigned to each of the “partitions”. In order for Al to run his program, the operators had to wait until all the programs currently running had completed, then reassign all the tape drives to a single partition. They would then mount 12 tapes on the input drives and one blank tape on the output drive. Inevitably, Al’s program would crash, and the operators would have to dismount all 13 tapes, manually reassign the tape drives to the four partitions, and then start running “normal” programs again. The operators were NOT happy when they saw a test of Al’s program in their input queue.

After two months of trying to get his program to run, management got fed up with Al and he was terminated. They felt they had wasted two months of salary for an experienced programmer. They reassigned the job to a new hire. He knew that he was not skilled enough to write such a program, so he wrote a two-way match program (two tapes in, one out), and ran it 11 times, each time merging in one more month of data.

As I revisited Al’s problem later, I realized that he was trying to identify and properly handle 2^12 - 1 possibilities (record in file 1 but no others, record in file 1 and 2 but not others, …). That was going to require 4,095 IF statements with 12 parts each, or nearly 50,000 lines of code, just to identify all the situations. No way his solution was ever going to work! There was another way that reduced the central decision-making logic to fewer than 20 lines of code (roughly a 2,500-fold reduction in code).
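One approach that collapses the logic like that is what we would now call a k-way merge: keep one current record from each sorted input, take the lowest customer key among them, and combine whatever inputs happen to have that key. A sketch (in Python rather than the COBOL of the day; merge_by_customer, the (key, record) tuples, and the combine function are all invented for illustration):

    import heapq
    from itertools import groupby
    from operator import itemgetter

    def merge_by_customer(inputs, combine):
        """inputs: iterables of (key, record) pairs, each already sorted by key.
        combine: folds all the records that share one key into a single output record."""
        merged = heapq.merge(*inputs, key=itemgetter(0))     # one pass, k-way merge by key
        for key, group in groupby(merged, key=itemgetter(0)):
            yield key, combine(rec for _, rec in group)

    # Three "months" instead of twelve, and the records are just dollar amounts:
    jan = [(101, 5), (103, 2)]
    feb = [(101, 1), (102, 7)]
    mar = [(103, 4)]
    print(list(merge_by_customer([jan, feb, mar], sum)))
    # [(101, 6), (102, 7), (103, 6)]

The decision logic never has to ask "which combination of the 12 files is this customer in?"; it only compares keys, so it stays the same size no matter how many input tapes there are.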

The Long-running Indexed File Update – A few years later I was working for the Winchester Rifle division of Olin Corporation at their division headquarters in New Haven, CT. One of my roles was overseeing our online system, and I had written a number of utilities to help me in my work. One morning I came to work and was confronted by the operations staff, who were very worried that an update program they had started running early the previous evening was still running. They wanted to know if they should cancel it or if it was going to end soon. I pulled out the compilation listing for the program that was executing. It was an indexed file update. It seemed that we had recently acquired a large number of customers from another company, and the prior night all of the new customers had been fed into the customer-add program.

One of my online utilities enabled me to enter a memory address and see what was in memory (while the program was running). Cross-partition reading was allowed (but cross-partition updating obviously was not). Knowing the addresses for the start of each partition (we were running a version of the operating system known as MFT, Multiprogramming with a Fixed number of Tasks, i.e. fixed memory partitions), I looked at the compilation listing, which showed where in the partition the various parts of the program and data would be stored – in particular, where the file buffers were located.

I located the buffers using my utility and was able to determine the key of the record that was currently being processed. From my examination of the program, it looked like the problem was that all these new customers had been assigned customer numbers in sequential order. Thus each new customer landed in the same block of the file, requiring a record to be pushed off into the unblocked overflow area. Since the input file of new customers was sorted in ascending order, each new record required the computer to trace down the overflow chain for that block and add the new record to the end. The first record required only 1 read of the overflow chain, but the 2nd required 2 reads, the 3rd required 3 reads, etc. The chain grew longer with each new customer added. The program had run for 12 hours already, and based on my analysis of how many customers had been added and how many there were to go, I estimated it would take another two days to complete. This was not an acceptable solution! So we cancelled the running program and restored the customer master file (always take backups before running an update!).

The solution was pretty simple – I just sorted the file of new customers in descending order instead of ascending order. Now, for each add, the program would start tracing down the overflow chain, find that the first record on the chain had a higher key, and simply drop the new record into the chain right there. When we restarted the update, the estimated 60-hour job completed in just 5 minutes. Knowing how indexed files operated saved the day!
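The difference is easy to see with a little simulation (a toy model only: bare keys, an in-memory list standing in for the overflow chain, and an assumed 2,000 new customers all landing in the same data block; on the real system every one of those chain reads was a disk I/O):

    def chain_reads(new_keys):
        """Count the overflow records that must be read to insert each new key in turn."""
        chain = []                       # the overflow chain for the one block the new customers hit
        reads = 0
        for key in new_keys:
            i = 0
            while i < len(chain) and chain[i] < key:
                reads += 1               # read this overflow record, its key is lower, keep going
                i += 1
            if i < len(chain):
                reads += 1               # read the record we end up inserting in front of
            chain.insert(i, key)
        return reads

    new_customers = list(range(2000))
    print(chain_reads(new_customers))                    # ascending order: about 2 million reads
    print(chain_reads(list(reversed(new_customers))))    # descending order: 1,999 reads

Sorting the adds in descending order turns a chain walk that grows with every insertion into a single read per insertion.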

Conclusion

In the forty years since the above incidents, file systems have gotten much larger and more complicated. We now tend to use databases for many things and let the database software deal with all the details. But “under the covers” some of the same things mentioned above are still going on; we just don’t have to deal with them ourselves.


I certainly learned a lot during my involvement in the early days of computing. And knowing those details often helped with my understanding of what was happening behind the scenes as technology advanced and got more complicated. But I’m glad to be retired and not having to deal with it any more.

1 comment:

  1. Glad you simplified that one! When I was at university, they introduced DEC-tapes for our DEC-10 computer. These were indexed tapes. They were wider and much fatter than a standard mag tape. I don't remember how much they held, or how long the technology persisted.

    Now I think of my 1 TB hard drive on my computer. Just the index file alone must be huge. And managing large numbers of both large (image and video) files and small files (especially all those temporary things from web pages) makes finding an ideal block size complicated.

    Some of the residuals of those days hung around a while. I remember our computer guy at the office back in the late 90s using the "mount" command on our Unix system to get the computer to recognize a drive that was already installed. When I used the "mount" command in college, some guy had to go to the rack of mag tapes, find the one I specified and physically mount it on the tape drive.

    Imagine if I had to do that every time I stuck something in a USB port or optical drive!
