In
my last blog my youngest brother commented, “Now you need to do a post on the
transition from sequential access (mag tapes) to FAT access (or whatever they
called the earliest version of non-sequential access).” While that might seem
like a simple request, there are a lot of technical details that will make this
blog entry fairly long and complicated.
However,
before I get started, I think that it’s interesting to note how this question
illustrates the rapid rate at which computing technology has changed. Even
though my brother is only 10 years younger than me, he uses the term “FAT
access” to describe non-sequential access. The term FAT is from PC technology
and represents the computing age that he grew up in. But my computing memories
are from the earlier mainframe era of the late 1960s, when personal computing was
not part of our vocabulary – or even of our vision of the future.
I’m
going to try to address this topic by writing about three different file access
methods, going into a bit of technical detail on each, then finishing up with
a couple of true stories from my early computer days that illustrate how one
can get into trouble in dealing with files.
Sequential (Tape) Access
The
standard magnetic tape when I entered the computer field was ½” wide and was on
reels holding either 1200’ or 2400’ of tape. The recording density in those days
was 200 bpi (bits per inch), 556 bpi, or 800 bpi (later improvements
added 1600 bpi and 6250 bpi). There were either 7 or 9 tracks across the tape
(depending on the type of computer). At low density, a single 2400’ reel had a
capacity of 5MB (200 bpi x 12 inches/foot x 2400 feet) – not very much in terms
of today’s media, but the equivalent of 36 boxes of punch cards (2000 cards/box),
a considerable savings in physical space. It was possible with a magnifying glass to
see the individual recorded spots on these low density tapes and to manually “read”
the contents.
However,
as with all physical media, it is not possible to match the processing speed of
the computer with the speed of the device. Therefore, we must employ some
method that allows us to manage the differences in speed. There are two primary
methods with magnetic tape.
Blocking –
Tape drives cannot start/stop “on a dime”. The tape is light so inertia is pretty
low, but even so to go from reading speed (100+ inches/second) to zero and then
back up to full speed means that we need to have a certain amount of “blank”
space between the blocks of data. This IRG (Inter-Record Gap) was .75”. At 800
bpi, the equivalent of a punched card’s worth of data (80 characters) can be stored in .1”. That
would mean only .1” of data followed by .75” of gap, cutting the usable capacity of a
2400’ reel to roughly 2.7MB (about 17 boxes of cards) out of a nominal 23MB at that density. Therefore, we want
to store multiple records together before having each IRG. If we increase the
size of each data block from 80 to 800 characters, we have 1” of data for every .75” of
IRG. This also reduces the wear and tear on the tape (and the tape drive) as we
are not having to stop/start for every record, but only every 10 records.
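To make the blocking arithmetic concrete, here is a little Python calculation (Python obviously didn’t exist back then – this is just a sketch using the figures above: a 2400’ reel, 800 bpi, 80-character records, and a .75” gap):

    # Effective capacity of one tape reel for various blocking factors
    # (records per block), using the figures from the text.
    REEL_INCHES = 2400 * 12    # a 2400-foot reel
    DENSITY = 800              # characters per inch
    RECORD_CHARS = 80          # one punched card's worth of data
    IRG_INCHES = 0.75          # inter-record gap between blocks

    def effective_capacity(records_per_block):
        """Characters of data that actually fit on the reel."""
        block_chars = RECORD_CHARS * records_per_block
        block_inches = block_chars / DENSITY + IRG_INCHES
        blocks_on_reel = int(REEL_INCHES / block_inches)
        return blocks_on_reel * block_chars

    for factor in (1, 10, 100):
        print(f"{factor:4} records/block: {effective_capacity(factor):,} characters")

Unblocked records come out to roughly 2.7 million characters, 10 records per block gets you over 13 million, and 100 records per block over 21 million – which is exactly why we blocked records, and also why the next question is how big a block we can afford.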
But
one might logically ask, why not just make the blocks really big (say 10” or
8000 characters long)? The answer is that main computer memory was very limited
in those days. Before I finish this answer, let me discuss the
other method used to help match computer and tape drive speeds.
Buffering –
It’s best if the computer never has to wait for a physical tape movement. One
of the ways to do this is to “read ahead”, i.e. to have the tape drive read
the next block of data into main memory while the computer is processing the
data from the previous block. This was nearly always done, but it doubled the
amount of main memory required.
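The read-ahead idea is still with us. Here is a minimal sketch of double buffering in Python (not anything we would have written then – the file name and block size are made up for illustration): a background thread keeps one block “in flight” while the main loop processes the previous one.

    import threading
    import queue

    BLOCK_SIZE = 800   # one block = 10 records of 80 characters

    def read_ahead(path, blocks):
        """Producer: read the next block while the consumer works on the last one."""
        with open(path, "rb") as tape:
            while block := tape.read(BLOCK_SIZE):
                blocks.put(block)      # waits if the one-block buffer is already full
        blocks.put(None)               # end-of-tape marker

    def process(block):
        pass                           # whatever the program would do with one block

    blocks = queue.Queue(maxsize=1)    # exactly one block read ahead = double buffering
    reader = threading.Thread(target=read_ahead, args=("input.tape", blocks))
    reader.start()
    while (block := blocks.get()) is not None:
        process(block)
    reader.join()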
Back
to limitations – if one has a simple program that is just reading from one
tape, performing some simple procedure, and writing the results to another
tape, the program will require four buffers (two for each tape), each of which
might be 800 characters long, for a total of 3200 characters of data. Early
computers had very limited main memory, primarily because memory was really
expensive. An average computer might have only 32K of main memory. That means
that even this simple program is using 10% of the memory just for data buffers,
reducing the amount available for both the program processing the data and the
operating system. If we made our blocks really big (the 8000 characters
mentioned above), our four buffers would consume essentially all of the
available main memory, leaving nothing for the program and operating system!
Such
were the limitations of even sequential tape processing in the early days.
Direct (Relative) Access
While
there were some disk systems prior to the mid-1960s, they were very expensive
and not commonly used. For example, the computer that I learned on (a CDC 3600 –
one of the early super-computers) had high-speed drum storage. This drum was 4’
long and about 1’ in diameter and had a head per track. However, the amount of
storage on it was only 32K words (there were six bits per character and six
characters per word, so it was a 192K computer in today’s measures) – the same
as the main memory of the computer. It was only used for swapping out programs
so that we could run more than one program. Certainly it was too expensive to “waste”
for any data storage.
One
of the first removable-media disks was introduced by IBM in 1962: the IBM
1311. Two years later, coincident with the introduction of the IBM 360 line
of computers, the 2311 was introduced. The 2314 followed a year later.
The 2311 had a capacity of 7M per storage device, while the 2314 had a capacity of
28M. These numbers may seem pretty trivial these days, but that was a LOT of capacity
back then.
One
of the significant differences between tape and disk is that the disk is
constantly spinning, so no IRG is needed. However, the size of one block of data
is limited to the amount of data that can be stored on a single track
(there are 2,000 tracks per unit on a 2311, so that would be about 3.5K). The
second difference is that one was no longer limited to reading the tracks in
sequential order – they could be accessed in any order. This gave rise to the first
type of direct/relative access.
The
term “relative” is used because each block of data was addressed relative to
the first block of that file – for example, the 10th block is the
same distance from the beginning regardless of whether the file occupies blocks
100-199 or 550-649. The question then reduces to determining which
relative block a given record should be stored in.
Most
real-world data is not nicely numbered from 1-N, so we need an algorithm to “map”
the key of the data to the range of 1-N (actually 0 to N-1). If the data key is
numeric, then a simple way might be to divide by the number of blocks and take
the remainder (note that this is the origin of calling something a “customer
number” even though many computer systems identify customers by a key
that may contain characters other than digits – in the early days we really did use
only numbers, for ease of access methods like this one). I won’t go into any more
detail, but there is a lot involved in doing that mapping in the best way
possible.
But
there was another problem – what if two records mapped to the same place? You
can’t just overwrite what is there. The usual solution is to then start reading
sequentially from that point and look for an available opening. So for example
if our mapping was to take the last two digits of the key we would initially
try to store customer 429 in block 29. But if customer 229 was already there,
then we would look at block 30, then 31, etc. until we found an unoccupied
block and put customer 429 there (say for example in block 32). In doing
lookups we would follow the same type of logic, looking for customer 429 in
block 29, and if not finding it there, then looking in subsequent blocks.
Again, I won’t go into more detail here. But the complexities of using these
kinds of files meant that they were not heavily used.
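To show how the mapping and the “look in the next block” rule work together, here is a small Python sketch of a relative file held in a plain list – the last-two-digits mapping from the example above plus linear probing on collisions. This is just the idea, not how any particular system actually implemented it:

    NUM_BLOCKS = 100                 # relative blocks 0..99
    blocks = [None] * NUM_BLOCKS     # None means the block is unoccupied

    def home_block(key):
        """Map a customer number to its home block (here, the last two digits)."""
        return key % NUM_BLOCKS

    def store(key, record):
        """Put a record in its home block, probing forward if that block is taken."""
        block = home_block(key)
        while blocks[block] is not None:
            block = (block + 1) % NUM_BLOCKS   # wrap around at the end of the file
        blocks[block] = (key, record)
        return block

    def lookup(key):
        """Start at the home block and scan forward until found or an empty block."""
        block = home_block(key)
        while blocks[block] is not None:
            if blocks[block][0] == key:
                return blocks[block][1]
            block = (block + 1) % NUM_BLOCKS
        return None

    store(229, "customer 229")
    print(store(429, "customer 429"))   # home block 29 is taken, so 429 lands in block 30
    print(lookup(429))                  # found again by the same probing rule

A real system also had to worry about deletions and about what happens when the file fills up, which is part of why these files were trickier to use than they first appear.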
Indexed (Indexed-Sequential) Access
Although
computing was taking off beginning in the early 1960s, most processing was
still using sequential files. The COmmon Business Oriented Language (COBOL) was
initially made available in 1960 and quickly became the most often used
language to solve business problems. Indexed files did not appear in COBOL
until the release of COBOL-68. Indexed files were implemented by having three
separate areas of a disk for different purposes. The index area had one record
for each block of data and listed the first and last key in the block as well
as a pointer to it. The data area was where the blocks of information were
stored. Besides the data, each block held a few other pointers, including one to the
overflow area (if that block had overflowed – more on this below). The overflow
area consisted of unblocked records that could not fit in the block of the
data area to which they would normally be assigned.
If
records were deleted, then they were simply marked (most generally by putting
all binary 1s in the first byte of the record). If records were added and there
were no spaces available in the block to which they were assigned, then a
record would be pushed into overflow. For example, if a block contained records
with keys of 1, 2, 3, 5, 9 and a record with key of 7 was added, then the 7
record would be placed in the main block (1, 2, 3, 5, 7) and the record with a
key of 9 would be placed into overflow and a pointer to it placed in the main
data block. If another record with a key of 4 were added, the main data block
would become (1, 2, 3, 4, 5) and the record with a key of 7 would go into overflow.
The main block would have a pointer to the “7” record and the “7” record would
have a pointer to the “9” record. Periodically, the indexed file would be “unloaded”
into a sequential file and then reloaded so that all records would be in the
main data area again.
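Here is a rough Python sketch of that overflow behavior – one prime block that holds five records plus its overflow chain kept in key order. Real ISAM did this with disk addresses and pointers rather than Python lists, so treat this purely as an illustration of the example above:

    BLOCK_CAPACITY = 5

    class IsamBlock:
        """One prime data block plus its overflow chain (keys only)."""
        def __init__(self, keys):
            self.keys = sorted(keys)   # records in the prime block, in key order
            self.overflow = []         # overflow chain, also kept in key order

        def add(self, key):
            if len(self.keys) < BLOCK_CAPACITY:
                self.keys.append(key)
                self.keys.sort()
                return
            if key < self.keys[-1]:
                # The new record belongs in the prime block, so the highest
                # record gets pushed out to the overflow chain to make room.
                key, self.keys[-1] = self.keys[-1], key
                self.keys.sort()
            # Walk the chain and link the displaced record in key order.
            pos = 0
            while pos < len(self.overflow) and self.overflow[pos] < key:
                pos += 1
            self.overflow.insert(pos, key)

    block = IsamBlock([1, 2, 3, 5, 9])
    block.add(7)                        # prime block: 1,2,3,5,7   overflow: 9
    block.add(4)                        # prime block: 1,2,3,4,5   overflow: 7 -> 9
    print(block.keys, block.overflow)   # [1, 2, 3, 4, 5] [7, 9]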
Horror Stories
A Sequential File
Nightmare – In the early 1970s I was working for
Uniroyal at their then corporate headquarters in Oxford, CT. We had a situation
where we had one tape for each month of the year and we wanted them merged so
that if there were records for the same customer in multiple months that the
information would be combined in a single record. The person assigned to write
this program was a supposedly experienced programmer by the name of Al Love who
had come to us from New York City. He attempted to write a program that did a
12-way match, i.e. 12 tapes in and one tape out. Our large mainframe at the
time had 16 tape drives, but it normally ran four concurrent programs and had
four drives assigned to each of the “partitions”. In order for Al to run his
program, the operators had to wait until all the programs currently running had
completed, then reassign all the tape drives to a single partition. They would
then mount 12 tapes on the input drives and one blank tape on the output drive.
Inevitably, Al’s program would crash, then the operators would have to dismount
all 13 tapes, manually reassign the tape drives to the four partitions and then
start running “normal” programs. The operators were NOT happy when they saw a
test of Al’s program in their input queue.
After
two months of trying to get his program to run, management got fed up with Al
and he was terminated. They felt they had wasted two months of salary for an
experienced programmer. They reassigned the job to a new hire. He knew that he
was not skilled enough to write such a program, so he wrote a two-way match
program (two tapes in, one out), and ran it 11 times, each time merging in one
more month of data.
As
I revisited Al’s problem later I realized that he was trying to identify and
properly handle 2^12 – 1 = 4,095 possibilities (record in file 1 but no others, record in
files 1 and 2 but not the others, …). That was going to require 4,095 IF statements
with 12 parts each, or over 50,000 lines of code, just to identify all the situations.
No way his solution was ever going to work! There was another way that reduced
the solution set to less than 20 lines of code for the central decision making
IF statement (a 2500-fold reduction in code).
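One way to get the central decision logic down to that size is the classic N-way merge: hold one current record from each tape, find the lowest key among them, combine every record carrying that key into one output record, and advance only those tapes. A rough Python sketch of the idea – the record layout and the combine step are invented just for illustration:

    def combine(records):
        """Fold all the monthly records for one customer into a single record."""
        merged = {"key": records[0]["key"]}
        for rec in records:
            merged.update({k: v for k, v in rec.items() if k != "key"})
        return merged

    def merge(tapes):
        """N-way merge of sorted record streams, combining records with equal keys."""
        current = [next(t, None) for t in tapes]        # one record "in hand" per tape
        while any(rec is not None for rec in current):
            low = min(rec["key"] for rec in current if rec is not None)
            matching = [rec for rec in current if rec is not None and rec["key"] == low]
            yield combine(matching)                     # one output record per key
            for i, rec in enumerate(current):           # advance only the tapes that matched
                if rec is not None and rec["key"] == low:
                    current[i] = next(tapes[i], None)

    # Three pretend "tapes", each a sorted stream of records keyed by customer number.
    jan = iter([{"key": 101, "jan": 40}, {"key": 103, "jan": 75}])
    feb = iter([{"key": 101, "feb": 30}])
    mar = iter([{"key": 102, "mar": 55}, {"key": 103, "mar": 10}])
    for record in merge([jan, feb, mar]):
        print(record)

The same dozen or so lines handle 2 tapes or 12; the 4,095 combinations never have to be spelled out, because each pass asks only one question – which tapes match the current lowest key.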
The Long-running Indexed
File Update – A few years later I was working for the
Winchester Rifle division of Olin Corporation at their division headquarters in
New Haven, CT. One of my roles was overseeing our online system and I had
written a number of utilities to help me in my work. One morning I came to work
and was confronted by the operations staff who were very worried that an update
program they had started running early the previous evening was still running.
They wanted to know if they should cancel it or if it was going to end soon. I
pulled out the compilation listing for the program that was executing. It was an
indexed file update. It seems that we had recently acquired a large number of
customers from another company, and the prior night we had fed all the new
customers into the customer-add program.
One
of my online utilities enabled me to enter a memory address and see what was in
memory (while the program was running). Cross-partition reading was allowed
(but cross-partition updating obviously not). Knowing the addresses for the
start of each partition (we were running a version of the operating system
known as MFT where the “F” stands for Fixed partitions), I looked at the
compilation listing which showed where in the partition the various parts of
the program and data would be stored – in particular where the file buffers
were located.
I
located the buffers using my utility and was able to determine the key of the
record that was currently being processed. In my examination of the program it
looked like the problem was that all these new customers had been assigned
customer numbers in sequential order. Thus each new customer would land in the
same block of the file, requiring a record to be pushed off into the unblocked
overflow area. Since the file was sorted in ascending order, each new record
would then require that the computer trace down the overflow chain for that
block and add the new record to the end. The first record would require only 1
read of the overflow chain, but the 2nd would require 2 reads, the 3rd
would require 3 reads, etc. The chain was growing longer with each new customer
to be added. The program had run for 12 hours already and based on my analysis
of how many customers had been added and how many there were to go, I estimated
it would take another two days to complete. This was not an acceptable
solution! So we cancelled the running program and restored the customer master
file (always take backups before running an update!).
The
solution was pretty simple – I just sorted the file of new customers in
descending order instead of ascending order. Now for each add, the program
would start tracing down the overflow chain, find that the first record on the
chain had a higher key and simply drop the new record into the chain right
there. When we restarted the update the estimated 60 hour job was completed in
just 5 minutes. Knowing how indexed files operated saved the day!
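To see why the sort order mattered so much, here is a back-of-the-envelope Python simulation of the overflow chain. The customer counts are made up; the point is the shape of the growth – ascending insertions walk an ever-longer chain (roughly N²/2 reads in total), while descending insertions find their place at the first link every time (roughly N reads):

    def chain_reads(new_keys):
        """Total overflow-chain records read while adding new_keys one at a time,
        keeping the chain in ascending key order (as the indexed file did)."""
        chain, reads = [], 0
        for key in new_keys:
            pos = 0
            while pos < len(chain) and chain[pos] < key:
                reads += 1               # had to read this chain record to get past it
                pos += 1
            if pos < len(chain):
                reads += 1               # read the record we insert in front of
            chain.insert(pos, key)
        return reads

    N = 5_000                            # pretend number of new sequential customer numbers
    print(f"ascending:  {chain_reads(range(N)):,} chain reads")
    print(f"descending: {chain_reads(reversed(range(N))):,} chain reads")

For 5,000 new keys that comes out to roughly 12.5 million chain reads versus about 5,000 – the same kind of difference that turned the estimated 60-hour run into a 5-minute one.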
Conclusion
In
the forty years since the above incidents, file systems have gotten
increasingly larger and more complicated. We now tend to use databases for many
things and let the database software deal with all the details. But “under the
covers” some of the same things mentioned above are still going on – we just don’t
have to deal with them.
I
certainly learned a lot during my involvement in the early days of computing.
And knowing those details often helped with my understanding of what was
happening behind the scenes as technology advanced and got more complicated.
But I’m glad to be retired and not to have to deal with it anymore.