
Return to the table of contents
In this chapter we will develop an algorithm that uses the quantum file access method to handle a file containing a large number of records each of which can vary dynamically in size. In order to appreciate the power of this access method, we will start by considering the much simpler problem of access to fixed-length records.
The Quantum File Access Method
Let us suppose that we want to access a number of fixed-length customer records by their record numbers.1 Given the record number, we can locate that customer's record by multiplying the record number by the length of a record, which gives us the offset into the file where that record will be found. We can then read or write that record as needed.
Of course, a real application needs to reuse records of customers who have become inactive, to prevent the customer file from growing indefinitely. To handle this problem, we could set a limit on the file size and, when it is reached, start reusing records that haven't been referenced for a long time, making sure to correct or delete any records in other files that refer to the deleted customer.
This fixed-length record system is not very difficult to implement, but it has significant drawbacks; address fields, for example, tend to vary greatly in length, with some records needing 25 or 30 characters for a city name or street address and others needing only 10. If we allocate enough storage for the extreme case, the records become much longer than if we had to handle only the average case. However, allocating enough for only the average case will leave those customers whose names or addresses won't fit into the allotted space quite unhappy, as I know from personal experience as a software developer! The obvious solution is to allow the fields (and therefore the records) to vary in length as necessary.
Unfortunately, variable-length records are much more difficult to deal with than fixed-length ones. The most obvious reason, as discussed above, is that determining where fixed-length records are stored requires only a simple calculation; this is not true of variable-length records. However, we could remedy this fairly easily by maintaining a record index consisting of an array of structures containing the starting location and length of each record, as depicted in Figure recindex.2
+- Starting Record
| Index Address Length
| +---------------+
| 0 | 0 12 +----+
Record | 1 | 12 12 +----++
Index | 2 | 24 7 +----++-------+
Array | 3 | 31 2 +----++-------+------+
| 4 | 33 5 +----++-------+------+-+
| +---------------+ || | | |
+- +--------------+| | | |
| +---+ | | |
+- | | | | |
| ++-----------+-----------+------+-+----+
Record | |Steve HellerP.O.Box 0335BaldwinNY11510|
Data | +--------------------------------------+
+-
+-
Record | 0000000000111111111122222222223333333333
Offset | 0123456789012345678901234567890123456789
+-
We encounter a more serious difficulty when we want to delete a record and reuse the space it occupied.3 In some situations we can sidestep the problem by adding the new version of the record to the end of the file and changing the record pointer to refer to the new location of the record; however, in the case of an actively updated file such an approach would cause the file to grow rapidly.
But how can we reuse the space vacated by a deleted record of some arbitrary size? The chances of a new record being exactly the same size as any specific deleted one are relatively small, especially if the records average several hundred bytes each, as is not at all unusual in customer data files. A possible solution is to keep a separate free list for each record size and reuse a record of the correct size. However, there is a very serious problem with this approach: a new record may need 257 bytes, for example, and there may be no available record of exactly that size. Even though half of the records in the file might be deleted, none of them could be reused, and we would still be forced to extend the file. The attempt to solve this difficulty by using a record that is somewhat larger than necessary leads to many unusably small areas being left in the file (a situation known as fragmentation).
However, there is a relatively unknown way to make variable-length records more tractable: the quantum file access method.4 The key is to combine them into groups of fixed length, which can then be allocated and deallocated in an efficient manner.
Before the following discussion will make much sense to you, I will need to explain in general terms what we're trying to accomplish: building a virtual memory system that can accommodate records of varying lengths in an efficient manner. This means that even though at any given time, we are storing most of our data on the disk rather than maintaining it all in memory, we will provide access to all the data as though it were in memory. To do this, we have to arrange that any actively used data is actually in memory when it is needed. In the present application, our data is divided into fixed-size blocks called quanta (plural of quantum),5 so the task of our virtual memory system is to ensure that the correct blocks are in memory as needed by the user.6 The quanta in the file are generally divided into a number of addressable units called items.7 When adding a record to the file, we search a free space list, looking for a quantum with enough free space for the new record. When we find one, we add the record to that quantum and store the record's location in the item reference array, or IRA, which replaces the record index in Figure recindex; this array consists of entries of the form "quantum number, item number".8 The item number refers to an entry in the item index stored at the beginning of the quantum; the items are stored in the quantum in order of their item index entries, which allows the size of an item to be calculated rather than having to be stored.
For example, if we were to create an array of variable-length strings, some of its item references might look like those illustrated in Figure itemref1.
+-
| Quantum Item
| Index Number Number
| +-------------+
Item | 0 | 3 1 +-----------------+
Reference | 1 | 3 2 +-----------------+-+
Array | 2 | 3 3 +-----------------+-+--+
(IRA) | 3 | 3 4 +-----------------+-+--+-+
| 4 | 3 5 +-----------------+-+--+-+-+
| +-------------+ | | | | |
+- | | | | |
| | | | |
+- Item # Offset Type Index | | | | |
| +--------------------------+ | | | | |
| 1 | 12 -+ VARSTRING 0 |±--+ | | | |
| 2 | 24 +| VARSTRING 1 |±----+ | | |
Item | 3 | +-31 || VARSTRING 2 |±-------+ | |
Index | 4 | ++-33 || VARSTRING 3 |±---------+ |
for | 5 |+++-38 || VARSTRING 4 |±-----------+
Quantum | 6 |||| 38 || UNUSED 0 |
3 | 7 |||| 38 || UNUSED 0 |
| 8 |||| 38 || UNUSED 0 |
| 9 |||| 38 || UNUSED 0 |
| 10 |||| 38 || UNUSED 0 |
| ++++----++-----------------+
+- ||| |+-----------------+
||| +------+ |
||+----+ | |
+- |+---+ | | |
| +--+----+-+------+-----------+-----------+
Quantum | | 11510NYBaldwinP.O.Box 0335Steve Heller|
Data | +----------------------------------------+
+-
+-
Quantum | 333333333222222222211111111110000000000
Offset | 876543210987654321098765432109876543210
+-
When we delete an item from a quantum, we have to update the free space list entry for that quantum to reflect the amount freed, so the space can be reused the next time an item is to be added to the file. We also have to slide the remaining items in the quantum together so that the free space is in one contiguous block, rather than in slivers scattered throughout the quantum. With a record index like the one in Figure recindex, we would have to change the record index entries for all the records that were moved. Since the records that were moved might be anywhere in the record index, this could impose unacceptable overhead on deletions; to avoid this problem, we will leave the item index entry for a deleted item empty rather than sliding the other entries down in the quantum, so that the IRA is unaffected by changes in the position of records within a quantum. If we delete element 1 from the array, the resulting quantum looks like Figure itemref2.
+-
| Quantum Item
| Index Number Number
| +-------------+
Item | 0 | 3 1 +-------------------+
Reference | 1 | NONE 0 | |
Array | 2 | 3 3 +-------------------+----+
(IRA) | 3 | 3 4 +-------------------+----+-+
| 4 | 3 5 +-------------------+----+-+-+
| +-------------+ | | | |
+- | | | |
| | | |
+- Item # Offset Type Index | | | |
| +----------------------------+ | | | |
| 1 | 12 -+ VARSTRING 0 |±--+ | | |
| 2 | 12 | UNUSED 0 | | | |
Item | 3 | 19-+| VARSTRING 2 |±-------+ | |
Index | 4 | +--21 || VARSTRING 3 |±---------+ |
for | 5 |++--26 || VARSTRING 4 |±-----------+
Quantum | 6 ||| 38 || UNUSED 0 |
3 | 7 ||| 38 || UNUSED 0 |
| 8 ||| 38 || UNUSED 0 |
| 9 ||| 38 || UNUSED 0 |
| 10 ||| 38 || UNUSED 0 |
| +++-----++-------------------+
+- || |+-----------------+
|| +-----------+ |
|+---------------+ | |
+- +-----------+ | | |
| +--------------+----+-+------+-----------+
Quantum | | 11510NYBaldwinSteve Heller|
Data | +----------------------------------------+
+-
+-
Quantum | 333333333222222222211111111110000000000
Offset | 876543210987654321098765432109876543210
+-
Note that we set the offset value of the deleted item equal to the offset of the item before it so that its length can be calculated as zero. Now let us add a new street address of "123 Main Street". After this change, our data structures look like those in Figure itemref3.
+-
| Quantum Item
| Index Number Number
| +-------------+
Item | 0 | 3 1 +------------------+
Reference | 1 | 3 2 +------------------+-+
Array | 2 | 3 3 +------------------+-+--+
(IRA) | 3 | 3 4 +------------------+-+--+-+
| 4 | 3 5 +------------------+-+--+-+-+
| +-------------+ | | | | |
+- | | | | |
| | | | |
+- | | | | |
|Item # Offset Type Index | | | | |
| +---------------------------+ | | | | |
| 1 | 12 -+ VARSTRING 0 |±--+ | | | |
| 2 | 27 +| VARSTRING 1 |±----+ | | |
Item | 3 | +-34 || VARSTRING 2 |±-------+ | |
Index | 4 | ++-36 || VARSTRING 3 |±---------+ |
for | 5 +-+-++-41 || VARSTRING 4 |±-----------+
Quantum | 6 | | || 38 || UNUSED 0 |
3 | 7 | | || 38 || UNUSED 0 |
| 8 | | || 38 || UNUSED 0 |
| 9 | | || 38 || UNUSED 0 |
| 10 | | || 38 || UNUSED 0 |
| | +-++----++------------------+
+- | || |+-----------------+
| || +---+ |
| |+-+ | |
+- | ++ | | |
| ++----+-+------+--------------+-----------+
Quantum | |11510NYBaldwin123 Main StreetSteve Heller|
Data | +-----------------------------------------+
+-
+-
Quantum | 443333333333222222222211111111110000000000
Offset | 109876543210987654321098765432109876543210
+-
Of course, the deleted item still takes up a slot in the item index of the old quantum; in our implementation this requires 10 bytes of storage.9 We need a method of reclaiming that space, so that we do not gradually fill our quanta with indexes pointing to nothing. We have already seen part of the solution to this problem: whenever we are adding an item to a quantum, we search for an unused item index entry rather than simply adding a new entry at the end.
This mechanism alone would only slow the growth of the item indexes; actually reducing the size of an index requires us to check for empty entries at the end of the item index after deleting an item. If such excess entries exist, we reclaim the space they occupy by reducing the entry count in the item index header to the number actually in use. This reduces wasted space in the item index without affecting entries in the IRA. Since we remove unused item index entries only when they are at the end of the index, no IRA entries can refer to items later in the same quantum.
Our records are stored in the quanta pointed to by our IRA. The IRA will be stored in its own quanta, which allows us to make it as large as needed. However, this means that we can't allocate the IRA in one piece. Since a quantum has a fixed size, we will have to break up our IRA into segments that will fit in a quantum; these segments will be collectively referred to as the little pointer array. We'll also need some way to find the segment that we need for a particular element in the IRA, which will be the responsibility of the big pointer array.
Figure bigpointer shows the relationship between the big pointer array and the IRA segments.
+------------------------------------------+
| +- |
| Big | element_count = 8000 |
| Pointer | last_quantum_added_to = 3 |
| Array | |
| Header | |
| +- |
| +- |
| | Quantum | Quantum
| | Index Number | 5
| | +-------------+ |
| Big | 0 | 2 +-------+---+
| Pointer | 1 | 10 +-------+---+-+
| Array | 2 | 12 +-------+---+-+--+
| | 3 | 14 +-------+---+-+--++
| | 4 | 11 +-------+---+-+--+++
| | +-------------+ | | | |||
| +- | | | |||
| | | | |||
+------------------------------------------+ | | |||
+--------------------------------------+ | | |||
| |±------+ | |||
| +- | Quantum | |||
| | Quantum Item | 2 | |||
| | Index Number Number | | |||
| Item | +-------------+ | | |||
| Reference | 0 | 3 2 | | | |||
| Array | 1 | 3 4 | | | |||
| (IRA) | 2 | 3 5 | | | |||
| segment | 3 | 3 3 | | | |||
| 0 | 4 | 3 1 | | | |||
| | +-------------+ | | |||
| +- | | |||
+--------------------------------------+ | |||
+--------------------------------------+ | |||
| |±--------+ |||
| +- | Quantum |||
| | Quantum Item | 10 |||
| | Index Number Number | |||
| Item | +-------------+ | |||
| Reference | 0 | 3 12 | | |||
| Array | 1 | 4 6 | | |||
| (IRA) | 2 | 4 1 | | |||
| segment | 3 | 4 2 | | |||
| 1 | 4 | 4 3 | | ²²²
| | +-------------+ |
| +- |
+--------------------------------------+
Because our ArrayIndex type is
defined as an unsigned long, which is usually 4 bytes, we
can actually create arrays of that size.10
There is one more generalization of the quantum file access method that will make it more generally useful: the ability to create more than one array of variable-length items.
The mechanism by which we will accomplish this goal is called the "main object index", which contains the quantum number of the big pointer array for each object; the number of an object is set when the object is created, by a directory facility which we will examine later.
Figure mainindex shows the path from the main object index through the big pointer array, the IRA segment, and the item index, to finally arrive at the data.
+-
| Index Quantum Number
| +----------+
Main | 0 |NO_QUANTUM|
Object| 1 | 5 +-------------------------+
Index | 2 | 9 | |
| 3 | 1 | |
| 4 | 7 | |
+- +----------+ |
+-----------------------------------------------+ |
|Big +- |±----+
|Pointer | element_count = 8000 |
|Array | last_quantum_added_to = 3 |
|Header +- |
| +- |
| | Quantum | Quantum
| | Index Number | 5
| | +-------------+ |
|Big | 0 | 2 +--------------+---+
|Pointer | 1 | 10 | | |
|Array | 2 | 12 | | |
| | 3 | 14 | | |
| | 4 | 11 | | |
| +- +-------------+ | |
+-----------------------------------------------+ |
+-------------------------------------------+ |
| +- Quantum Item |±------+
| | Index Number Number | Quantum
|Item | +-------------+ | 2
|Reference | 0 | 3 2 +----------+----------+
|Array | 1 | 3 4 +----------+----------+-+
|(IRA) | 2 | 3 5 +----------+----------+-+-+
|segment | 3 | 3 3 +----------+-------+ | | |
|0 | 4 | 3 1 +----------+-----+ | | | |
| +- +-------------+ | | | | | |
+-------------------------------------------+ | | | | |
+-----------------------------------------------+ | | | | |
| +- | | | | | |
| | Item # Offset Type Index | | | | | |
| | +----------------------------+ | | | | | |
| | 1 | 5 -+ VARSTRING 4 |±+-+ | | | |
| | 2 | 17 +| VARSTRING 0 |±+---+--+ | |
|Item | 3 | +-19 || VARSTRING 3 |±+---+ | |
|Index | 4 | ++-31 || VARSTRING 1 |±+--------+ |
| | 5 |+++-38 || VARSTRING 2 |±+----------+
| | 6 |||| 38 || UNUSED 0 | | Quantum 3
| | ++++----++-------------------+ +---------+
| +- ||| |+------------------------+ |
| ||| +-------------+ | |
| ||+----------------+ | | |
| +- |+-----+ | | | |
|Quantum | +-+------+-----------+-+-----------+----+ |
|Data | | BaldwinP.O.Box 0335NYSteve Heller11510| |
| +- +---------------------------------------+ |
| 333333333222222222211111111110000000000 |
| 876543210987654321098765432109876543210 |
+---------------------------------------------------------+
Of course, we need a place to store the main object index. A normal application might have five or ten main objects; therefore, our allocation of 256 entries in the main object index is extremely generous in most cases.11
Let's start our examination of the implementation of the quantum file code by looking at a simple program that uses it to store a number of strings and then retrieve them (Figure flextest.00).
codelist/flextest.00
Possibly the most striking feature of this program is how simple it
appears, which is a direct consequence of the power of C++. Writing a
similar program in C would require much greater knowledge of the
implementation of the quantum file algorithm. However, in C++ all we
have to know is which header file to include, how to open the quantum
file, and how to create a FlexArray object, which behaves
like a variable-sized array of variable-sized strings. Once we have
dealt with this small set of prerequisites, we can use this extremely
powerful and flexible data type in virtually the same way we would use
a built-in type.
Let's take a detailed look at exactly how we set up and use a
quantum file in this program. First, we have a couple of statements
that declare a temporary buffer variable which we'll use to construct
data to store in the quantum file and delete any previous version of
the quantum file itself. Next, we create a QuantumFile
object called QF via the default constructor of the QuantumFile
class. Then we call an Open function on that
object, supplying the name of the quantum file to be created.
The next statement creates a FlexArray object called TestArray.
The arguments to this constructor are a pointer to the QuantumFile
object that we'll use to store the data for TestArray,
and a name by which we could reuse this same FlexArray
object in another program or another execution of this program.
Now we're up to the loop where we will start to use the TestArray
object. The loop steps through the first 100 elements of TestArray,
assigning each element a String consisting of digits
corresponding to the position of that element in the TestArray
object. The strstream named Temp is used to
format the numeric values of the loop variable i into String
values that can be stored in a FlexArray object.
Finally, after we've put the values into TestArray,
the second loop displays the values in the TestArray
object.
At this point, I wouldn't be at all surprised if you had come to the conclusion that I have something up my sleeve ... and you would be right. Underneath this apparent simplicity is a wonderland of complex programming, which I hope I will be able to explain in an understandable way. Let's begin by examining some of these data types we've just seen, in enough detail to demystify them.
The first of these data types is the String. As it
happens, this is actually a typedef for a more general
type called SVector, so we'll start with the question of
exactly what an SVector might be.
An SVector is very much like an array, with the
following exceptions:
SVector,
you will get an error message rather than a mysterious crash. SVector has,
by calling its size member function. SVector after it has
been created, by calling its resize member function. SVector of a particular type to
another SVector of the same type. SVector by copying from another SVector
of the same type. While these are very useful characteristics for any type of SVector,
they are even more important in one particular type of SVector:
the String, which is an SVector of chars.
As a specialization of the SVector type, the String
has all of the characteristics of any other SVector, but
has additional abilities as well. Primary among these is its ability to
be read or written using the standard C++ input/output stream
functions. This String type, or one very much like it, is
one of the most obvious improvements in C++ over C, whose corresponding
char* type is notoriously error-prone. In this program, we
will use the String data type in much the same way as we
would use the char* data type in a C program.
Unfortunately, we will not be able to go into detail on the
implementation of either the SVector or String
classes in this book; we will just use their facilities as
described immediately above.
typedefAnother "type" used in this simple test program is ArrayIndex.
This is also a typedef, but for a much simpler underlying
type: unsigned long.
In this case, the reason I didn't just use the underlying type was
to enhance portability. The original version of this program was
implemented on a 16-bit compiler, where the definition of ArrayIndex
was unsigned short, primarily because it was impossible
to create an array or SVector having more than 64K
elements. When I ported this program to a 32-bit compiler, I changed
the definition of ArrayIndex and several other similar typedefs
and recompiled.
I would like to be able to tell you that the program worked without
further changes, but that would be a lie: it actually took me a couple
of days to get it working again. However, considering the size of the
source code for the classes (over 100KB), I was quite
happy to take such a short time to port it.
If I had used the explicit type unsigned short
everywhere, it would have taken me much longer to figure out which
occurrences of that type should be changed to unsigned long
and which should remain as they were. As it was, I found a few places
where I did not use a typedef or used the wrong one and
ran into trouble as a result. Overall, though, I did a pretty good job
in using appropriate types originally, which saved me a lot of trouble
later.
Now let's start our journey into the interior of the quantum file
implementation, starting with the QuantumFile class
itself, whose header file is shown in Figure blockh.00.
QuantumFile class
(quantum\block.h) (Figure blockh.00)codelist/blockh.00
I can't say that I'm completely happy with the design of this class.
Ideally, the user of any class should not have to worry
about any member data or member variables that the user cannot or
should not use. Unfortunately, this is not easily accomplished in C++.
I believe it is worthwhile to digress slightly here to discuss the
reasons why this is so.
First, let's examine the difficulty of "hiding" private
and protected member variables from the class
user. The technical problem here is that the compiler cannot allocate
space for a variable unless its size is known. This implies that you
cannot create a variable of a given class without access
to its interface definition, which must in turn include the definitions
of all of its member variables.
This does not allow the class user to use those
member variables unless they are public, so we don't have
to worry about non-member functions accidentally altering member
variables. However, it does mean that the user of a class
must ensure that all of the data types needed in the declaration of a class
are available when the header for that class is compiled,
and that the names of those data types do not conflict with any
previously defined names. This can be a source of considerable
difficulty when trying to use a fairly large library in a context other
than the one in which it was developed, which of course reduces the
payoff that such reuse would otherwise provide.
The other problem with the design of this class is
that it defines a number of functions that application programmers have
no need for. These functions are intended for use by other classes
that contribute to the functionality of the quantum file
implementation. You might think that I could simply make these
functions private to prevent their being called by
outside classes. Unfortunately, I wasn't able to do that
with Microsoft Visual C++ 5.0, one of the compilers with which this
code must work, as that compiler apparently cannot understand friend
declarations that specify classes that are nested within
other classes, as some of mine are. Even if this did
work, I'd still prefer to spare the application programmer the details
of these functions.
Is there a solution to either or both of these problems? Yes, in fact both of these can be remedied. Let's start with the data problem.
The standard solution to avoid exposing member variables to the
application programmer is to create a secondary data structure that
contains the actual data for the object, and include only a pointer to
such a structure in the main object. Because the compiler knows the
size of a pointer to a structure without having to see the definition
of the structure itself, the header file that defines this secondary
structure can be included only when needed in the implementation of the
class. This frees the application programmer from any
concern about the representation of the data.
What about those functions that the application programmer doesn't
need to (and shouldn't) call? This can be solved in much the same way
as the data problem. In this case, however, the pointer in the visible class
points to an object of a class that supplies the
additional functions needed by the other implementation classes.
Because the client program would not include the header file that
defines this secondary class, it could not access those
additional functions. Of course, the other implementation classes
would include that header file so that they could access the additional
functions as needed.
Why didn't I implement either of these solutions (or better yet,
both)? Because doing so would have required wholesale changes to the
implementation, and I ran out of time. Of course, had I considered
these issues when I was first creating these classes,
they would have been fairly simple to solve, but I didn't think of them
at that time.
In any event, the design is usable as it stands, even if it could
be improved, so let's move on to how we can use it in its present form.
We'll start with the functions that are needed by client programs such
as our test program flextest.cpp. Figure block.00 shows the first of these
functions, QuantumFile::QuantumFile().
QuantumFile class
(from quantum\block.cpp) (Figure block.00)codelist/block.00
As you can see, this function's implementation consists entirely of
initialization expressions in the member initialization list. We'll
look at each of the member variables of this class in
detail when we get to the functions that use them. For now, I'll just
mention that they are all needed to keep track of the status of the
quantum file and the buffers that maintain parts of its data in memory.
Next, let's take a look at Figure block.01,
which shows the first "regular member function" in this class,
QuantumFile::Open. This is a good place to start, since we
can't do anything to a quantum file until we open it, as with any other
file.
Open function for the QuantumFile class
(from quantum\block.cpp) (Figure block.01)codelist/block.01
After declaring a number of variables, this function initializes
the variables that keep track of the number of physical read and write
operations for benchmarking purposes (m_ReadCount and m_WriteCount,
respectively). Then it resizes the SVectors that keep
track of the block buffers (m_Block), which quantum is in
each block buffer (m_BlockNumber), whether each block
buffer has been modified and therefore needs to be written back to the
disk (m_BufferModified), and the "time stamp" for each
block buffer that indicates how recently it has been accessed (m_TimeStamp).
We'll see how each of these variables is used in the program as we go
through the code.
The next few lines check whether the caller has supplied a non-null file name for the quantum file. If not, we bail out, because we need a valid file name to use when we open the file to store our data.
Assuming that we do have some sort of file name, we try to open the
file for reading, to see whether it is already there. If fopen
returns NULL, indicating that the file doesn't exist yet, we create the
file, clear the QuantumFileHeaderStruct structure called QFHS,
set the main_object_index_count and free_space_list_count
to their default values, and copy the current program_version_info
into the quantum_version_info structure element.
The next operation is to initialize the member variables of QFHS
that keep track of the offsets in the file where the main_object_index,
the free_space_list, and the quantum file data area
begin, so that these areas of the file can be located when we open the
file again.
Since we haven't created any main objects yet, our next task is to
clear the main object index, setting it to zeros (which indicates
available entries, as quantum number zero is not used), except for the
entry for object number zero, which is set to NoQuantum;
we will not allow a main object number zero, for better error control.
We also write the header block to the file at this point.
The last operation in the new file initialization is to set up the free space list to indicate that all quanta have the maximum available free space, except of course for quantum zero, which is not used.
Of course, the process of opening an already existing quantum file starts differently; we have to read in the quantum file header and check whether it was created by the same version of the program. The current program simply tests whether the version of the program that created the file is different from the one reading it and aborts if they are different. A production program would use this information to determine whether the versions were compatible, so that if we changed the program in a way that makes old files unusable, we could warn the user that the file is of an old type and should be converted. Alternatively, we might automatically convert it to a new format; in either case, it is essential that we know what version of our program created the file. After we have taken care of this detail, the rest of the code for opening either an old or a new file is the same.
We are almost done with the initialization, but there are a few
more data structures to handle. First, we have to calculate the block
numbers where the main object list, the free space list, and the data
area start. Then we initialize the member variables that keep track of
the current number of main objects and the number of elements in the
free space list. Then we initialize the SVectors of block
buffers, block numbers, modified flags, and time stamps. Next, we
initialize variables that will allow us to access the free space list
and main object list; we'll get into exactly how that works later.
Finally, if we have created a new quantum file (as opposed to opening one that already existed), we create a main object called "Directory", which we will use to keep track of the names of the main objects that the user will create.
Now let's take a look at Figure block.02,
which shows the next user function, QuantumFile::Flush.
This function is used to ensure that all data written to the quantum
file has actually been stored on the disk. It's a good idea to call Flush
whenever you have finished with a logical portion of your data
updating, to make sure that a power failure or other crash doesn't lose
data.
Flush function for the QuantumFile class
(from quantum\block.cpp) (Figure block.02)codelist/block.02
This isn't a very complicated function, but it does show how the m_BufferModified
SVector is used: Each buffer whose flag in that SVector
is TRUE is written to the disk via the Write
member function. Any buffer whose flag is not TRUE has
not been modified since it was last read from the disk, so there is no
reason to write it back to the disk. Avoiding unnecessary writes is
very helpful in reducing disk I/O and improving the speed of the
program. The Write function resets the flag for any
buffer it writes, so if we called Flush twice in a row,
no buffers would be written out on the second execution.12
The final function that we're going to cover in this class
at this point is Close (Figure block.03). As its name suggests, this
function closes the quantum file and takes care of all the housekeeping
required at that time. However, you generally won't call this function
explicitly, because it is called automatically when the QuantumFile
object is destroyed at the end of its scope, as you can see by looking
at the QuantumFile destructor (Figure block.04).
Close function for the QuantumFile class
(from quantum\block.cpp) (Figure block.03)codelist/block.03
QuantumFile class (from
quantum\block.cpp) (Figure block.04)codelist/block.04
Allowing the destructor to close the file automatically is easier and (more important) safer than closing the file explicitly, because it's not always easy to determine when all of the objects in the quantum file are inactive. Hence, I recommend allowing the compiler to take care of closing the file.
Now that we've taken a look at the member functions of the QuantumFile
class that user programs are concerned with, let's do the
same with the FlexArray class. In the
process, we'll start to delve into the underlying layers of the quantum
file implementation in more detail. Let's start with the interface for
the FlexArray , which is shown in
Figure newquant.00.
class
FlexArray class (from
quantum\newquant.h) (Figure newquant.00)codelist/newquant.00
As before, we'll postpone discussion of the member variables of
this class until we see how they are used in the member
functions. We'll start with the normal constructor for the FlexArray
, which is shown in Figure newquant.01.
class
FlexArray class (from
quantum\newquant.cpp) (Figure newquant.01)codelist/newquant.01
As you can see, this function simply calls an Open
function in the same class to do all of the actual work.
The purpose of this handoff is to allow the user to create a FlexArray
via the default FlexArray constructor before opening the
quantum file and fill in the details later via an explicit Open
call. Now that I've cleared that up, let's take a look at the function
that actually does the work, which is shown in Figure newquant.02.
Open function for the FlexArray class
(from quantum\newquant.cpp) (Figure newquant.02)codelist/newquant.02
The header for this function shouldn't be too alarming. The first
argument is simply the address of the quantum file object in which the FlexArray
will be created. The second argument is the name under which the FlexArray
will be stored in the quantum file's directory, which will make it
accessible to another program or a later run of this program. The ModifiableElement
type is just another name for a String, left over from a
previous incarnation of that type. The third argument is the initial
number of elements that the FlexArray will contain, and
the last argument is the maximum number of elements to which the FlexArray
will expand if necessary. If you look at the interface for this class,
you'll notice that these last two arguments have default values of 1
and UINT_MAX-1 elements, respectively. Therefore, if you
don't have any idea how large a FlexArray might need to
be, you can omit these arguments and it will expand as needed.
Now we're up to the first statement inside the function, which
appears innocent enough. It assigns a value to a member variable called
m_MOA, which stands for "Main Object Array". This member
variable can then be used to look up the name of the FlexArray
in the quantum file directory, as well as to access a number of other
attributes of the quantum file. But what is the type of m_MOA,
and how does it give us access to the quantum file attributes?
To answer these questions, we're going to have to get into some
fairly complex implementation and language issues. We'll start this
excursion with a look at the interface for the MainObjectArrayPtr
class, shown in Figure newquant.03.
MainObjectArrayPtr class (from
quantum\newquant.h) (Figure newquant.03)codelist/newquant.03
You will notice that this class has an embedded class
called MainObjectArray defined within it, and that the
embedded class has quite a few member functions. You may
also notice that the MainObjectArrayPtr class
itself has only a few member functions, most of which are the standard
"structural" member functions -- the normal and default constructors,
the destructor, and the assignment operator. One of the constructors is
the conversion function that constructs a MainObjectArrayPtr
object from a QuantumFile pointer (used in the FlexArray
Open function), and the others are the default constructor
and copy constructor. However, there is one member function that is
anything but standard: operator->. Why would we want
to redefine that operator, and what can a custom version of operator->
do?
Before we get through with this question, we will have entered a realm of C++ where relatively few have trespassed. I don't believe in adding complexity for the sake of complexity, so why would I use a feature of C++ that most C++ programmers have never dealt with?
Because we need this feature to improve the ease of use and reliability of programs using this quantum file access method implementation. A little history lesson is appropriate here so you can see that I'm not speaking from a solely theoretical perspective.
The first edition of this book included an implementation of the quantum file access method written in C.13 It had many of the features that the current implementation has, but it was much harder to use. The main problem was that the user program had to deal with memory allocation issues, including remembering to free memory for variables whose storage was assigned inside the quantum file code. In addition, the syntax of a quantum file "array" was much less convenient than dealing with a built-in array. It was obvious to me that resolving all of these issues so that the user could ignore the inner workings of the quantum file would greatly improve the usability of this algorithm.
Unfortunately, this isn't possible in C; however, a major impetus behind the creation of C++ was to make it possible for class library designers to provide such facilities. Since the quantum file access method would benefit greatly from such an approach, I decided to rewrite it in C++.
Why am I using the word "rewrite", rather than "convert", "translate", or other euphemisms? Judging from the number of ads for "C/C++ programmers" I see in the newspapers, some employers have the notion that switching to C++ can be accomplished by changing source file extensions to ".cpp" and recompiling. After removing some syntax errors revealed by the stricter type checking of the C++ compiler, the program compiles and runs as it did before. What's so difficult about object-oriented programming? If you are one of these employers, or you have tried the above experiment yourself and now believe that you are a "C++ programmer", let me break the news to you gently: all you have accomplished is to switch to the C subset of C++, which has nothing to do with object-oriented programming. Virtually none of the code or design from the original quantum file implementation has survived the transition to object orientation unscathed, despite the fact that, according to an expert on object-oriented design, that original C implementation was "object-oriented"!
However, my loss (if that's what it was) can be your gain: I am going to break a long-standing tradition of object-oriented design literature by disclosing not only the final design but also many of the missteps, errors, and difficulties that ensued from my original determination to tackle this rather complex project. So, with no further ado, let's begin with an often neglected concept which is at the core of the implementation of this project: operator overloading.
One of the most powerful mechanisms for hiding the details of
implementation from the class user in C++ is operator
overloading, which means defining (or in some cases redefining) the
semantics of one or more of the standard operators +, -, =,
and so forth, as they apply to class objects. For better
or worse, the semantics of these operators cannot be changed as they
apply to "intrinsic" data types, so the calculation 2+2
is always going to result in 4; this restriction is
probably necessary, as the potential confusion could be horrendous.14 A good example of the most common type of
use for this facility is the implementation of +, -,
etc., for manipulating Complex class
objects, following the rules of complex arithmetic. The ability to
provide this sort of intuitive operation to the number-crunching user
is starting to make C++ (with the proper class libraries,
of course) a viable alternative to FORTRAN, long the dominant language
for such users.
As usual, it's probably best to start with a relatively simple
example: in this case, we'll look at an example of overloading operator-.15 Consider the program in Figure overload1.
operator- (Figure
overload1)codelist/minus.00
As you can see, I've created a very simplified version of the
dreaded Point class, used in innumerable
textbooks to represent a point on the Euclidean plane. The result of
running our sample program is 5, which is the distance
between the Point (1,1) and the Point
(4,5), calculated by the normal Euclidean calculation:
Distance = sqrt(xdiff*xdiff + ydiff*ydiff)
where xdiff is the difference between the x
coordinates of the two Points, and ydiff is
the difference between their y coordinates; sqrt,
of course, represents the square root function.
The "magic" is in the definition of operator-, which
is the syntax for redefining an operator; in this case it's the
subtraction operator. When the compiler sees the expression x-y,
it looks for a definition of operator- that is specified
for class Point, taking an argument which
is also of class Point. Since there is such
a definition, the compiler generates a call to the code specified in
that definition. Had there not been such a definition, a syntax error
would have resulted, since the compiler doesn't have any built-in
knowledge of how to subtract two Points.
Operator overloading doesn't have to be terribly complicated, as
that example illustrates. However, there's a trick to the overloading
of operator->. When we overload other operators, such
as our earlier example of operator-, our code takes over
the implementation of the operator completely. That is, the result of
calling our operator is whatever we say it is; in the operator-
example, it's the double result of the Euclidean distance
formula. However, this is not true with operator->; in
that case alone, a different scheme is followed by the compiler. Figure
overload2 shows some code derived
from an early version of the project. How does the compiler interpret
it?16
operator-> overloading
(Figure overload2)codelist/danger.00
The line X->Get(1); will be compiled as follows:
1. Since X is not a pointer to any type, its class
definition is examined for a definition of operator->.
2. Since this definition is found, a call to that code is inserted in the function being compiled.
3. Then the return value of the operator-> code is
examined. Since it is a pointer type (BlockPtr *, to be
exact), the compiler continues by generating code to call BlockPtr->Get(int)
with this equal to the result returned by operator->,
which is the same BlockPtr that we started with.
So far, this is doing exactly what we wanted it to do. However,
there is one serious problem with this method of implementing operator->,
which is illustrated by what happens when the line Y->Get(1);
is compiled. Unlike our previous example, Y is a
pointer (to a BlockPtr); therefore, the compiler doesn't
look for a definition of operator->, but merely
generates code to call BlockPtr->Get(int) with this
equal to Y. As a result, our overloaded operator is never
called.
Much the same problem will occur when the line X.Get(1);
is compiled. The compiler is happy to generate a call to BlockPtr.Get(int)
with this being equal to the address of X;
again, though, our custom operator-> code won't be
executed.
The reason for this problem is that the compiler stops looking for operator->
overloading whenever an actual pointer is found. After all, if you
already have a pointer in a place where a pointer is needed, you must
have what you want! Obviously, we're going to have to come up with a
type-safe solution, so we can't accidentally use the wrong syntax and
end up with a nasty bug.17 The solution is
to have not one, but two classes involved in the virtual
memory mechanism, an "outer" (or handle) class
and an "inner" (or body) class.18
The handle class has two primary functions: the
initializing of a body class object and the overloading
of operator-> to return a pointer to that body class
object, which actually does the work. Figure overload3 presents this solution.
operator-> overloading
(Figure overload3)codelist/safe.00
One interesting thing about this program is that the main program
refers to the Get function in exactly the same way as it
did before: in fact, the only difference visible to the
user is that the compiler is now able to prevent him (and us) from
compiling the incorrect references as we did before.
class
However, there are some changes to the internals that deserve
comment. First of all, in the class definition, the
previous BlockPtr class has been renamed to
Block and made a nested class of a new BlockPtr
class. The reason for the renaming is to prevent
unnecessary changes to the user's code; that was successful, as noted
above. Another design decision whose justification is not quite as
obvious is the use of a nested class rather than a
freestanding class. Why is this appropriate?
Since the purpose of the nested Block class
is solely to add type safety, rather than to provide any visible
functionality, nesting it inside the BlockPtr class
reduces "name space pollution". That is, if the user needs a Block
class for some other purpose, its name won't conflict with
this one. Such potential conflicts are going to become more serious as
the use of class libraries increases; we should do our
part to "preserve the environment" when possible.19
The fact that the Block class is visible
only to members of its enclosing class, BlockPtr,
is responsible for the somewhat odd appearance of the member function
declarations in Figure overload3.
For example, what are we to make of the declaration "char
BlockPtr::Block::Get(int p_Index)"?
Of course, we C++ programmers are used to "qualified" function
names, such as Block::Get. But why do we need two
qualifiers? Because the compiler knows the meaning of Block
only in the context of BlockPtr; that's what prevents the
global name pollution that would ensue if Block were
declared outside of BlockPtr. This means that we could,
for example, have another class called Block
nested inside another enclosing class called NewBlockPtr.
If that Block class had a Get
member function, we could declare it as "char
NewBlockPtr::Block::Get(int p_Index);", and the compiler would
have no problem telling which Block::Get we meant.
Similarly, an unadorned "Block::Get(int p_Index);" would
be yet another completely distinct function.20
One more safety feature in this new arrangement is worthy of note: the
constructor for Block is protected, so that
the user can't accidentally create one and use it instead of
referencing it through a BlockPtr object the way we
intend. This isn't very likely anyway, since the user would have to
specify BlockPtr::Block as the type, which isn't easy to
do by accident; however, according to the principles of object-oriented
programming, it is best to hide the internals of our classes whenever
possible, so that class user code can't depend on those
internals. That restriction makes it possible for us to improve the
implementation of our classes without breaking user code.
Now that we've investigated the notion of redefining operator->,
let's return to the MainObjectArrayPtr class
that we were discussing a little while ago. We'll start with Figure newquant.04, which shows the code
for the normal constructor for that class.
MainObjectArrayPtr (from
quantum\newquant.cpp) (Figure newquant.04)codelist/newquant.04
This function is fairly simple. It creates a new MainObjectArray
object using the normal constructor for that class. This
is the "body" object that will actually do the work of both of these classes.
Then it sets the reference count of that object to 1,
indicating that there is one MainObjectArray "handle"
object using that "body" object.
The general idea of reference counting is fairly simple, as most great ideas are (after you understand them, at least). It's inefficient to copy a lot of data whenever we set one variable to the same value as another; on the other hand, copying a pointer to the data is much easier. However, we have to consider how we will know when we can delete the data being pointed to, which will be when no one needs the data anymore.
One way to share data safely in such a situation is to write the
constructor(s), destructor, and assignment operator to keep track of
the number of objects that are using a particular body object, and when
there are no more users of that object, to delete it.21
We'll see how this works in the case of the MainObjectArray
body class and its MainObjectArrayPtr
handle class; the implementation is much the same in the
other handle/body classes in this program.
However, right now we're more interested in how we can use operator->
in the FlexArray::Open function (Figure newquant.02). After creating a
member variable called m_MOA via the normal constructor
of the MainObjectArrayPtr class, the next
line in the Open function invokes MainObjectArrayPtr::operator->
(Figure newquant.06).
MainObjectArrayPtr::operator-> (from
quantum\newquant.cpp) (Figure newquant.06)codelist/newquant.06
Given the definition of operator-> in Figure newquant.06, how will the compiler
interpret the expression m_MOA->FindObjectByName,
which is used in the function FlexArray::Open?
1. Since m_MOA is not a pointer to any type, the
interface of its class (MainObjectArrayPtr)
is examined for a definition of operator->.
2. Since there is such a definition, a call to that code is inserted in the function being compiled.
3. Then the return value of the operator-> code is
examined. Since it is a pointer type (MainObjectArray *,
to be exact), the compiler continues by generating code to call MainObjectArray->FindObjectByName
with this equal to the result returned by operator->,
which is the address of the body object of the MainObjectArrayPtr
that we started with. Of course, the same analysis applies to the other
uses of the m_MOA object in the Open
function as well as the other functions that use handle objects.
Now let's continue with the analysis of the Open
function in the FlexArray class. The
statement we've just examined calls a function called FindObjectByName,
which returns an index into the main object array. If the name supplied
in the call was in fact the name of an existing object in the main
object array, then the returned value will be the element number of
that object in the main object array. On the other hand, if the
specified name was not found in the main object array, the returned
value will be the special value NoObject. In that case, Open
will use another member function of MainObjectArray,
namely FindAvailableObject, to locate a free element in
the main object array. Then it will call the CreateMainObject
function to create a new main object which will use that free element.
At this point in the function, we have a valid index into the main
object array for our FlexArray, whether it existed
previously or was just created. Finally, we retrieve the current and
maximum element counts for the FlexArray object.
Now that we have covered the Open function in the FlexArray
class, there's only one other function in that class
that could use a significant amount of explanation: the implementation
of operator []. However, I'm going to postpone coverage
of that function until the next chapter, where we'll be examining the
somewhat complex task of making one of our data types look as much like
an array as possible.
In the meantime, we should go over the rest of the member functions
of the MainObjectArrayPtr class, so you can
see how this typical handle class works.
MainObjectArrayPtr classWe'll start with the default constructor (Figure newquant.07), which is pretty
simple. In fact, it's exactly like the normal constructor (Figure newquant.04), except that it uses
the default constructor of its body class to create its
body object rather than the normal constructor of the body class.
MainObjectArrayPtr (from
quantum\newquant.cpp) (Figure newquant.07)codelist/newquant.07
MainObjectArrayPtr Copy ConstructorThe copy constructor (Figure newquant.08) isn't much more complicated. It copies the pointer to the main object array (body) object from the existing object to the same member variable in the newly created object. Then it increments the reference count because there is now one more user of that body object.
MainObjectArrayPtr (from
quantum\newquant.cpp) (Figure newquant.08)codelist/newquant.08
MainObjectArrayPtr::operator = FunctionThe next function, MainObjectArrayPtr::operator =
(Figure newquant.09), is a bit
more complicated than the previous ones, because it has to account for
the reference counts in both the current object and the right-hand
object being copied from.
MainObjectArrayPtr::operator= (from
quantum\newquant.cpp) (Figure newquant.09)codelist/newquant.09
We start by decrementing the reference count of the current body object, because we won't be pointing to it after this operation (unless the right-hand object happens to point to the same body object as ours does). Then we check whether the reference count is less than or equal to 0, in which case we can (and will) delete the current body object.22 After taking care of that issue, we copy the right-hand object's main object array pointer to the corresponding member variable in our object and increment the reference count for that main object array. Finally, we return our object to the caller.
MainObjectArrayPtr DestructorThe final function in the MainObjectArrayPtr class
is the destructor (Figure newquant.10).
MainObjectArrayPtr class (from
quantum\newquant.cpp) (Figure newquant.10)codelist/newquant.10
This shouldn't seem at all mysterious after our analysis of the operator
= function, as its operation is basically a subset of what that
function does. That is, it decrements the reference count of its body
object; if the resulting reference count is less than or equal to 0, it
deletes the body object.
Now that we're through with the MainObjectArrayPtr class,
we'll examine its body class, MainObjectArray,
so we can see how this fundamental data type of the quantum file
algorithm works.23
MainObjectArrayPtr::MainObjectArray classWe'll skip over the default constructor and destructor, as these
functions don't do anything particularly interesting. However, the
normal constructor for this class does some
initialization that we should pay attention to, as shown in Figure newquant.11.
MainObjectArray class
(from quantum\newquant.cpp) (Figure newquant.11)codelist/newquant.11
As you can see, this constructor begins by checking whether its
argument is non-null; if so, it continues by initializing a member
variable to the quantum file pointer provided by that argument. Then it
uses that quantum file pointer to retrieve the number of main objects
in the quantum file as well as the number of blocks that the main
object list occupies. Then it resizes its block pointer list to that
count and initializes each of the elements in the list to the
appropriate block pointer. We'll see exactly how those block pointers
work when we get back to the QuantumFile class,
but for now we'll just assume that they allow access to the individual
blocks that make up the main object list.
MainObjectArray::Get FunctionNow let's take a look at the Get function in this class
(Figure newquant.12).
MainObjectArray::Get function (from
quantum\newquant.cpp) (Figure newquant.12)codelist/newquant.12
This function is reasonably straightforward. It begins by checking
the index of the element in the main object array that the caller wants
to retrieve. If it is out of range, the return value is NoObject,
indicating an error. However, assuming that the index is valid, the
function calculates which block and element within that block
corresponds to the requested entry. It then retrieves that element via
the MainObjectBlock::Get function, which is called for
the m_BlockPtr variable for that block.24
Finally, the result of the Get function, which is the
quantum number of the big pointer quantum of the object in question, is
returned to the calling function.
MainObjectArray::Set FunctionThe next function we'll cover, MainObjectArray::Set
(Figure newquant.13), is the
counterpart to Get.
MainObjectArray::Set function (from
quantum\newquant.cpp) (Figure newquant.13)codelist/newquant.13
This function takes two arguments, the element number in the main
object array and the quantum number of that object's big pointer
quantum. After checking that the element number is valid, it calculates
the block number and element number in which the desired entry will be
found, then calls the MainObjectBlock::Set function for
the m_BlockPtr variable of the appropriate block in the
main object array. Finally, if the entry being stored is equal to NoObject
and the element number of this element is less than the current "lowest
free object", then we update that variable to indicate that we should
start looking at this place in the main object array the next time we
are searching for a free object.
MainObjectArray::FindAvailableObject FunctionThe next function we'll cover, MainObjectArray::FindAvailableObject
(Figure newquant.14), is used to
find an available entry in the main object array when we want to create
a new object.
MainObjectArray::FindAvailableObject function
(from quantum\newquant.cpp) (Figure newquant.14)codelist/newquant.14
This isn't terribly complicated either. It searches the main object
list block by block, starting with the entry corresponding to the last
known value of the "lowest free object" variable and continuing until
it finds an empty entry (i.e., one whose value is NoObject).
Once it finds such an entry, it calculates the corresponding element
number in the main object array and returns it as the result after
storing it in the "lowest free object" variable.25
MainObjectArray::GetModifiableElement FunctionThe next function we'll cover, MainObjectArray::GetModifiableElement
(Figure newquant.15), is used to
retrieve a ModifiableElement value from a main object.
MainObjectArray::GetModifiableElement function
(from quantum\newquant.cpp) (Figure newquant.15)codelist/newquant.15
Among its many uses, GetModifiableElement is the
function that a FlexArray uses to retrieve values, so it
is quite important to the functioning of our test program. Although it
is somewhat longer than most of the other functions we've seen so far,
it's really not too complicated, and it will serve as a good
introduction to the more complex functions in this class.
We start by using the Get member function to look up
the big pointer array quantum number for the object in which the
element is contained. Once we have that quantum number, we call the
function QuantumFile::MakeBigPointerBlockPtr to create a
handle object that will allow us to access that quantum as a big
pointer array. Next, we calculate which big pointer array entry
contains the quantum number of the little pointer block that contains
the element reference that we'll use to access the actual data. Once we
know that, we can call the BigPointerBlock::GetBigArrayElement
function to retrieve the quantum number for that little pointer block.
Then we call the function QuantumFile::MakeLittlePointerBlockPtr
to create a handle object that will allow us to access that quantum as
a little pointer array. Once we have done that, we calculate which
element in the little pointer array is the element reference to the
element we are interested in, and retrieve that element reference from
the little pointer array.
Now we're ready to access the element itself, assuming that it
actually exists. First, though, we have to account for the possibility
that it is a null item, that is, one that contains no data. If this is
the case, we return an empty (i.e., default-constructed) ModifiableElement.
Why have I handled this case separately, rather than storing a
zero-length ModifiableElement in the leaf block?
A null item is an optimization: we could treat items containing no
data like all other items. However, special handling for this case is
worthwhile, since it is very common. For example, a word processor
might store each line of a file as a string, removing the trailing null
from each string before storing it; with that approach, strings
consisting only of a trailing null would be stored as null items. The
advantage of using null items is that, rather than occupying an item
index slot indicating a length of zero, they are represented by item
references that contain a special quantum number called NO_QUANTUM
and a special item number called NULL_ITEM_NUMBER,
neither of which can appear in a real item reference. This means that a
null item does not occupy any space in a item index, and that we can
avoid loading a quantum to retrieve its value.
However, let us suppose that the element reference is not null. In
that case, we call QuantumFile::MakeLeafPointerBlockPtr
to create a handle object that will allow us to access items in the
quantum containing the desired item. Then we use QuantumBlock::GetModifiableItem
to retrieve the actual item and return it to the caller.
MainObjectArray::PutElement FunctionThe next function we'll cover, MainObjectArray::PutElement,
shown in Figure newquant.16, is
used to store a ModifiableElement value into a main
object.
MainObjectArray::PutElement function (from
quantum\newquant.cpp) (Figure newquant.16)codelist/newquant.16
This starts out by using a preprocessor macro called qfassert
to check whether the size of the item to be stored is legal, i.e., that
it is less than the maximum allowable item size.26
After that, the code is very similar to its counterpart, GetModifiableElement,
until we retrieve the item reference from the little pointer array.
Then, if the item reference is not null, we call QuantumBlock::DeleteItem
to delete the current contents of that element. Then we check whether
the size of the element to be stored is zero; if so, we store a null
reference in the little pointer array (as noted previously).
On the other hand, if we have some data to be stored, we have to
find a place to store it. The most logical place to look is in the
quantum that we have most recently used to store something for this
object, and this information is therefore maintained by the big pointer
block access code in the BigPointerBlock class.27 Therefore, we check whether there may be
enough free space in that quantum by calling QuantumFile::QGetFreeSpace.28 Because a free space code is only one
byte (to reduce the size of the free space list), it cannot exactly
represent the amount of free space in a quantum. Therefore, if the free
space list entry indicates that this quantum may have enough space for
the item, we have to retrieve the quantum via QuantumFile::MakeLeafBlockPtr
and then call QuantumBlock::CalculateFreeSpace to
calculate its free space more precisely. If the actual free space is
larger than the size of the item, then we have found the quantum we
will use to store the item.
However, if that quantum does not have enough space to store the
item, then we will have to search for one that does. That is handled by
QuantumFile::FindSpaceForItem, which returns the number of
the first quantum that fits the bill.
At this point, we have a quantum in which to store the item.
Therefore, after setting the "last quantum added to" member variable,
we set the quantum number in the element reference to that quantum
number. Then we call QuantumBlock::AddItem to add the
item to that quantum, and set the item number in the element reference
to the value returned from that function. Finally, we update the little
array pointer that points to the newly added element.
MainObjectArray::GetMainObjectElementCount
FunctionThe next function, MainObjectArray::GetMainObjectElementCount
(Figure newquant.17), is used to
retrieve the number of elements that a main object currently contains.
MainObjectArray::GetMainObjectElementCount
function (from quantum\newquant.cpp) (Figure
newquant.17)codelist/newquant.17
You might think that this would be a trivial "get" function that merely returns the value of a member variable, but that is not the case. The reason is that this information is stored in the quantum file itself, as part of the big pointer array, because it needs to be maintained between executions of this program (or any other program that might want to use the file).
However, it's not too complicated. After using the Get
function to retrieve the quantum number of the big pointer array for
the object, it uses QuantumFile::MakeBigPointerBlockPtr
to provide access to the big pointer array. Then it calls BigPointerBlock::GetBigArrayElementCount
to retrieve the value to be returned.
MainObjectArray::GetMainObjectMaxElementCount
FunctionThe next function, MainObjectArray::GetMainObjectMaxElementCount
(Figure newquant.18), is used to
retrieve the maximum number of elements that a main object is permitted
to contain. Because it is exactly like the previous function except for
the function it calls in the BigPointerBlock class,
I'll just list it without further comment.
MainObjectArray::GetMainObjectMaxElementCount
function (from quantum\newquant.cpp) (Figure
newquant.18)codelist/newquant.18
MainObjectArray::CreateMainObject FunctionThe next function we'll cover, MainObjectArray::CreateMainObject,
shown in Figure newquant.19, is
used to create a new main object.
MainObjectArray::CreateMainObject function (from
quantum\newquant.cpp) (Figure newquant.19)codelist/newquant.19
This function is somewhat complex, but that should be
understandable because it takes quite a bit of work to create a new
main object. We start off by checking whether the user has asked to
create a zero-length object. Such an object would cause complications
in the way that the size of the little pointer array is calculated;
namely, the calculation of the number of blocks needed to hold the
little pointer arrays (LittlePointerBlockCount) produces
correct results for all values of the element count that are greater
than 0 but produces a nonsensical result when the element count is 0.
Because arrays can grow in size, a zero-element array may be a
reasonable thing to create, so I didn't want to disallow them; the
simplest solution, and the one I adopted, was to treat a request to
create a zero-element array as a request for a one-element array.
Because a zero-length object isn't very useful until it has been
resized, this isn't likely to surprise the user.
Once that detail has been handled, we compute a free space entry
indicating that this block is no longer available for use. Then we
continue by calling QuantumFile::FindEmptyBlock to find a
block for the big pointer array, then calling QuantumFile::MakeBigPointerBlockPtr
to allow access to that block as a big pointer block.
Now we are ready to begin setting up the big pointer block for use.
We start by calling BigPointerBlock::Clear to erase any
previous data in the block.29 Next, we
call BigPointerBlock::SetQuantumType to set the type of
the quantum to indicate that it contains a big pointer array; this
information is used for consistency checking when accessing a quantum
that is supposed to be a big pointer array quantum.30
Then we call BigPointerBlock::SetMainObjectNumber to set
the entry in the big pointer block that indicates which main object it
belongs to. This is needed when we are updating the free space list,
because each free space entry includes the number of the main object
for that quantum. This is relevant because we do not share a quantum
among several main objects, to simplify the task of programs that
recover as much data as possible from a quantum file that has become
corrupted.
The next step is to set up the header for the big pointer array.
This contains the current element count, the maximum element count, and
the last quantum that we've added an element to; of course, the latter
variable is initialized to NoQuantum, as we have not yet
added any data to this new object. Once we have set the header up, we
call QuantumBlock::AddItem to add it to the big pointer
block, so that it will be accessible the next time we use the big
pointer array.
Now it's time to initialize the little pointer array for this object and the big pointer array that allows us to access the little pointer array. To start this process, we calculate the number of little pointer blocks that we will need, which is dependent on the number of elements in the object and the number of item references that fit in one block. Once we have determined that number, we can create a big pointer array containing that number of quantum numbers, and a temporary little pointer array that we will use to initialize each block in the little pointer array.
We're up to the loop that initializes all of the blocks in the
little pointer array. It begins by calling QuantumFile::FindEmptyBlock
to find an empty block that we can use for the little pointer array.
Then we call QuantumFile::SetFreeSpace to reserve this
quantum for the current main object and mark it as full.31
Next, we set the current element of the big pointer array to the
quantum number of the quantum where we are storing the current little
pointer array block.
Now we're ready to initialize the little pointer block. We start by
calling QuantumFile::MakeLittlePointerBlockPtr to allow
us to treat the current quantum as a little pointer block. Once we have
done that, we call LittlePointerBlock::Clear to
initialize the quantum to an unused state, LittlePointerBlock::SetQuantumType
to set the quantum type to "little pointer array" (for consistency
checking while retrieving data), LittlePointerBlock::SetMainObjectNumber
to allow us to update the free space list when adding data to this
quantum, QuantumBlock::AddItem to add a section of the
little pointer array to this quantum, and LittlePointerBlock::ClearLittleArray
to initialize that section of the little pointer array to null item
references.
One detail that might not be immediately obvious is why we have to
call QuantumFile::SetFreeSpace again at the end of the
loop. The reason is that the QuantumBlock::AddItem
function recalculates the free space for the block to which it adds the
item. Since we want to prevent any further data from being added to
this block, we have to reset the free space manually (to "none
available") to indicate that this block is full.
Once we have initialized all of the little pointer array sections,
we're ready to finish setting up the big pointer array for use. The
first step in that process is to add the big pointer array data we have
just filled in to the big pointer quantum, via QuantumBlock::AddItem.
Then we use QuantumFile::SetFreeSpace to reserve the big
pointer quantum for our new main object and mark it full. Then we use
the member function Set to set the big pointer array
quantum number for our new main object. Finally, we call the member
function PutElement to set up the directory entry for the
new main object.
MainObjectArray::FindObjectByName FunctionThe next function we'll cover, MainObjectArray::FindObjectByName
(Figure newquant.20), is used to
look up a main object in the "directory".
MainObjectArray::FindObjectByName function (from
quantum\newquant.cpp) (Figure newquant.20)codelist/newquant.20
This one is fairly simple. It steps through the elements of the
main object directory, looking for an element whose contents are equal
to the argument to the function. If it finds such an element, it
returns the index of that element, which is the object number of the
object with that name. On the other hand, if it gets to the end of the
main object directory without finding a match, it returns the value NoObject.
MainObjectArray::GrowMainObject FunctionThe next function, MainObjectArray::GrowMainObject
(Figure newquant.21), is used to
increase the size of a main object when needed to accommodate more
elements.
MainObjectArray::GrowMainObject function (from
quantum\newquant.cpp) (Figure newquant.21)codelist/newquant.21
This is about as complicated as CreateMainObject.
However, because it is quite similar to that function, we'll be able to
shorten the explanation considerably. We start by retrieving the
existing big pointer block for the object. Then we copy the big pointer
array from that block to a new big pointer array. Next, we calculate
the size of the new little pointer array needed to hold references to
all the elements in the new, enlarged, object.
The next few lines of code might be a little mysterious if I don't
explain a few details of the size calculation. To simplify the
accounting needed to keep track of the sizes of all of the little
pointer arrays, I've decided always to allocate a little pointer array
to the largest size that will fit into a quantum. However, setting the
accessible size of an array to the actual number of elements allocated
for its little pointer arrays has the drawback that the user wouldn't
be able to find out when the program accessed an element that is within
the allocated size but outside the number of elements specified when
creating the array. Therefore, I've crafted a compromise: when an array
is created, I set the number of accessible elements to the number the
user specifies32; however, if the array is
increased in size, the new size is always equal to the total allocation
for all little pointer arrays. This means that if an array is expanded
by a reference to an element that was already allocated but was outside
the specified array size, GrowMainObject simply resets
the size to the number of elements contained in all little pointer
arrays already extant and returns that value. Otherwise, a new little
pointer array is allocated, and the big pointer array is updated to
reflect the change.
Assuming that we actually do need to allocate some new little pointer blocks, we continue by increasing the size of the big pointer array to accommodate the new number of little pointer blocks. Then we allocate and initialize each of the new little pointer blocks in exactly the same way as we allocated and initialized the original little pointer blocks when we created the object. Then we delete the old big pointer array from the big pointer block, add the new big pointer array to the big pointer block, set the free space entry for the big pointer block to indicate that it is full, and return the new element count to the caller.
Now that we have reached the end of the discussion of the MainObjectArray
class, we'll return to the QuantumFile class.
This time, we're going to get into the member functions that are used
in the implementation, since we've already dealt with those that are
used by outside programs.
QuantumFile classWe'll skip over the QuantumFile member functions that
fall in three categories:
1. Those that merely compute the number of blocks occupied by a
segment of the quantum file (i.e., GetMainObjectBlockCount
and QGetFreeSpaceListBlockCount).
2. Those that merely return a block of the appropriate type to be
accessed as a little pointer block, big pointer block, and the like
(i.e., MakeFreeSpaceBlockPtr, MakeMainObjectBlockPtr,
MakeLittlePointerBlockPtr, MakeBigPointerBlockPtr,
MakeLeafBlockPtr).
3. Those that merely call member functions of the FreeSpaceArray
class (i.e., SetFreeSpace, QGetFreeSpace,
FindEmptyBlock, FindSpaceForItem).
We'll see how the functions in the third category work when we examine the underlying functions that they call.
QuantumFile::SetModified FunctionHowever, that still leaves us with several functions to discuss in
the QuantumFile class. We'll start with QuantumFile::SetModified,
which is shown in Figure block.06.
QuantumFile::SetModified function (from
quantum\block.cpp) (Figure block.06)codelist/block.06
This is actually a fairly simple function. It calls a global
function called FindBuffer to look up the specified
quantum number in the list of quantum buffers.33
If the quantum with that quantum number is in fact in one of the
quantum buffers, then the "buffer modified" flag for that buffer is
set, and the function returns. The only complication is a consistency
check in the form of an "assert" that the quantum in question is
actually in one of the buffers. If that is not the case, then we have a
serious problem, because the program should not be trying to set the
modified flag for a quantum that isn't in memory: that indicates that
the buffer that contained the quantum has been reused too soon.
QuantumFile::Position FunctionThe next function we will examine is QuantumFile::Position,
which is shown in Figure block.07.
QuantumFile::Position function (from
quantum\block.cpp) (Figure block.07)codelist/block.07
This function is used by the Read and Write
functions to position the file pointer to the correct place for reading
or writing a quantum. This is quite simple because every block is the
same size. However, it is important to remember that the quantum file
contains data other than the quanta themselves, so the block number of
a quantum is not the same as its quantum number. To convert between
these two numbers is the responsibility of the function that computes
the block number supplied to Position.
QuantumFile::Read FunctionThe next function we will examine is QuantumFile::Read,
which is shown in Figure block.08.
QuantumFile::Read function (from
quantum\block.cpp) (Figure block.08)codelist/block.08
After positioning the file pointer to the location in the file where this block is stored, we read it into the buffer. Then we set the modified flag for the buffer to FALSE, indicating that the disk and memory versions of the block are the same. Finally, we increment the read count, which is used for performance analysis.
QuantumFile::Write FunctionThe next function we will examine is QuantumFile::Write,
which is shown in Figure block.09.
This is identical to the Read function, except of course
that it writes the data to the file rather than reading it.
QuantumFile::Write function (from
quantum\block.cpp) (Figure block.09)codelist/block.09
QuantumFile::MakeBlockResident FunctionThe last function in this class is QuantumFile::MakeBlockResident.
This is where we rejoin the thread of discussion about operator->.
You see, MakeBlockResident is called during the execution
of operator-> for any of the BlockPtr classes.34 Figure mbr1
shows an early implementation of this routine.
codelist/mbr1.00
This is a fairly straightforward implementation of a least recently
used (LRU) priority algorithm, but there are a few wrinkles to reduce
overhead. First, we use the reference parameter p_OldBufferNumber
to retrieve the buffer number (if any) in which the searched-for block
was located on our last attempt. If the number of the block held in
that buffer matches the one we're looking for, we will be able to avoid
the overhead of searching the entire buffer list. The reason that p_OldBufferNumber
is a reference parameter is so that we can update the caller's copy
when we locate the block in a different buffer; that way, the next time
that MakeBlockResident is called to retrieve the same
block's address, we can check that buffer number first.
In order to make this work, we can't implement the LRU priority
list by moving the most recently used block to the front of the block
list; the saved buffer number would be useless if the blocks moved
around in the list every time a block other than the most recently used
one was referenced. Instead, each slot has an attached timestamp,
updated by calling the time function every time the
corresponding block is referenced. When we need to free up a buffer, we
select the one with the lowest (i.e., oldest) timestamp. If that buffer
has the "modified" attribute set, then it has been updated in memory
and needs to be written back to the disk before being reused, so we do
that. Then we read the new block into the buffer, update the caller's p_OldBufferNumber
for next time, and return the address of the buffer.
This seems simple enough, without much scope for vast improvements
in efficiency; at least, it seemed that way to me, until I ran Turbo
ProfilerTM on it and discovered that it was horrendously
inefficient. The profiler indicated that 73% of the CPU time of my test
program was accounted for by calls to the time routine!
To add insult to injury, the timing resolution of that routine is quite
coarse (approximately 18 msec), with the result that most of the blocks
would often get the same timestamp when running the program normally.35 Upon consideration, I realized that the
best possible "timestamp" would be a simple count of the number of
times that MakeBlockResident was called. This would
entail almost no overhead and would be of exactly the correct
resolution to decide which block was least recently used; this is the
mechanism used in the current version.
One interesting consideration in the original design of this
mechanism was what size of counter to use for the timestamp. At first,
it seemed necessary (and sufficient) to use a 32-bit type such as an unsigned
long, which would allow 4 billion accesses before the counter
would "turn over" to 0. However, because the original implementation
was compiled with a 16-bit compiler, the natural size to use would be
an unsigned (meaning unsigned short)
variable. Therefore, I had to decide whether a longer type was
necessary.
After some thought, I decided that it wasn't. The question is: what happens when the counter turns over? To figure this out, let's do a thought experiment, using two-byte counters. Suppose that the priority list looks like Figure timestamp1, with the latest timestamp being 65535.
Block number Timestamp
Let's suppose that the next reference is to block 9, with the counter turning over to 0. The list will now look like Figure timestamp2.
Block number Timestamp
Although the preceding analysis was good enough to convince me not to worry about the counter turning over, unfortunately it wasn't good enough to convince the machine. What actually happened was that after a large number of block references, the program started to run very slowly. I was right that the data wasn't in danger, but performance suffered greatly. Why? Let's go back to our example. Which buffer would be replaced when the next block needed to be read in? The one currently holding block 9, since it has the "lowest" priority. If that block number happened to be 32, for example, that would leave us with the arrangement in Figure timestamp3.
Block number Timestamp
The acid test of this virtual memory mechanism
is to run the program with only one block buffer; unsurprisingly, this
test revealed a nasty bug. It seems that I had neglected to initialize
the value of the EarliestStamp variable, used to keep
track of the buffer with the earliest timestamp. When running with only
one buffer, it was possible under some circumstances for a block to be
replaced before it was ever used; when this happened, the timestamp on
the buffer it had occupied was left set to its initial value of ULONG_MAX.
This initial value was significant because the search for the earliest
timestamp also starts out by setting the TimeStamp
variable to ULONG_MAX, which should be greater than any
timestamp found in the search. If there were no blocks in the list with
a "real" timestamp, the conditional statement that set the EarliestStamp
value in the search loop was never executed. As a result, the EarliestStamp
variable was left in an uninitialized state, which caused a wild access
to a nonexistent block buffer. The fix was to initialize EarliestStamp
to 0, so the first buffer will be selected under these circumstances;
you can see this implemented in the current version of MakeBlockResident
(Figure block.05).
QuantumFile::MakeBlockResident function
MakeBlockResident function (from
quantum\block.cpp) (Figure block.05)codelist/block.05
We've already covered most of the tricky parts of this function, but
let's go over it one more time just to be on the safe side. We begin by
incrementing the counter that serves as a timestamp; if it turns over
to 0, we set all of the timestamps to 0 to prevent the newest block
from being replaced continually.36 Then we
look up the buffer number for the quantum in question, via the FindBuffer
global function. If the quantum is found in a buffer, then we merely
reset the timestamp on that buffer to indicate that it is the most
recently used, and return a pointer to that buffer for use by the operator->
function that called MakeBlockResident.
However, in the event that the quantum we want isn't yet in a buffer, then we have to figure out which buffer it should be read into. We do this by examining the timestamp for each buffer, and selecting the buffer with the earliest timestamp.
Once we have found the buffer that we are going to reuse, we check
whether it has been modified; if so, we write it back to the disk. Then
we set its block number to the block number of the new quantum that is
being read in, set its modified flag to FALSE, set the timestamp on the
buffer to the current timestamp, and read the quantum into the buffer.
Finally, we return a pointer to the buffer for use by the calling operator->
function.
Now that we've seen how MakeBlockResident works, let's
see how it is used in one of the "BlockPtr" classes.
Because all of these classes work in much the same way,
we will examine only one of them in any detail. I've chosen LittlePointerBlockPtr
because it made for a better section title, but any of them would have
done just as well. First, Figure blocki.00
shows the interface for this class.
LittlePointerBlockPtr class
(from quantum\blocki.h) (Figure blocki.00)codelist/blocki.00
As you can see, this interface isn't very large. In fact, with only one member variable and three functions, it's the smallest one we'll see.37 However, that doesn't mean that it is completely without interest, as we'll discover. Let's start with the normal constructor, shown in Figure block.10.
LittlePointerBlockPtr class
(from quantum\block.cpp) (Figure block.10)codelist/block.10
This constructor is simple but unrevealing: to see the effects of
the variables it is initializing, we will have to wait until we examine
the LittlePointerBlock class. For now, let
me just say that the block number is the number of the block where the
little pointer block is stored, and the block manager is a pointer to
the QuantumFile object that controls the quantum file
where the little pointer block is stored. The block variable is a
pointer to the buffer where the block is stored in memory; it is
assigned a value in the next function we will examine, operator->
(Figure block.11).
LittlePointerBlockPtr::operator-> Function
LittlePointerBlockPtr::operator-> function
(from quantum\block.cpp) (Figure block.11)codelist/block.11
As we've seen in the prior discussion of operator->,
the return value of this function will be used to point to the object
for which the final function will be executed. In this case, that
object will be the little pointer block object embedded in this LittlePointerBlockPtr
object. The main purpose of this operator-> function
is to make sure that the block that the little pointer block object
refers to is in memory before we try to do anything with it. That's why
QuantumFile::MakeBlockResident exists, and that's why we
call it here. Once that function returns, we know that the little
pointer block is resident in memory. At that point, we set the item
index member variable to the address of the item index in the block,
call LittlePointerBlock::SetLittlePointerArray to allow
access to the little pointer array in the block, and return the address
of our little pointer block object.
When the function that called our operator->
resumes execution, it will reapply the same function call to the
pointer that we return. This time, because the return value is in fact
a pointer rather than an object, the function call will go through to
its final destination.38 This would seem
to be a logical time to start our examination of how the body class,
LittlePointerBlock, works. However, before we get to that class,
we have quite a bit of ground to cover. You see, the block classes
are arranged in an inheritance tree that begins with a very simple
structure called Block, and builds in complexity from
there. The best place to start is at the beginning, so that's what
we'll do.
Figure blocki.01 shows the
definition of the Block union.
Block union (from
quantum\blocki.h) (Figure blocki.01)codelist/blocki.01
I don't use the union type very much. In fact, this Block
type is the only example of a union in this book. That's
because they can be the source of subtle errors if you don't keep track
of the actual type of the data in a union very carefully.
However, in this case it came in very handy because the different types
of blocks have to be identified at run-time anyway; putting the logic
that discriminates among the various types of blocks into the various
block classes should make it reasonably safe, and in fact
I haven't seen any problems from this implementation decision.
So what are all the types of data that can be stored in a Block?
The simplest is an array of char taking up the entire
block; this low-level type is used when we are copying a block or
otherwise treating it as simply a bunch of undifferentiated data. The
next possibility, a 64-element array of char, is used in
the header block of a quantum file, and consists of a 64 character
"version information string". This is used to determine whether the
version of the quantum file implementation that wrote the file is the
same as the current implementation, as described earlier. The next type
is an array of QFreeSpaceEntry elements, which is used in
the free space list classes. The next possible type is an
array of MainObjectEntry elements, which is used in the
main object classes. Finally, a Block can
contain a QuantumBlockStruct, which is the data type used
to access blocks that contain normal quanta.
QuantumBlockStruct structThe last of these types is actually a bit more complicated than it
looks, as we'll see when we examine the definition of QuantumBlockStruct,
shown in Figure blocki.02.
QuantumBlockStruct struct (from quantum\blocki.h) (Figure blocki.02)codelist/blocki.02
What makes this apparently innocent QuantumBlockStruct
data type more mysterious than it looks? The clue is in the comment on
the m_ItemIndex variable: this is actually a place holder
for a variable-length array of item index elements. Although you cannot
directly declare a variable-length array in C or C++, it is possible to
use such a place holder to simplify the calculation of the beginning
address of a variable-length array that is "mapped" onto a buffer, as
we do when accessing data in a quantum block. We'll see how this works
when we get to the first of the block classes that is
used for accessing data in the normal quantum format rather than as one
of the other types (e.g., free space entry or main object entries).
QuantumBlockHeader structWe should also take a look at the definition of QuantumBlockHeader,
shown in Figure blocki.03.
QuantumBlockHeader struct (from quantum\blocki.h) (Figure blocki.03)codelist/blocki.03
This struct represents the data that is present at
the beginning of every block that holds a normal quantum. It consists
of three fields: the quantum type field, the object number field, and
the item count field.
The quantum type field contains a value indicating the type of the quantum, such as "little pointer array", "big pointer array", or "leaf node". This information could be used to allow a deeper hierarchy of pointer types, but currently it is used only for consistency checking. That is, the top-level quantum pointed to by a main object entry must be a big pointer array quantum, the quantum pointed to by a big pointer array entry must be a little pointer array quantum, and the quantum pointed to by a little pointer entry must be a leaf node (i.e., one that contains user data).
The object number field contains the number of the main object to which this quantum belongs. This is used when we update the free space list, because we do not share a quantum among more than one main object.
The item count field represents the number of data items in this
quantum, which is also the number of entries in the variable-length m_ItemIndex
array.
BaseBlockPtr classNow that w