Chapter 9. Data Structures

Perl provides for free many of the data structures that you have to build yourself in other programming languages. The stacks and queues that budding computer scientists learn about are both just arrays in Perl. When you push and pop (or shift and unshift) an array, it's a stack; when you push and shift (or unshift and pop) an array, it's a queue. And many of the tree structures in the world are built only to provide fast, dynamic access to a conceptually flat lookup table. Hashes, of course, are built into Perl, and provide fast, dynamic access to a conceptually flat lookup table, only without the mind-numbingly recursive data structures that are claimed to be beautiful by people whose minds have been suitably numbed already.

But sometimes you want nested data structures because they most naturally model the problem you're trying to solve. So Perl lets you combine and nest arrays and hashes to create arbitrarily complex data structures. Properly applied, they can be used to create linked lists, binary trees, heaps, B-trees, sets, graphs, and anything else you can devise. See Mastering Algorithms with Perl (O'Reilly, 1999), the Perl Cookbook (O'Reilly, 1998), or CPAN, the central repository for all such modules. But simple combinations of arrays and hashes may be all you ever need, so they're what we'll talk about in this chapter.

9.1. Arrays of Arrays

There are many kinds of nested data structures. The simplest kind to build is an array of arrays, also called a two-dimensional array or a matrix. (The obvious generalization applies: an array of arrays of arrays is a three-dimensional array, and so on for higher dimensions.) It's reasonably easy to understand, and nearly everything that applies here will also be applicable to the fancier data structures that we'll explore in subsequent sections.

9.1.1. Creating and Accessing a Two-Dimensional Array

Here's how to put together a two-dimensional array:

# Assign a list of array references to an array.
@AoA = (
         [ "fred", "barney" ],
         [ "george", "jane", "elroy" ],
         [ "homer", "marge", "bart" ],
);

print $AoA[2][1];   # prints "marge"

The overall list is enclosed by parentheses, not brackets, because you're assigning a list and not a reference. If you wanted a reference to an array instead, you'd use brackets:

# Create an reference to an array of array references.
$ref_to_AoA = [
    [ "fred", "barney", "pebbles", "bamm bamm", "dino", ],
    [ "homer", "bart", "marge", "maggie", ],
    [ "george", "jane", "elroy", "judy", ],
];

print $ref_to_AoA->[2][3];   # prints "judy"

Remember that there is an implied -> between every pair of adjacent braces or brackets. Therefore these two lines:

$AoA[2][3]
$ref_to_AoA->[2][3]

are equivalent to these two lines:

$AoA[2]->[3]
$ref_to_AoA->[2]->[3]

There is, however, no implied -> before the first pair of brackets, which is why the dereference of $ref_to_AoA requires the initial ->. Also remember that you can count backward from the end of an array with a negative index, so:

$AoA[0][-2]

is the next-to-last element of the first row.

9.1.2. Growing Your Own

Those big list assignments are well and good for creating a fixed data structure, but what if you want to calculate each element on the fly, or otherwise build the structure piecemeal?

Let's read in a data structure from a file. We'll assume that it's a plain text file, where each line is a row of the structure, and each line consists of elements delimited by whitespace. Here's how to proceed:[1]

while (<>) {
    @tmp = split;           # Split elements into an array.
    push @AoA, [ @tmp ];    # Add an anonymous array reference to @AoA.
}

Of course, you don't need to name the temporary array, so you could also say:

while (<>) {
    push @AoA, [ split ];
}

If you want a reference to an array of arrays, you can do this:

while (<>) {
    push @$ref_to_AoA, [ split ];
}

Both of those examples add new rows to the array of arrays. What about adding new columns? If you're just dealing with two-dimensional arrays, it's often easiest to use simple assignment:[2]

for $x (0 .. 9) {                       # For each row...
    for $y (0 .. 9) {                   # For each column...
        $AoA[$x][$y] = func($x, $y);    # ...set that cell
    }
}

for $x ( 0..9 ) {                       # For each row...
    $ref_to_AoA->[$x][3] = func2($x);   # ...set the fourth column
}

It doesn't matter in what order you assign the elements, nor does it matter whether the subscripted elements of @AoA are already there or not; Perl will gladly create them for you, setting intervening elements to the undefined value as need be. (Perl will even create the original reference in $ref_to_AoA for you if it needs to.) If you just want to append to a row, you have to do something a bit funnier:

# Append new columns to an existing row.
push @{ $AoA[0] }, "wilma", "betty";

Notice that this wouldn't work:

push $AoA[0], "wilma", "betty";  # WRONG!

That won't even compile, because the argument to push must be a real array, not just a reference to an array. Therefore, the first argument absolutely must begin with an @ character. What comes after the @ is somewhat negotiable.

[1] Here as in other chapters, we omit (for clarity) the my declarations that you would ordinarily put in. In this example, you'd normally write my @tmp = split.

[2] As with the temp assignment earlier, we've simplified; the loops in this chapter would likely be written for my $x in real code.

9.1.3. Access and Printing

Now let's print the data structure. If you only want one element, this is sufficient:

print $AoA[3][2];

But if you want to print the whole thing, you can't just say:

print @AoA;         # WRONG

It's wrong because you'll see stringified references instead of your data. Perl never automatically dereferences for you. Instead, you have to roll yourself a loop or two. The following code prints the whole structure, looping through the elements of @AoA and dereferencing each inside the print statement:

for $row ( @AoA ) {
    print "@$row\n";
}

If you want to keep track of subscripts, you might do this:

for $i ( 0 .. $#AoA ) {
    print "row $i is: @{$AoA[$i]}\n";
}

or maybe even this (notice the inner loop):

for $i ( 0 .. $#AoA ) {
    for $j ( 0 .. $#{$AoA[$i]} ) {
        print "element $i $j is $AoA[$i][$j]\n";
    }
}

As you can see, things are getting a bit complicated. That's why sometimes it's easier to use a temporary variable on your way through:

for $i ( 0 .. $#AoA ) {
    $row = $AoA[$i];
    for $j ( 0 .. $#{$row} ) {
        print "element $i $j is $row->[$j]\n";
    }
}

9.1.4. Slices

If you want to access a slice (part of a row) of a multidimensional array, you're going to have to do some fancy subscripting. The pointer arrows give us a nice way to access a single element, but no such convenience exists for slices. You can always extract the elements of your slice one-by-one with a loop:

@part = ();
for ($y = 7; $y < 13; $y++) {
    push @part, $AoA[4][$y];
}

This particular loop could be replaced with an array slice:

@part = @{ $AoA[4] } [ 7..12 ];

If you want a two-dimensional slice, say, with $x running from 4..8 and $y from 7..12, here's one way to do it:

@newAoA = ();
for ($startx = $x = 4; $x <= 8; $x++) {
    for ($starty = $y = 7; $y <= 12; $y++) {
        $newAoA[$x - $startx][$y - $starty] = $AoA[$x][$y];
    }
}

In this example, the individual values within our destination two-dimensional array, @newAoA, are assigned one by one, taken from a two-dimensional subarray of @AoA. An alternative is to create anonymous arrays, each consisting of a desired slice of an @AoA subarray, and then put references to these anonymous arrays into @newAoA. We would then be writing references into @newAoA (subscripted once, so to speak) instead of subarray values into a twice-subscripted @newAoA. This method eliminates the innermost loop:

for ($x = 4; $x <= 8; $x++) {
    push @newAoA, [ @{ $AoA[$x] } [ 7..12 ] ];
}

Of course, if you do this often, you should probably write a subroutine called something like extract_rectangle. And if you do it very often with large collections of multidimensional data, you should probably use the PDL (Perl Data Language) module, available from CPAN.

9.1.5. Common Mistakes

As mentioned earlier, Perl arrays and hashes are one-dimensional. In Perl, even "multidimensional" arrays are actually one-dimensional, but the values along that dimension are references to other arrays, which collapse many elements into one. If you print these values out without dereferencing them, you will get the stringified references rather than the data you want. For example, these two lines:

@AoA = ( [2, 3], [4, 5, 7], [0] );
print "@AoA";

result in something like:

ARRAY(0x83c38) ARRAY(0x8b194) ARRAY(0x8b1d0)

On the other hand, this line displays 7:

print $AoA[1][2];

When constructing an array of arrays, remember to compose new references for the subarrays. Otherwise, you will just create an array containing the element counts of the subarrays, like this:

for $i (1..10) {
    @array = somefunc($i);
    $AoA[$i] = @array;       # WRONG!
}

Here @array is being accessed in a scalar context, and therefore yields the count of its elements, which is dutifully assigned to $AoA[$i]. The proper way to assign the reference will be shown in a moment.

After making the previous mistake, people realize they need to assign a reference, so the next mistake people naturally make involves taking a reference to the same memory location over and over again:

for $i (1..10) {
    @array = somefunc($i);
    $AoA[$i] = \@array;      # WRONG AGAIN!
}

Every reference generated by the second line of the for loop is the same, namely, a reference to the single array @array. Yes, this array changes on each pass through the loop, but when everything is said and done, $AoA contains 10 references to the same array, which now holds the last set of values assigned to it. print @{$AoA[1]} will reveal the same values as print @{$AoA[2]}.

Here's a more successful approach:

for $i (1..10) {
    @array = somefunc($i);
    $AoA[$i] = [ @array ];   # RIGHT!
}

The brackets around @array create a new anonymous array, into which the elements of @array are copied. We then store a reference to that new array.

A similar result--though more difficult to read--would be produced by:

for $i (1..10) {
    @array = somefunc($i);
    @{$AoA[$i]} = @array;
}

Since $AoA[$i] needs to be a new reference, the reference springs into existence. Then, the preceding @ dereferences this new reference, with the result that the values of @array are assigned (in list context) to the array referenced by $AoA[$i]. You might wish to avoid this construct for clarity's sake.

But there is a situation in which you might use it. Suppose @AoA is already an array of references to arrays. That is, you've made assignments like:

$AoA[3] = \@original_array;

And now suppose that you want to change @original_array (that is, you want to change the fourth row of $AoA) so that it refers to the elements of @array. This code will work:

@{$AoA[3]} = @array;

In this case, the reference itself does not change, but the elements of the referenced array do. This overwrites the values of @original_array.

Finally, the following dangerous-looking code actually works fine:

for $i (1..10) {
    my @array = somefunc($i);
    $AoA[$i] = \@array;
}

That's because the lexically scoped my @array variable is created afresh on each pass through the loop. So even though it looks as though you've stored the same variable reference each time, you haven't. This is a subtle distinction, but the technique can produce more efficient code, at the risk of misleading less-enlightened programmers. (It's more efficient because there's no copy in the final assignment.) On the other hand, if you have to copy the values anyway (which the first assignment in the loop is doing), then you might as well use the copy implied by the brackets and avoid the temporary variable:

for $i (1..10) {
    $AoA[$i] = [ somefunc($i) ];
}

In summary:

$AoA[$i] = [ @array ];   # Safest, sometimes fastest
$AoA[$i] = \@array;      # Fast but risky, depends on my-ness of array
@{ $AoA[$i] } = @array;  # A bit tricky

Once you've mastered arrays of arrays, you'll want to tackle more complex data structures. If you're looking for C structures or Pascal records, you won't find any special reserved words in Perl to set these up for you. What you get instead is a more flexible system. If your idea of a record structure is less flexible than this, or if you'd like to provide your users with something more opaque and rigid, then you can use the object-oriented features detailed in Chapter 12, "Objects".

Perl has just two ways of organizing data: as ordered lists stored in arrays and accessed by position, or as unordered key/value pairs stored in hashes and accessed by name. The best way to represent a record in Perl is with a hash reference, but how you choose to organize such records will vary. You might want to keep an ordered list of these records that you can look up by number, in which case you'd use an array of hash references to store the records. Or, you might wish to look the records up by name, in which case you'd maintain a hash of hash references. You could even do both at once, with pseudohashes.

In the following sections, you will find code examples detailing how to compose (from scratch), generate (from other sources), access, and display several different data structures. We first demonstrate three straightforward combinations of arrays and hashes, followed by a hash of functions and more irregular data structures. We end with a demonstration of how these data structures can be saved. These examples assume that you have already familiarized yourself with the explanations set forth earlier in this chapter.