F...ancy PHP: immutable data


27 April 2018

This series is a challenge: what features of the more well-regarded languages does PHP already have? What others can it emulate?

In this article I will tackle immutable data. Spoiler: this is not how PHP chooses to approach data. However, emulating this functionality results in code that is easier to understand and debug. Furthermore, examining in depth how PHP treats data "under the hood" helps improve code performance. Let's dive in!

Why?

Immutable data means that once you create a data structure, its contents cannot be modified. People value immutable data because it decreases uncertainty.

It's easier to track what's happening to data, which means that it's easier to understand code and find bugs. Immutable data protects you against concurrency conflicts due to shared state and various other heisenbugs.

Bad news: immutable data structures tend to use more memory (unless the language is specifically optimised for them, which PHP isn't). Also, PHP isn't exactly known for its concurrency ¯\_(ツ)_/¯

A value by any other name

When processing data in PHP, you will mostly be operating on scalars (int: 1, string: "Cthulhu", boolean: true) and compound types (array: [1,1,2,3] and object: new \ArrayIterator([0,1])).

To store data, you assign a value to a variable, for which you need a mechanism that allows you to associate the variable name with the stored value. That mechanism is called a symbol table (where the name of the variable is, in this case, a symbol). Every scope has its own symbol table. (Additionally, each array or object has its own symbol tables for keys, methods and properties.)
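One way to peek at a scope's symbol table is the built-in get_defined_vars(), which returns the variables defined in the current scope. The sketch below is only an illustration; the function name localSymbols and the variable names are made up for this example.

```php
<?php
// Hypothetical helper: reports the variable names in its own symbol table.
function localSymbols(): array
{
    $inner = 'only visible inside the function';

    // get_defined_vars() lists the variables defined in the current scope.
    return array_keys(get_defined_vars());
}

$outer = 'only visible in the outer scope';

// The function scope knows about $inner but not about $outer.
$functionSymbols = localSymbols();

// The outer scope, in turn, has no entry for $inner.
$outerHasInner = array_key_exists('inner', get_defined_vars());
```

Each call to a function gets a fresh table, which is why $inner and $outer never see each other.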

There are two ways a variable can refer to a value: either directly, or as a reference to another variable. PHP references function as aliases (for other symbols) and are often explained by an analogy to hard links in a filesystem.

$a = 1;
$b = $a;  // assignment by value

$b = 2;   // $a is still 1

$a = 1;
$b = &$a; // $b is now an alias of $a

$b = 2;   // $a is now 2 as well

From this we can conclude that references break the guarantee of immutable data: they allow you to modify the contents of a variable without referring to the variable directly.
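A common place where this bites is a foreach loop that iterates by reference. The following is a minimal sketch with made-up variable names: the loop variable stays bound to the last array element after the loop ends, so a later, seemingly unrelated assignment silently changes the array.

```php
<?php
$prices = [1, 2, 3];

foreach ($prices as &$price) {
    $price *= 2;
}
// $prices is now [2, 4, 6], and $price still aliases the last element.

// This looks like a plain assignment to a scratch variable...
$price = 'oops';
// ...but it has overwritten the last element of $prices.

// unset() removes the alias; it does not undo the modification.
unset($price);
```

Calling unset() on the loop variable straight after the loop is the usual way to avoid this.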

Passing by value, passing by reference

So what about function arguments? They do seem like an alias pointing to the same value, but they're not. In fact, unless you declare that an argument is passed by reference, calling a function creates new copies of arguments.

$a = 1;
(function ($x) { $x = 2; })($a);

// $a is still 1

$a = 1;
(function (&$x) { $x = 2; })($a);

// $a is now 2

Again, references mess up any guarantees of independence.

Are objects passed by reference?

You'll have heard that in PHP, "objects are passed by reference" while "everything else is passed by value". Well, not quite. What PHP stores as object "values" are actually object identifiers, which can then be used to look up the actual object in some other location in memory. (See spl_object_id, or spl_object_hash for versions earlier than PHP 7.2.) You can think of them as pointers. Just as with pointers, once an object has been garbage collected, the identifier may be reused to store some other object.

When you pass an object by value, what actually happens is that the function symbol table acquires a copy of the identifier, which still points to the same object. Like so:

$a = new \ArrayObject([1, 2]);
(function ($x) { $x[0] = 'head'; })($a);

// $a now contains \ArrayObject(['head', 2])
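The difference from a true reference shows up when the function reassigns its argument instead of modifying the object it points to. In this sketch (the variable names are illustrative), $original keeps a second copy of the identifier so we can check which object $a ends up pointing to.

```php
<?php
$a = new \ArrayObject([1, 2]);
$original = $a; // a second copy of the same identifier

// Passed by value: $x is a copy of the identifier.
// Reassigning $x only changes the copy, so $a is unaffected.
(function ($x) { $x = new \ArrayObject(['replaced']); })($a);
$sameAfterByValue = ($a === $original);

// Passed by reference: $x is an alias of $a.
// Reassigning $x makes $a point to a different object.
(function (&$x) { $x = new \ArrayObject(['replaced']); })($a);
$sameAfterByReference = ($a === $original);
```

Here $sameAfterByValue is true and $sameAfterByReference is false: only the reference lets the function swap the object out from under the caller.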

The distinction between an object identifier and a reference is important, because it explains how objects will behave in more complex situations. For example, if you pass an array of objects to a function, the function argument will contain a copy of the array with a copy of the pointers, which will then point to the same object bodies as the external scope. Observe:

$a = [[1, 2], 2];
(function ($x) { $x[0][0] = 'head'; })($a);

// $a still contains [[1, 2], 2]

$a = [new \ArrayObject([1, 2]), 2];
(function ($x) { $x[0][0] = 'head'; })($a);

// however NOW $a contains [\ArrayObject(['head', 2]), 2]

In short: if you use objects in PHP, you implicitly fail the immutability guarantee. (Resources, like file handles and streams, exhibit the same problem.)

It's sheep all the way down

...or do you? While PHP doesn't automatically create a copy of an object, it is possible to do so manually, using the clone keyword. A cloned object contains a copy of the properties of the original. To follow on from the example above:

$a = [new \ArrayObject([1, 2]), 2];
(function ($x) { $x[0] = clone $x[0]; $x[0][0] = 'head'; })($a);

// $a still contains [\ArrayObject([1, 2]), 2]

There are however gotchas to this solution.

Firstly, you can only clone an object. It's easy to remember why: after copying property values, clone attempts to call a magic method __clone() on the new object to finalise setting up the clone. Since scalars and arrays cannot define methods, they cannot be cloned, and an attempt to do so fails with an error ("__clone method called on non-object"). (Implementing __clone() is optional, but it can be useful for handling things that cannot be duplicated automatically.)

Furthermore, what you get is called a shallow clone. As in the previous example, if an object has properties referring to other objects, the clone will contain copies of object pointers which still point to the same object bodies.

The opposite of a shallow clone is a deep clone, which means you have to implement a method on your class that will clone all properties referring to objects, and then their properties referring to objects, and then...

As you can see, deep cloning is expensive.
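To make the difference concrete, here is a minimal sketch using hypothetical classes (Address, ShallowPerson and DeepPerson are invented for this example). The only difference between the two person classes is a __clone() method that clones the nested object.

```php
<?php
// Hypothetical classes, for illustration only.
class Address
{
    public $city;

    public function __construct(string $city)
    {
        $this->city = $city;
    }
}

class ShallowPerson
{
    public $address;

    public function __construct(Address $address)
    {
        $this->address = $address;
    }
}

class DeepPerson extends ShallowPerson
{
    // Called on the new copy, after its properties have been copied.
    public function __clone()
    {
        // Replace the shared identifier with a clone of the nested object.
        $this->address = clone $this->address;
    }
}

// Shallow clone: both objects share one Address.
$shallow = new ShallowPerson(new Address('London'));
$shallowCopy = clone $shallow;
$shallowCopy->address->city = 'Paris';
// $shallow->address->city is now 'Paris' too

// Deep clone (one level): each object has its own Address.
$deep = new DeepPerson(new Address('London'));
$deepCopy = clone $deep;
$deepCopy->address->city = 'Paris';
// $deep->address->city is still 'London'
```

Note that this only goes one level down. If Address itself held objects, it would need its own __clone() as well.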

Constants: not a solution

PHP constants solve the problem of immutability only in trivial cases, so I'm going to cover them for completeness' sake.

Constants permanently bind a name to a certain value. Constant values can only be scalars or arrays of scalars (nesting permitted), which makes sense given what we've just covered with regards to objects and immutability.

In PHP, you refer to a constant by its name without the $ prefix, or by calling the constant() function. (There is no way to dereference a constant in the same way as a variable, that is: $$varName.)

The const keyword defines constants in the current namespace (or in a class), at compile time. Since they are defined at compile time, they can only contain static values or results of simple expressions such as 2 + 2 or 2 * ANOTHER_CONSTANT.

The define() function defines constants in global scope, at runtime. This is why they can contain results of function calls and more complicated expressions.
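A short sketch of both forms, assuming PHP 5.6 or later for constant expressions and array constants. All constant names here are invented for illustration.

```php
<?php
// Compile time: literals and simple constant expressions only.
const BASE = 2;
const DOUBLED = 2 * BASE;
const PRIMES = [2, 3, [5, 7]]; // nested arrays are allowed

// Runtime: the value may come from a function call.
define('STARTED_AT', time());

// Looking a constant up by a name held in a variable.
$name = 'DOUBLED';
$viaFunction = constant($name);

// Assigning a constant to a variable gives you an independent value;
// changing it has no effect on the constant.
$copy = PRIMES;
$copy[0] = 'changed';
// PRIMES[0] is still 2
```

The last few lines show the one thing constants do well: nothing you do to $copy can reach back into PRIMES.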

Neither is really sufficient to emulate immutable data. As not everything can be calculated at compile time, const is not good enough. define() is even worse, since it adds everything to global scope.

Is it worth it?

PHP doesn't provide immutable data structures out of the box (except with constants, which, as we have covered above, have limited usability). You can approximate this feature through mental discipline. See if you can make it impossible for any piece of data that you can refer to in your current scope to be modified by external factors.

This challenge will force you to pay attention to how you pass data around, where you allow it to be modified, where the memory gets used. It will get you thinking about efficiency, safety, consistency and design. You will definitely keep using objects and perhaps references anyway - but you will be more conscious of the tradeoffs you make, and of possible risks and side effects.

As a result, you will become a better engineer.
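One common way to practise this discipline is a value object that never changes after construction and hands out modified copies instead. The class below is a hypothetical sketch (Money, withAmount and the other names are invented for this example). It relies on convention rather than on anything the language enforces.

```php
<?php
// Hypothetical value object, for illustration only.
final class Money
{
    // Private properties: no code outside the class can assign to them.
    private $amount;
    private $currency;

    public function __construct(int $amount, string $currency)
    {
        $this->amount = $amount;
        $this->currency = $currency;
    }

    public function amount(): int
    {
        return $this->amount;
    }

    public function currency(): string
    {
        return $this->currency;
    }

    // Instead of a setter, return a modified copy and leave $this alone.
    public function withAmount(int $amount): self
    {
        $copy = clone $this;
        $copy->amount = $amount;

        return $copy;
    }
}

$price = new Money(100, 'EUR');
$discounted = $price->withAmount(80);
// $price still holds 100; $discounted is a separate object holding 80
```

Because the properties are scalars, a shallow clone is enough here. A value object holding other objects would have to clone them too, or require that they follow the same rules.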
