Generators in PHP: a new generation of simple iterators

Every time a new version of PHP is released, I feel like a child opening Christmas gifts. There are a lot of sweet treats under the hood that can make your code more elegant and faster. One of them is generators, a concept introduced in PHP 5.5. Unfortunately, the dry and rather abstract examples usually given for this feature make it hard to understand, accept and put into practice. So this text describes some practical use cases where generators give more power to your code.

Generators and the yield statement

The official description of this feature can be found in the PHP manual. Generators can be used as slim and powerful iterators, they can help you organise the life cycle of resources more efficiently, and they give you a very flexible way of creating lazy collections without repeating loops.

Generators help to create iterators very easily

PHP generators are nothing more than iterators that help you loop over a big collection of data with minimal memory consumption. Generators implement the well-known Iterator pattern, which lets you create large, lazy data sets. "Lazy" in this case means that you create an abstraction that potentially has access to a big data set but doesn't load it into memory until you start looping over it. Imagine that you have a CSV file with over 1M rows, or a comparable list of values in a cache storage, e.g. with user messages. You want to calculate a rank value for every row and let the user download the generated list from the browser. The most obvious strategy is to read this data set into an array, loop over it and apply your ranking function to every element. It might look like this:

<?php
class DataProvider
{
    /**
     * @return array
     */
    public function getRankedRows()
    {
        $dataSource = new LargeDataSource();
        $rankCalculator = new RankCalculator();
        $rows = $dataSource->getRecords(1000000);
        $resultRows = [];
        foreach ($rows as $row) {
            $rank = $rankCalculator->calculateRank($row);
            $resultRows[$row['id']] = $rank;
        }

        return $resultRows;
    }
}

...
//somewhere in the view
$dataProvider = new DataProvider();
foreach ($dataProvider->getRankedRows() as $rowId => $rowValue){
    echo $rowId . ':' . $rowValue;
    flush();
}

As you can see, we've put a big collection of rows into memory and applied the rank calculation to every element in it. The whole operation will probably cost us a lot of RAM. Moreover, we iterate twice over millions of rows (once in the data provider and once in the view), which also increases the execution time. To get rid of one loop we could move the rank calculation to the view, but this adds another responsibility to the presentation layer. Imagine we also want to serve the ranks in JSON format: we would have to repeat the same loop in another representation. So the view is definitely the wrong place to do it. Let's try another approach. We are going to create an Iterator:

class DataProvider implements Iterator
{
    private $dataSource;

    private $rankCalculator;

    private $position = 0;

    public function __construct()
    {
        $this->dataSource = new LargeDataSource();
        $this->rankCalculator = new RankCalculator();
    }

    public function current()
    {
        $row =  $this->dataSource->getRecord($this->position);
        $row['rank'] = $this->rankCalculator->calculateRank($row);
        
        return $row;
    }

    public function next()
    {
        ++$this->position;
    }

    public function key()
    {
        return $this->position;
    }

    public function valid()
    {
        return $this->dataSource->exists($this->position);
    }

    public function rewind()
    {
        $this->position = 0;
    }
}

...
//somewhere in the view
$dataProvider = new DataProvider();
foreach ($dataProvider as $row){
    echo $row['id'] . ':' . $row['rank'];
    flush();
}...

This approach has huge advantages. We still keep the responsibility for ranking outside of the view, and we don't load the whole data collection into memory. The view gets the DataProvider iterator, which holds only the "promised" iteration logic rather than the data itself. This is called a "lazy" loading strategy because our iterator doesn't do anything until it is used in a loop. With this approach we need only one loop, and the rank calculation is performed on the fly:

echo $row['id'] . ':' . $row['rank']; // by the time we reach this line, foreach has already called current() on $dataProvider, which did the rank calculation

Moreover, there is no need to hold the whole collection of rows in memory! We are streaming it directly to the client, and memory is used only for a single row at a time. Rows that have already been sent are freed by PHP. This is perfect, and now it's time to replace the iterator with a generator, which is essentially the same thing. Let's look at the rewritten code:

class DataProvider implements IteratorAggregate
{
    private $dataSource;

    private $rankCalculator;

    public function __construct()
    {
        $this->dataSource = new LargeDataSource();
        $this->rankCalculator = new RankCalculator();
    }
    // IteratorAggregate requires us to implement this method
    public function getIterator()
    {
        $key = 0;
        foreach ($this->dataSource->getRow($key) as $row) {
            $row['rank'] = $this->rankCalculator->calculateRank($row);
            $key++;
            yield $row;
        }
    }
}

A function containing a "yield" statement is called a generator function. In the example above, getIterator() returns a Generator object (which, as we know, is also an iterator) simply because it has the yield keyword in its body. It sounds strange, but you can prove it by calling

$iterator = (new DataProvider())->getIterator();
echo get_class($iterator); // prints "Generator"
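
To make this concrete, here is a minimal, self-contained sketch. The function name countTo and the ArrayObject log are hypothetical helpers, not part of the article's classes. It illustrates two things: the object returned by a generator function is both a Generator and an Iterator, and none of the function body runs until iteration actually starts.

```php
<?php
// Hypothetical generator function, used only for illustration.
function countTo($limit, ArrayObject $log)
{
    $log->append('generator body started');
    for ($i = 1; $i <= $limit; $i++) {
        yield $i;
    }
}

$log = new ArrayObject();
$numbers = countTo(3, $log);

// Calling the function only creates the Generator object.
$isGenerator = $numbers instanceof Generator; // true
$isIterator = $numbers instanceof Iterator;   // true
$entriesBeforeLoop = $log->count();           // 0, the body has not run yet

// The body starts running only when we begin to iterate.
$collected = [];
foreach ($numbers as $number) {
    $collected[] = $number;
}
// $collected is now [1, 2, 3] and the log holds exactly one entry.
```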

To output the result we still use the same code:

$dataProvider = new DataProvider();
foreach ($dataProvider as $row){
    echo $row['id'] . ':' . $row['rank'];
    flush();
}...

You might have noticed that we have 2 loops again, but they run interleaved rather than one after the other. Each time the view asks the data provider for the next element, PHP runs the iteration logic inside getIterator() until it reaches the "yield" statement. What happens on the "yield" line? Once we call

foreach ($this->dataProvider as $row)

on the first iteration, this code will be executed:

$key = 0;
foreach ($this->dataSource->getRow($key) as $row) {
   $row['rank'] = $this->rankCalculator->calculateRank($row);
   $key++;
...

Once we reach the yield instruction, we jump to echo $row['rank'] in the view's loop body and then back to the beginning of the view's loop. Then the second iteration of foreach ($this->dataProvider as $row) is triggered, and the generator resumes right after the yield with all of its local variables preserved:

// $key = 0 is not executed again: the generator continues inside its loop with $key === 1 and simply moves on to the next $row
foreach ($this->dataSource->getRow($key) as $row) {
            $row['rank'] = $this->rankCalculator->calculateRank($row);
            $key++;
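
The jumping back and forth described above can be traced with a small sketch. The function produceRows and the shared ArrayObject trace are hypothetical and exist only to record the order of execution.

```php
<?php
// Hypothetical producer that records every step it takes.
function produceRows(ArrayObject $trace)
{
    foreach (['a', 'b'] as $row) {
        $trace->append("generator: before yield $row");
        yield $row; // execution pauses here and control goes to the consumer
        $trace->append("generator: resumed after yield $row");
    }
}

$trace = new ArrayObject();
foreach (produceRows($trace) as $row) {
    $trace->append("consumer: got $row");
}

$steps = $trace->getArrayCopy();
// $steps is, in order:
//   generator: before yield a
//   consumer: got a
//   generator: resumed after yield a
//   generator: before yield b
//   consumer: got b
//   generator: resumed after yield b
```

The two loops take turns: the generator runs up to yield, the consumer handles one row, and only then does the generator continue from the line after yield.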

As you might have noticed, we didn't implement a single method of the Iterator interface but still got similar, slimmer functionality just by using a function with a "yield" instruction. Another use case for generators is the logic for releasing external resources like connections or file handles. Let's have a look at it.

Easy logic for releasing resources in Iterators

Let's continue with the previous example. Imagine you need to open a connection to the LargeDataSource once before making requests and close it after the loop has finished. With the "classical" Iterator I might solve it like this:

class DataProvider implements Iterator
{
    ...

    public function __construct()
    {
        $this->dataSource = new LargeDataSource();
        $this->dataSource->connect();
        ...

But this is very bad, because I open a connection without using it yet (that happens later, when I start iterating). Maybe I will create a DataProvider but never iterate over it because of some condition, in which case the connection opened in the constructor is wasted. I would suggest something like this:

class DataProvider implements Iterator
{
...
 public function current()
 {
    if(!$this->dataSource->isConnected()) {
        $this->dataSource->connect();
    }
...

This solution is "lazy" enough: I open the connection only once, on the request for the first element. For all other elements the connection is already open, so I will not open it again. With a little overhead for the additional check we've implemented lazy loading of the resource. To release the connection I would suggest something like:

class DataProvider implements Iterator
{
 ...
 public function __destruct()
 {
    if ($this->dataSource->isConnected()) {
        $this->dataSource->disconnect();
    }
 }
 ...

It's not a perfectly clean solution, as I rely on the destructor, and I don't control exactly when PHP destroys the object. I once even experienced that the destructor was not called (within a noticeable period of time). Let's return to our generator code:

class DataProvider implements IteratorAggregate
{
    ...
    public function getIterator()
    {
        $this->dataSource->connect();
        $key = 0;
        foreach ($this->dataSource->getRow($key) as $row) {
            $row['rank'] = $this->rankCalculator->calculateRank($row);
            $key++;
            yield $row;
        }
        $this->dataSource->disconnect();
    }
}
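
One caveat about the method above: the disconnect() call after the loop is reached only if the consumer iterates all the way to the end. If the consumer may break out early, a more defensive sketch wraps the loop in try/finally, because PHP runs the finally block when a suspended generator is destroyed. FakeDataSource below is a hypothetical stand-in for LargeDataSource, and the rank calculation is left out to keep the example short.

```php
<?php
// Hypothetical stand-in for the article's LargeDataSource.
class FakeDataSource
{
    public $log = [];

    public function connect()
    {
        $this->log[] = 'connect';
    }

    public function disconnect()
    {
        $this->log[] = 'disconnect';
    }

    public function getRows()
    {
        return [['id' => 1], ['id' => 2], ['id' => 3]];
    }
}

function readRows(FakeDataSource $source)
{
    $source->connect();
    try {
        foreach ($source->getRows() as $row) {
            yield $row;
        }
    } finally {
        // Runs when the loop completes AND when the generator is destroyed early.
        $source->disconnect();
    }
}

$source = new FakeDataSource();
$rows = readRows($source);
foreach ($rows as $row) {
    if ($row['id'] === 2) {
        break; // the consumer stops before the generator is exhausted
    }
}
$logAfterBreak = $source->log; // ['connect'], the generator is only suspended

unset($rows);                  // destroying the generator triggers the finally block
$logAfterUnset = $source->log; // ['connect', 'disconnect']
```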

Unbelievable, but this works perfectly. On the first iteration we connect to the data source. As we know, the "yield" instruction returns $row to the main loop in the view and freezes the loop in the generator in its current state (where $key > 0). Moreover, once the main loop ends, we execute this code

$this->dataSource->disconnect()

at the right moment, exactly after the iteration. We gain full control over the life cycle of our connection! We acquire the resource once, just before the iteration, and release it right after it. You can use the same logic for file handles: open a file in read mode, iterate over it, and close the handle when reading finishes. No more releasing in an unreliable destructor! Let's apply PHP generators to a real-world task. Imagine a company which aggregates prices for electronic goods from different retailers.

The main goal is to generate some price comparison analytics (e.g. the average price for each product, a comparison matrix, price dynamics, max and min prices etc.). Every day each retailer uploads a list of product codes with prices to some location. You will definitely need to import those files into your database every day or even more often. For simplicity, imagine that every file is in CSV format. What differs is the number of columns, their positions, the separators and the format of the ids. Every file should be validated against a set of rules, and the price format should be adjusted to fulfil the company's requirements.

For every single row we want to validate the number of columns and the format of the product id, and we want to convert the price to an integer format (e.g. $29.9 to 299). As we cannot load every file into memory in full, we have to use iterators. We also want to filter out lines with a wrong number of columns or a wrong code format. Let's define a Filter interface:

interface Filter
{
    /**
     * @param array $row
     * @return bool
     */
    public function matches($row);
}

We also want to convert prices to a universal integer format, but as the price formats might differ from retailer to retailer, let's create another interface and call it Mutator:

interface Mutator
{
    /**
     * @param array $row
     * @return array
     */
    public function modify($row);
}

Further on we will need 2 filters and 1 mutator.

class ColumnsCountFilter implements Filter
{
    /**
     * @var int
     */
    private $expectedCount;

    /**
     * @param int $expectedCount
     */
    public function __construct($expectedCount)
    {
        $this->expectedCount = $expectedCount;
    }

    /**
     * @param array $row
     * @return bool
     */
    public function matches($row)
    {
        return $this->expectedCount == count($row);
    }
}
class CodesFormat implements Filter
{
    /**
     * @var string
     */
    private $expectedPattern;

    /**
     * @var int
     */
    private $codeColumnNumber;

    /**
     * @param string $expectedPattern
     * @param int $codeColumnNumber
     */
    public function __construct($expectedPattern, $codeColumnNumber)
    {
        $this->expectedPattern = $expectedPattern;
        $this->codeColumnNumber = $codeColumnNumber;
    }

    /**
     * @param array $row
     * @return bool
     */
    public function matches($row)
    {
        if (empty($row[$this->codeColumnNumber])) {
            return false;
        }

        // preg_match() returns an int (or false), so compare explicitly to return a real bool
        return preg_match($this->expectedPattern, $row[$this->codeColumnNumber]) === 1;
    }
}
class PriceConverter implements Mutator
{
    /**
     * @var int
     */
    private $priceColumnNumber;

    /**
     * @param int $priceColumnNumber
     */
    public function __construct($priceColumnNumber)
    {
        $this->priceColumnNumber = $priceColumnNumber;
    }

    /**
     * @param array $row
     * @throws Exception
     * @return array
     */
    public function modify($row)
    {
        if (empty($row[$this->priceColumnNumber])) {
            throw new Exception('Wrong price column number');
        }
        $row[$this->priceColumnNumber] = str_replace(array('$', '.', ','), '', $row[$this->priceColumnNumber]);

        return $row;
    }
}

The classes are self-explanatory. ColumnsCountFilter validates the number of columns, CodesFormat rejects rows with malformed codes and PriceConverter converts prices to the integer format. As an example I have taken a file in the following format:

121,$91.0,
122,$99.0,
123,$100.0,

Let's create a php Generator for importing a csv file:

$csvUploaderFunc = function ($fileName, $delimiter) {
    $f = fopen($fileName, 'r');
    if (!$f) throw new Exception();
    while ($line = fgetcsv($f, null, $delimiter)) {
        yield $line;
    }
    fclose($f);
};

Our next step is to implement an iterator for filtering:

$filterFunc = function (Iterator $iterator, Filter $filter) {
    while ($iterator->valid()) {
        $value = $iterator->current();
        if ($filter->matches($value)) {
            yield $value;
        }
        $iterator->next();
    }
};

And the last one is an iterator for changing values:

$mutatorFunc = function (Iterator $iterator, Mutator $mutator) {
    while ($iterator->valid()) {
        yield $mutator->modify($iterator->current());
        $iterator->next();
    }
};
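
As an aside, the manual valid()/current()/next() calls in the two generators above can usually be replaced by a plain foreach, which also accepts arrays. The sketch below is one possible variant, not the article's implementation: the names filterRows and mapRows are hypothetical, and plain callables are used instead of the Filter and Mutator interfaces to keep it self-contained.

```php
<?php
// Hypothetical helpers: both accept any array or Traversable.
function filterRows($rows, callable $matches)
{
    foreach ($rows as $key => $row) {
        if ($matches($row)) {
            yield $key => $row;
        }
    }
}

function mapRows($rows, callable $modify)
{
    foreach ($rows as $key => $row) {
        yield $key => $modify($row);
    }
}

$rows = [
    ['121', '$91.0', ''],
    ['12x', '$99.0', ''], // malformed id, should be dropped
    ['123', '$100.0', ''],
];

$withNumericIds = filterRows($rows, function ($row) {
    return preg_match('#^\d+$#', $row[0]) === 1;
});
$converted = mapRows($withNumericIds, function ($row) {
    $row[1] = str_replace(['$', '.', ','], '', $row[1]);
    return $row;
});

// Nothing has been processed so far; iterator_to_array() drives the whole chain.
$result = iterator_to_array($converted, false);
// $result is [['121', '910', ''], ['123', '1000', '']]
```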

Having done this, we can combine all of them into one Iterator using the decorator pattern:

$linesCollection = $csvUploaderFunc('/retailer1.csv', ',');
$linesCollection = $filterFunc($linesCollection, new ColumnsCountFilter(3));
$linesCollection = $filterFunc($linesCollection, new CodesFormat('#^\d*$#', 0));
$linesCollection = $mutatorFunc($linesCollection, new PriceConverter(1));

Let's execute the iteration:

foreach ($linesCollection as $dat) {
    echo json_encode($dat);
}

The output will be:

["121","910",""]["122","990",""]["123","1000",""]

You can test this code by modifying the CSV file: add another column to any line, or change the value in the first column to something non-numeric, and you will see that those lines are discarded from the output. The advantages of the suggested solution are:

  1. The code is very slim. Try to implement similar functionality with the Iterator interface: your code will be twice as long.
  2. We don't load the whole file into memory, we just process it line by line. In a real-world situation you would directly insert every row (or maybe a batch of them) into a database.
  3. The whole operation is executed with one single loop!
  4. We are providing a really flexible solution. You can easily add new filters and mutators, you can vary any rule for any retailer independently.
  5. We can enjoy all the advantages of a lazy solution. We can attach filters and modifiers in any part of the code, with any business logic, just by wrapping the main iterator, which grows in functionality while keeping the nature of an iterator.

Someone might say that this code has become more functional than object oriented. OK, let's try to predict what the following code will output:

echo get_class($linesCollection);

To my astonishment it outputs a class name: Generator! And of course Generator implements the Iterator interface. This means we can hide all the complexity of the generator/iterator definition in a factory class:

class IteratorsFactory
{
    /**
     * @param string $fileName
     * @param string $delimiter
     * @return \Iterator
     */
    public function getFileIterator($fileName, $delimiter)
    {
        $funct = function ($fileName, $delimiter) {
            $f = fopen($fileName, 'r');
            if (!$f) throw new Exception();
            while ($line = fgetcsv($f, null, $delimiter)) {
                yield $line;
            }
            fclose($f);
        };

        return $funct($fileName, $delimiter);
    }

    /**
     * @param Iterator $iterator
     * @param Filter $filter
     * @return \Iterator
     */
    public function getFilterIterator(Iterator $iterator, Filter $filter){
        $filterFunc = function (Iterator $iterator, Filter $filter) {
            while ($iterator->valid()) {
                $value = $iterator->current();
                if ($filter->matches($value)) {
                    yield $value;
                }
                $iterator->next();
            }
        };

        return $filterFunc($iterator, $filter);
    }

    /**
     * @param Iterator $iterator
     * @param Mutator $mutator
     * @return \Iterator
     */
    public function getMutatorIterator(Iterator $iterator, Mutator $mutator){
        $mutatorFunc = function (Iterator $iterator, Mutator $mutator) {
            while ($iterator->valid()) {
                yield $mutator->modify($iterator->current());
                $iterator->next();
            }
        };

        return $mutatorFunc($iterator, $mutator);
    }
}

Then we change the creation of the iterators to:

$iteratorsFactory = new IteratorsFactory();
$linesCollection = $iteratorsFactory->getFileIterator('/Users/breathbath/retailer1.csv', ',');
$linesCollection = $iteratorsFactory->getFilterIterator($linesCollection, new ColumnsCountFilter(3));
$linesCollection = $iteratorsFactory->getFilterIterator($linesCollection, new CodesFormat('#^\d*$#', 0));
$linesCollection = $iteratorsFactory->getMutatorIterator($linesCollection, new PriceConverter(1));

As you see, our code has become purely OOP in design. What is left to discuss is the possibility of sending values into a generator. This could look like this:

function dataProcessor($data)
{
    for ($i = 0; $i < count($data); $i++) {
        $toBreak = (yield $data[$i]);
        if ($toBreak) {
            break;
        }
    }
}

$data = str_split('pink umbrella ', 1);
$dataIter = dataProcessor($data);

$dat = '';
while ($dataIter->valid()) {
    $curr = $dataIter->current();
    if ($curr == ' ') {
        $dataIter->send(true);
    }
    $dat .= $curr;
    $dataIter->next();
}
echo $dat;

The output will be:

pink

Look at this line:

$toBreak = (yield $data[$i]);

Split it into two parts: "yield $data[$i]" returns the current element of the collection to the caller, and "$toBreak = yield ..." receives whatever you pass to the "send" method. send() is a method of the Generator class and can be used to pass a value into the generator's loop from outside.
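
One detail is easy to miss: send() does not only deliver a value, it also resumes the generator and returns whatever the next yield produces. The following sketch illustrates this with a hypothetical counter generator (not part of the article's code). Because send() already advances the generator, calling next() right after it would skip a value.

```php
<?php
// Hypothetical generator that counts up until it is told to stop.
function counter()
{
    $i = 0;
    while (true) {
        // Receives the value passed to send(); null when advanced by next() or foreach.
        $command = yield $i;
        if ($command === 'stop') {
            return; // an empty return ends the generator
        }
        $i++;
    }
}

$gen = counter();
$first = $gen->current();   // 0, runs the generator up to the first yield
$second = $gen->send('go'); // 1, resumes and returns the next yielded value
$third = $gen->send('go');  // 2
$last = $gen->send('stop'); // null, the generator finished without yielding again
$finished = !$gen->valid(); // true
```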

Generators in PHP 7

In PHP 7 the developers of the language added some more useful features to generators, in particular generator return expressions and generator delegation. Let's try to find a practical way to approach these new features.

Generator return expressions

Let's start with an example. Imagine we have a general queue service that receives media data objects as DTOs (data transfer objects) from different parts of the application for later processing and persistence. For example, for all uploaded images we want to trigger resizing and minifying tasks. So let's create a simple DTO:

class ImageDto
{
    public $location;

    /**
     * ImageDto constructor.
     * @param $location
     */
    public function __construct(string $location)
    {
        $this->location = $location;
    }
}

Let's implement the queue service that will hold the queued items and return them to a requester in FIFO order:

class QueueService
{
    /**
     * @var SplQueue
     */
    private $queue;

    /**
     * @var array
     */
    private $secureDtoClasses;

    /**
     * QueueService constructor.
     * @param array $secureDtoClasses
     */
    public function __construct(array $secureDtoClasses = [])
    {
        $this->queue = new SplQueue();
        $this->secureDtoClasses = $secureDtoClasses;
    }

    /**
     * @return Generator
     */
    public function read()
    {
        foreach ($this->queue as $item) {
            yield unserialize($item, ["allowed_classes" => $this->secureDtoClasses]); //we use another feature of php 7 called filtered unserialise
        }
    }

    /**
     * @param object $dto
     */
    public function add($dto)
    {
        $dtoSerialised = serialize($dto);
        $this->queue->enqueue($dtoSerialised);
    }
}

In the code above we use a generator in the queue service to spare ourselves an Iterator implementation. Filtered unserialization converts only the objects we know (in this case ImageDto); all unknown objects are converted to a __PHP_Incomplete_Class instance, which allows you to use unserialization in a secure way. Now I want to implement a function that filters out everything that is not an ImageDto instance and counts the rejected items:

$readImagesFromQueueFunc = (function () use ($queueService) {
    $unrecognisedItemsCount = 0;
    foreach ($queueService->read() as $dequeuedItem) {
        if (!$dequeuedItem instanceof ImageDto) {
            $unrecognisedItemsCount++;
            continue;
        }
        yield $dequeuedItem;
    }
    return $unrecognisedItemsCount;
})();

As you see, this is another generator that iterates over our first QueueService generator, yields only the images and counts the items of unrecognised classes. This count is returned after the loop. It could be important if you want to monitor the amount of noisy data in the queue and maybe perform some security checks. Note that $queueService is captured by value at the moment the closure is created, so the following setup has to run before the snippet above. Let's create our queue service:

$queueService = new QueueService([ImageDto::class]);
$queueService->add(new ImageDto('location1'));
$queueService->add(new ImageDto('location2'));
$queueService->add(new ImageDto('location3'));
$queueService->add(new stdClass());

Here we simulate the situation where the queue stores different types of DTOs (as it is data format agnostic). Let's try to read only the images from our queue:

foreach ($readImagesFromQueueFunc as $imageItem) {
    $this->processImageItem($imageItem);
}

Here you can see the full power of the generator concept: behind this simple loop there are 2 other loops (the queue reading service and the image filtering iterator), which run interleaved rather than sequentially! PHP 7 introduced the possibility to read return values from generators. In this case, after processing all queued images, I want to see the number of messages that were not processed (because they had the wrong class):

echo $readImagesFromQueueFunc->getReturn();

This feature is really useful, especially if you want to collect monitoring statistics from all the generators you use.
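
The timing of getReturn() matters: the return value exists only after the generator has actually returned, and calling getReturn() earlier throws an exception. Here is a minimal sketch of both cases, using a hypothetical evenNumbers generator instead of the queue classes.

```php
<?php
// Hypothetical generator: yields even numbers and returns how many odd ones it skipped.
function evenNumbers(array $numbers)
{
    $skipped = 0;
    foreach ($numbers as $number) {
        if ($number % 2 !== 0) {
            $skipped++;
            continue;
        }
        yield $number;
    }

    return $skipped; // a return value in a generator is allowed since PHP 7
}

// Asking too early: this generator has not returned yet, so getReturn() throws.
$tooEarly = false;
try {
    evenNumbers([1, 2, 3])->getReturn();
} catch (Exception $e) {
    $tooEarly = true;
}

// Asking after the loop has run to completion is safe.
$generator = evenNumbers([1, 2, 3, 4, 5]);
$evens = [];
foreach ($generator as $number) {
    $evens[] = $number;
}
$skippedCount = $generator->getReturn();
// $evens is [2, 4] and $skippedCount is 3
```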

Generator delegation

A short introduction to it can be found in the PHP manual. In fact, this feature allows you to merge generators so that you have a single convenient way to access them. Imagine that we have a service that gives information about all known buyers. We also have a completely separate service that "knows" everything about our potential customers. Every service has its own context and its own representation of the client, so it's not right to say that they both operate on the same "customer" business entity. Let's implement each of them:

class BuyersProvider
{
    /**
     * @var array
     */
    private $orders = [
        ['id' => 'order1', 'client' => ['name' => 'Bob']],
        ['id' => 'order2', 'client' => ['name' => 'Alice']],
    ];

    /**
     * @return Generator
     */
    public function getBuyers()
    {
        return (function () {
            foreach ($this->orders as $order) {
                yield $order['client']['name'];
            }
        })();
    }
}

As you see, the first service views the customer in an order context. You might also have noticed that it returns a generator as a collection of buyers' names (as any function containing a yield expression is handled as a generator). Let's create another one:

class PotentialCustomersProvider implements IteratorAggregate
{
    /**
     * @var array
     */
    private $customers = [
        ['id' => 'client1', 'name' => 'Sam', 'emailSent' => true],
        ['id' => 'client2', 'name' => 'Dick', 'emailSent' => false],
    ];

    /**
     * @return Generator
     */
    public function getIterator()
    {
        foreach ($this->customers as $customer) {
            yield $customer['name'];
        }
    }
}

For this service a customer is a contact that has or has not received a promotional email. Note that here we used the OOP style to create the generator, and just as with the lambda function in the first service, we return a Generator instance. Imagine that we need a service that provides all known customers regardless of the department where the information was collected. To do this we just need to merge both generators, and PHP 7 will help us:

//let's instantiate both services
$buyersGenerator = (new BuyersProvider())->getBuyers();
$customersGenerator = (new PotentialCustomersProvider())->getIterator();

//here we merge them together
$getAllClientsFunc = (function () use ($buyersGenerator, $customersGenerator) {
    yield from $buyersGenerator;
    yield from $customersGenerator;
})();

//here we iterate over the merged collection
foreach ($getAllClientsFunc as $clientName) {
    echo $clientName;
}

The output will be the names of all known customers in this order: Bob, Alice, Sam, Dick (echo adds no separators, so the literal output is BobAliceSamDick). Notice that we didn't write any loops in the merged generator, which was necessary in previous versions of PHP:

$getAllClientsFunc = (function () use ($buyersGenerator, $customersGenerator) {
    foreach ($buyersGenerator as $b) {
        yield $b;
    }
    foreach ($customersGenerator as $c) {
        yield $c;
    }
})();
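
One thing to keep in mind when merging generators with yield from: the keys of the inner generators are preserved, so they can repeat. This does not matter in a foreach that ignores keys, but it does when the result is collected with iterator_to_array(). A short sketch, with hypothetical function names standing in for the two providers:

```php
<?php
// Hypothetical stand-ins for the two providers.
function buyers()
{
    yield 'Bob';
    yield 'Alice';
}

function prospects()
{
    yield 'Sam';
    yield 'Dick';
}

function allClients()
{
    yield from buyers();    // keys 0 and 1
    yield from prospects(); // keys 0 and 1 again
}

// With keys preserved (the default), later values overwrite earlier ones.
$withKeys = iterator_to_array(allClients());
// $withKeys is ['Sam', 'Dick']

// Passing false as the second argument discards the keys and keeps every value.
$withoutKeys = iterator_to_array(allClients(), false);
// $withoutKeys is ['Bob', 'Alice', 'Sam', 'Dick']
```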

Summary

PHP generators provide a slim, fast and functional way of implementing iterators. They can be used anywhere you would use an iterator, and they also provide a more controllable syntax for acquiring and releasing resources such as connections or file handles. For all their functional nature, generators can be integrated perfectly into the OOP paradigm just by using the Factory pattern. You can also affect the iteration logic by sending values into the generator, where they appear as the result of the yield expression. PHP 7 made it possible to receive feedback from a finished generator by calling getReturn(), which returns the value of the generator's return statement. PHP 7 also gave us a handy way to merge generators without the need to loop over them in the "parent" generator.