Tidying up your HTML with PHP

Bye bye, regular expressions!

Thanks to the tidy parser, you no longer have to rely on regular expressions to extract content from another HTML document. Instead, you can walk the parsed node tree, which is a much more maintainable solution!

In the following example (found in the ext/tidy/examples/ directory), I use the tidy parser to extract all of the links from an arbitrary HTML document:

Grabbing URLs from an HTML document


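<?php
    // Recursively walk the tidy node tree and collect the href of every <a> tag.
    // The node class is named tidyNode in released versions of the extension.
    function dump_nodes(tidyNode $node, &$urls = NULL) {

        $urls = (is_array($urls)) ? $urls : array();

        if(isset($node->id)) {
            // Skip anchors without an href, such as <a name="...">
            if($node->id == TIDY_TAG_A && isset($node->attribute['href'])) {
                $urls[] = $node->attribute['href'];
            }
        }

        if($node->hasChildren()) {

            foreach($node->child as $c) {

                dump_nodes($c, $urls);

            }

        }

        return $urls;
    }

    // The HTML file to scan is given as the first command-line argument
    $a = tidy_parse_file($_SERVER['argv'][1]);
    if($a === false) {
        die("Unable to parse " . $_SERVER['argv'][1] . "\n");
    }
    $a->cleanRepair();
    print_r(dump_nodes($a->html()));
?>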
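
To try the same idea without a file on disk, here is a minimal sketch that parses an in-memory string with tidy_parse_string() and starts the walk at the body node. The helper name collect_links() and the sample markup are my own assumptions for illustration; they are not part of the tidy extension or its examples directory.

```php
<?php
// collect_links() is a hypothetical helper, not part of the tidy extension.
// It gathers the href of every <a> element beneath $node.
function collect_links(tidyNode $node, array &$urls = array()) {
    // Anchors without an href (for example <a name="top">) are skipped.
    if (isset($node->id) && $node->id == TIDY_TAG_A
            && isset($node->attribute['href'])) {
        $urls[] = $node->attribute['href'];
    }

    if ($node->hasChildren()) {
        foreach ($node->child as $c) {
            collect_links($c, $urls);
        }
    }

    return $urls;
}

// Sample markup, made up for this sketch.
$html = '<html><head><title>Demo</title></head><body>'
      . '<p><a href="http://www.php.net/">PHP</a> and '
      . '<a name="top">an anchor without href</a></p>'
      . '</body></html>';

$doc = tidy_parse_string($html);
$doc->cleanRepair();

// Start at <body> rather than <html>; links only appear there.
$links = collect_links($doc->body());
print_r($links);
?>
```

Starting the walk at body() instead of html() is only a small shortcut; the recursion itself is unchanged from the example above.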