Bye bye, regular expressions!
Thanks to the tidy parser, you no longer have to rely on regular expressions
to extract content from an HTML document. Instead, you can walk the parsed
document tree, which is a much more maintainable solution!
In the following example (found in the ext/tidy/examples/ directory), I use the
tidy parser to extract all of the links from an arbitrary HTML document:
Grabbing URLs from an HTML document
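<?php
// Recursively walk the tidy tree and collect the href of every <a> element.
function dump_nodes(tidyNode $node, &$urls = NULL) {
    $urls = (is_array($urls)) ? $urls : array();

    if (isset($node->id) && $node->id == TIDY_TAG_A) {
        // Skip anchors that have no href (for example <a name="...">).
        if (isset($node->attribute['href'])) {
            $urls[] = $node->attribute['href'];
        }
    }

    if ($node->hasChildren()) {
        foreach ($node->child as $c) {
            dump_nodes($c, $urls);
        }
    }

    return $urls;
}

// The first command-line argument names the HTML file to parse.
$a = tidy_parse_file($_SERVER['argv'][1]);
$a->cleanRepair();
print_r(dump_nodes($a->html()));
?>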
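
The same kind of traversal can be run on markup that is already in memory rather than in a file. The following is only a minimal sketch, assuming the tidy extension is loaded; the helper name collect_links and the sample markup are hypothetical and are not part of ext/tidy or its examples.

```php
<?php
// Hypothetical helper (not part of ext/tidy): gather href values from a tidy tree.
function collect_links(tidyNode $node, array &$urls = array()) {
    // $node->attribute is NULL for elements without attributes, so isset() guards both cases.
    if ($node->id == TIDY_TAG_A && isset($node->attribute['href'])) {
        $urls[] = $node->attribute['href'];
    }
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            collect_links($child, $urls);
        }
    }
    return $urls;
}

// Sample markup (made up for illustration): one real link and one anchor without href.
$html = '<html><body><a href="http://example.com/">one</a>'
      . '<a name="x">no href</a></body></html>';

$doc = tidy_parse_string($html, array(), 'utf8');
$doc->cleanRepair();

$links = collect_links($doc->html());
print_r($links);
```

Starting from $doc->html() mirrors the file-based listing; only the parsing call differs (tidy_parse_string instead of tidy_parse_file).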