How Can I Parse Only Part Of An Html File And Ignore The Rest?
In each of 5,000 HTML files I have to get only one line of text, which is line 999. How can I tell the HTML::Parser that I only have to get line 999?
Solution 1:
Do you mean the 999th line or the 999th table row?
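The former might be
perl -ne 'print if $. == 999; close ARGV if eof' /path/to/*.dat
The "close ARGV if eof" part resets the line counter $. at the end of each file. Without it, $. keeps counting across all the files and only line 999 of the combined input is printed.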
The latter would involve an HTML parser and some selection logic. A SAX-style (event-based) parser might be better for fast processing of a large number of files. Which parser fits best probably depends on which version of HTML is used and whether it is well-formed.
Perl has many XML and HTML parsers - did you have any particular module in mind?
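For the line-number reading of the question, the one-liner can also be written as a small script, which may be easier to extend. This is only a sketch: the helper name nth_line and the glob pattern are made up for illustration, and it assumes "line 999" means the 999th physical line of each file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return line $n (1-based) of $path, or undef if the
# file has fewer than $n lines. Reading stops as soon as the line is found.
sub nth_line {
    my ($path, $n) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $found;
    while (my $line = <$fh>) {
        # $. holds the line number of the handle most recently read from
        if ($. == $n) {
            $found = $line;
            last;
        }
    }
    close $fh;
    return $found;
}

# Assumed location of the files; adjust the pattern to the real directory.
for my $path (glob '/path/to/*.dat') {
    my $line = nth_line($path, 999);
    print "$path: $line" if defined $line;
}
```

Because each file gets its own filehandle, the line counter starts fresh for every file, and the early "last" avoids reading the rest of a file once the wanted line has been seen.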
EDIT:
Your problem seems to be your XPath expression. The actual HTML is much more complex than your XPath suggests. The expression in the following script works better:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

## replace this with a loop over 5000 existing files
my $url = 'http://www.kultusportal-bw.de/'.
          'servlet/PB/menu/1188427/index.html'.
          '?COMPLETEHREF='.
          'http://www.kultus-bw.de/'.
          'did_abfrage/detail.php?id=04313488';
my $html = get $url;
my $tree = HTML::TreeBuilder::XPath->new();

## within the loop process the html like this
$tree->parse($html);
$tree->eof;
print $tree->findvalue('//table[@bgcolor]/tr[1]');
Try copying the script above into a file and running it with Perl.
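The loop over the existing files that the comments in the script mention could look roughly like the following. It is a sketch under assumptions: the helper name first_row_text and the glob pattern are invented for illustration, and it reads local files with parse_file instead of fetching a URL. The XPath expression is the same one as above, so findvalue returns the text of the first row of every table that has a bgcolor attribute.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: parse one local HTML file and return the text that
# the XPath expression selects from it.
sub first_row_text {
    my ($path) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file($path) or die "Cannot parse $path: $!";
    my $text = $tree->findvalue('//table[@bgcolor]/tr[1]');
    $tree->delete;    # free the tree before moving on to the next file
    return $text;
}

# Assumed location of the 5000 files; adjust the pattern as needed.
for my $path (glob '/path/to/*.html') {
    print "$path: ", first_row_text($path), "\n";
}
```

Building a fresh tree for each file and deleting it afterwards keeps memory use flat across a long run, which matters more with thousands of files than with a single page.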