Chapter 3, Part 16: Parsing HTML with hpricot

From p.49: parsing HTML.
The book uses Beautiful Soup in Python to parse HTML, but since I am working in Ruby, I use Hpricot instead.

I already used hpricot early in Chapter 3 to strip HTML tags, but this time I go further and pull out what is inside the tags.

Chapter 3, Part 2 - 橋本詳解
pylori*style wiki - HTML parser Hpricot
AnHpricotShowcase on Hpricot

Installation

gem install hpricot

The code below does the same thing as the p.49 example that reads the links from http://kiwitobes.com/wiki/Programming_language.html.
hpricot-test.rb

#!/opt/local/bin/ruby

require 'rubygems'
require 'kconv'
require 'open-uri'
require 'hpricot'

url = 'http://kiwitobes.com/wiki/Programming_language.html'
page = open(url).read().toutf8 # fetch the page and convert it to UTF-8
doc = Hpricot(page)

links = doc/:a # search for all a tags
puts links[10]
puts links[10][:href]
puts links[10].inner_html

Result

<a href="/wiki/Algorithm.html" title="Algorithm">algorithms</a>
/wiki/Algorithm.html
algorithms
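
The href printed above is a site-relative path, so it cannot be fetched as it stands. Here is a minimal sketch, assuming the next step is to follow the link, that resolves it against the page URL with URI.join from Ruby's standard library. The variable names base, href and absolute are my own for illustration and do not come from the book.

```ruby
require 'uri'

# URL of the page that was parsed in hpricot-test.rb
base = 'http://kiwitobes.com/wiki/Programming_language.html'
# Value printed by links[10][:href] in the result above
href = '/wiki/Algorithm.html'

# URI.join resolves href relative to base and returns a URI object
absolute = URI.join(base, href).to_s
puts absolute
```

This should print http://kiwitobes.com/wiki/Algorithm.html. Because URI.join follows the usual relative-reference rules, an href that is already absolute would be returned unchanged.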