It’s not that hard, but it still took me 2 hours to do it. I had a couple of false starts, and I pored over documentation for a while until I hit upon the excellent Nux library.
I won’t let you go through the same failures I had. Here’s the code:
1 # Demonstrates how to parse a local HTML document using XOM, 2 # TagSoup and Nux, under JRuby. 3 # 4 # http://www.xom.nu/ 5 # http://home.ccil.org/~cowan/XML/tagsoup/ 6 # http://acs.lbl.gov/nux/ 7 8 include Java 9 mydir = File.expand_path(File.dirname(__FILE__)) 10 11 # This is how you require libraries without touching your 12 # CLASSPATH from JRuby. I put the required files in vendor/. 13 # Nux includes it's dependencies (XOM and saxon), so I didn't 14 # have any other libraries to add. 15 require File.join(mydir, "vendor", "tagsoup.jar") 16 %w(nux.jar saxon8.jar xom.jar).each do |filename| 17 require File.join(mydir, "vendor", "nux", "lib", filename) 18 end 19 20 import "org.ccil.cowan.tagsoup.Parser" 21 import "nu.xom.Builder" 22 23 builder = Builder.new(Parser.new) 24 25 # XOM's Builder expects a full URL, so tell it where to find the 26 # document. 27 doc = builder.build("file://#{File.expand_path(File.join(mydir, ARGV[0]))}") 28 puts doc.toXML
Extra! Add XPath querying
Continuing from above, you can add XPath querying:
1 import "nux.xom.xquery.XQueryUtil" 2 3 # Must use '*:p'. '*' stands for any/default namespace. 4 results = XQueryUtil.xquery(doc, "//*:p") 5 p results.size 6 results.size.times do |index| 7 puts results.get(index).toXML 8 end
Why am I going through these motions? Because I wanted to use my 20% for fun. Besides, I need to process large quantities of HTML as quickly as possible for a cool project I’m working on, and JRuby seems to be the fastest implementation, according to my unscientific benchmark.
But the real reason was that both Nokogiri and Hpricot wouldn’t load/run under JRuby 1.2.0.
Actually, let me rephrase that: Nokogiri did load, but crashed while requiring the library:
1 $ jruby -w test.rb data.html 2 /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml/node.rb:180: undefined method `next_sibling' for class `Nokogiri::XML::Node' (NameError) 3 from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml/node.rb:31:in `require' 4 from /Users/francois/Library/Java/JRuby/current/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require' 5 from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml.rb:3 6 from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml.rb:31:in `require' 7 from /Users/francois/Library/Java/JRuby/current/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require' 8 from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri.rb:10 9 from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri.rb:36:in `require' 10 from /Users/francois/Library/Java/JRuby/current/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:36:in `require' 11 from test.rb:2 12
I have reported this bug to the proper authorities.
Hpricot is another matter entirely. When I tried to use it earlier, I hit a roadblock because JRuby couldn’t install the native extensions. I tried again just now, and if you specify the version to be ~> 0.6.1, it works. Specify any other version, and you’re a sitting duck:
1 $ jruby -S gem install -v '~> 0.6' hpricot 2 Building native extensions. This could take a while... 3 ERROR: Error installing hpricot: 4 ERROR: Failed to build gem native extension. 5 6 /Users/francois/Library/Java/JRuby/current/bin/jruby extconf.rb install -v ~> 0.6 hpricot 7 8 9 Gem files will remain installed in /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/hpricot-0.8.1 for inspection. 10 Results logged to /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/hpricot-0.8.1/ext/hpricot_scan/gem_make.out 11 12 $ jruby -S gem install -v '~> 0.6.1' hpricot 13 Successfully installed hpricot-0.6.164-java 14 1 gem installed 15 Installing ri documentation for hpricot-0.6.164-java...