It’s not that hard, but it still took me two hours. I had a couple of false starts and pored over documentation for a while before I hit upon the excellent Nux library.

I’ll spare you the failures I went through. Here’s the code:

    # Demonstrates how to parse a local HTML document using XOM,
    # TagSoup and Nux, under JRuby.
    #
    # http://www.xom.nu/
    # http://home.ccil.org/~cowan/XML/tagsoup/
    # http://acs.lbl.gov/nux/

    include Java
    mydir = File.expand_path(File.dirname(__FILE__))

    # This is how you require libraries without touching your
    # CLASSPATH from JRuby. I put the required files in vendor/.
    # Nux includes its dependencies (XOM and Saxon), so I didn't
    # have any other libraries to add.
    require File.join(mydir, "vendor", "tagsoup.jar")
    %w(nux.jar saxon8.jar xom.jar).each do |filename|
      require File.join(mydir, "vendor", "nux", "lib", filename)
    end

    import "org.ccil.cowan.tagsoup.Parser"
    import "nu.xom.Builder"

    # Hand XOM's Builder a TagSoup parser so it tolerates malformed HTML.
    builder = Builder.new(Parser.new)

    # XOM's Builder expects a full URL, so tell it where to find the
    # document.
    doc = builder.build("file://#{File.expand_path(File.join(mydir, ARGV[0]))}")
    puts doc.toXML
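
One thing the `file://` interpolation above glosses over is paths containing spaces or other special characters, which would go into the URL unencoded. Here is a minimal sketch, in plain Ruby, of building that URL a bit more carefully. `file_url` is a hypothetical helper of mine, not part of XOM, TagSoup or Nux, and it assumes Unix-style paths.

```ruby
require "uri"

# Hypothetical helper: turn a possibly relative path into an absolute
# file:// URL string, percent-encoding each path segment.
# Assumes Unix-style paths that start with "/".
def file_url(path, base_dir = Dir.pwd)
  absolute = File.expand_path(path, base_dir)
  encoded = absolute.split("/").map do |segment|
    # Form encoding turns spaces into "+", so swap those for "%20".
    # A literal "+" in a file name has already become "%2B" by then.
    URI.encode_www_form_component(segment).gsub("+", "%20")
  end
  "file://#{encoded.join('/')}"
end

puts file_url("data.html", "/tmp/my docs")
```

In the script above, the result of `file_url(ARGV[0], mydir)` could then be passed to `builder.build` in place of the interpolated string.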

Extra! Add XPath querying

Continuing from above, you can add XPath querying:

    import "nux.xom.xquery.XQueryUtil"

    # Must use '*:p'. TagSoup puts elements in the XHTML namespace, so
    # a bare 'p' matches nothing; '*' matches any namespace.
    results = XQueryUtil.xquery(doc, "//*:p")
    p results.size
    results.size.times do |index|
      puts results.get(index).toXML
    end
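
The `size`/`get` loop above is all the result object has to support, so it can be wrapped in an Enumerator to get `each`, `map` and friends. This is only a sketch: `each_node` and `FakeNodes` are names I made up, and `FakeNodes` stands in for the Java result object so the example runs without any jars.

```ruby
# Hypothetical helper: wrap anything that responds to size and get(index)
# in an Enumerator, so the usual Enumerable methods become available.
def each_node(nodes)
  Enumerator.new do |yielder|
    nodes.size.times { |index| yielder << nodes.get(index) }
  end
end

# Stand-in for the Java result object (an assumption for this sketch),
# exposing only the two methods the loop relies on.
class FakeNodes
  def initialize(items)
    @items = items
  end

  def size
    @items.size
  end

  def get(index)
    @items[index]
  end
end

fake = FakeNodes.new(["<p>one</p>", "<p>two</p>"])
each_node(fake).each { |node| puts node }
puts each_node(fake).map(&:length).inspect
```

With the real results, the loop would shrink to something like `each_node(results).each { |node| puts node.toXML }`.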

Why am I going through these motions? Because I wanted to use my 20% time for fun. Besides, I need to process large quantities of HTML as quickly as possible for a cool project I’m working on, and JRuby seems to be the fastest Ruby implementation, according to my unscientific benchmark.

But the real reason was that both Nokogiri and Hpricot wouldn’t load/run under JRuby 1.2.0.

Actually, let me rephrase that: Nokogiri did install, but crashed as soon as I required it:

    $ jruby -w test.rb data.html
    /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml/node.rb:180: undefined method `next_sibling' for class `Nokogiri::XML::Node' (NameError)
            from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml/node.rb:31:in `require'
            from /Users/francois/Library/Java/JRuby/current/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
            from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml.rb:3
            from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml.rb:31:in `require'
            from /Users/francois/Library/Java/JRuby/current/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
            from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri.rb:10
            from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri.rb:36:in `require'
            from /Users/francois/Library/Java/JRuby/current/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:36:in `require'
            from test.rb:2

I have reported this bug to the proper authorities.

Hpricot is another matter entirely. When I tried to use it earlier, I hit a roadblock because JRuby couldn’t build the native extensions. I tried again just now, and if you pin the version to ~> 0.6.1, it works. Specify any other version, and you’re out of luck:

    $ jruby -S gem install -v '~> 0.6' hpricot
    Building native extensions.  This could take a while...
    ERROR:  Error installing hpricot:
            ERROR: Failed to build gem native extension.

    /Users/francois/Library/Java/JRuby/current/bin/jruby extconf.rb install -v ~> 0.6 hpricot


    Gem files will remain installed in /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/hpricot-0.8.1 for inspection.
    Results logged to /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/hpricot-0.8.1/ext/hpricot_scan/gem_make.out

    $ jruby -S gem install -v '~> 0.6.1' hpricot
    Successfully installed hpricot-0.6.164-java
    1 gem installed
    Installing ri documentation for hpricot-0.6.164-java...
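
The difference between the two installs seems to come down to how RubyGems reads the `~>` operator. Here is a quick sketch using RubyGems' own `Gem::Requirement` and `Gem::Version` classes, with the two version numbers taken from the output above. The variable names `loose` and `strict` are mine.

```ruby
require "rubygems"

# "~> 0.6" means >= 0.6 and < 1.0; "~> 0.6.1" means >= 0.6.1 and < 0.7.
loose  = Gem::Requirement.new("~> 0.6")
strict = Gem::Requirement.new("~> 0.6.1")

# The two hpricot versions that show up in the install output.
%w[0.6.164 0.8.1].each do |number|
  version = Gem::Version.new(number)
  puts "#{number}: '~> 0.6' #{loose.satisfied_by?(version)}, " \
       "'~> 0.6.1' #{strict.satisfied_by?(version)}"
end
```

`~> 0.6` accepts both versions, so the installer went for hpricot-0.8.1, the one whose native extension failed to build here. `~> 0.6.1` stops short of 0.7 and lands on 0.6.164.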