1 # JRuby 1.6.0 required
2 require "sequel"
3 require "jdbc/postgresql"
4
5 DB = Sequel.connect "jdbc:postgresql://127.0.0.1:5432/mydb"
6
7 table = DB[:mytable]
8 time = Benchmark.measure do
9 (1..10000).each do |n|
10 table.insert(:a => n, :b => 2*n)
11 end
12 end
13
14 puts time

How much time do you think this is going to take? This is the most inefficient way to insert many rows to a database. Remember each call to #insert will do a round trip to the database. Even if your round trip time is 1ms, you’ll still pay for 10 seconds of round trip time, time which your program could be doing something much more useful, such as generating revenue (somehow).

Instead, you should use the bulk copy feature of your database engine. In my case, that’s PostgreSQL. Since I’m using JRuby, I have to turn to the JDBC world, but that’s all right: everything has been implemented already, by someone, somewhere. I’ll refer you to the relevant pages:

And the relevant code would be:


1 # JRuby 1.6.0 required
2 require "sequel"
3 require "jdbc/postgresql"
4 require "java"
5
6 DB = Sequel.connect "jdbc:postgresql://127.0.0.1:5432/mydb"
7
8 time = Benchmark.measure do
9 DB.synchronize do |connection|
10 copy_manager = org.postgresql.copy.CopyManager.new(connection)
11 stream = copy_manager.copy_in("COPY mytable(a, b) FROM STDIN WITH CSV")
12
13 begin
14 (1..10000).each do |n|
15 # Don’t forget we’re streaming CSV data, thus each row/line MUST be terminated with a newline
16 row = "#{n},#{2*n}\n".to_java_bytes
17 stream.write_to_copy(row, 0, row.length)
18 end
19 rescue
20 stream.cancel_copy
21 raise
22 else
23 stream.end_copy
24 end
25 end
26 end
27
28 puts time

This will execute a single round trip to the database server: you’ll pay the latence cost only once.

On an unrelated note, this is the first time ever I use an else clause on a begin/rescue. If an exception is raised, we want to cancel the copy (the rescue clause), but on the other hand, if nothing is raised, we want to end the copy (the else clause). One or the other must happen, but not both.

If you’re curious what difference bulk copying makes, here are the benchmark results:

10000 INSERT statements
  7.012000   0.000000   7.012000 (  7.012000)

1 COPY FROM STDIN statement
  0.848000   0.000000   0.848000 (  0.848000)

The numbers speak for themselves: 8× faster. Not too shabby, and remember this ratio will simply increase as the number of rows increases.

It’s not that hard, but it still took me 2 hours to do it. I had a couple of false starts, and I pored over documentation for a while until I hit upon the excellent Nux library.

I won’t let you go through the same failures I had. Here’s the code:


1 # Demonstrates how to parse a local HTML document using XOM,
2 # TagSoup and Nux, under JRuby.
3 #
4 # http://www.xom.nu/
5 # http://home.ccil.org/~cowan/XML/tagsoup/
6 # http://acs.lbl.gov/nux/
7
8 include Java
9 mydir = File.expand_path(File.dirname(FILE))
10
11 # This is how you require libraries without touching your
12 # CLASSPATH from JRuby. I put the required files in vendor/.
13 # Nux includes it’s dependencies (XOM and saxon), so I didn’t
14 # have any other libraries to add.
15 require File.join(mydir, "vendor", "tagsoup.jar")
16 %w(nux.jar saxon8.jar xom.jar).each do |filename|
17 require File.join(mydir, "vendor", "nux", "lib", filename)
18 end
19
20 import "org.ccil.cowan.tagsoup.Parser"
21 import "nu.xom.Builder"
22
23 builder = Builder.new(Parser.new)
24
25 # XOM’s Builder expects a full URL, so tell it where to find the
26 # document.
27 doc = builder.build("file://#{File.expand_path(File.join(mydir, ARGV[0]))}")
28 puts doc.toXML

Extra! Add XPath querying

Continuing from above, you can add XPath querying:


1 import "nux.xom.xquery.XQueryUtil"
2
3 # Must use ‘*:p’. ‘*’ stands for any/default namespace.
4 results = XQueryUtil.xquery(doc, "//*:p")
5 p results.size
6 results.size.times do |index|
7 puts results.get(index).toXML
8 end

Why am I going through these motions? Because I wanted to use my 20% for fun. Besides, I need to process large quantities of HTML as quickly as possible for a cool project I’m working on, and JRuby seems to be the fastest implementation, according to my unscientific benchmark.

But the real reason was that both Nokogiri and Hpricot wouldn’t load/run under JRuby 1.2.0.

Actually, let me rephrase that: Nokogiri did load, but crashed while requiring the library:


1 $ jruby w test.rb data.html
2 /Users/francois/Library/Java/JRuby/jruby
1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml/node.rb:180: undefined method `next_sibling’ for class `Nokogiri::XML::Node (NameError)
3 from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml/node.rb:31:in `require

4 from /Users/francois/Library/Java/JRuby/current/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require’
5 from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml.rb:3
6 from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri/xml.rb:31:in
`
require
7 from /Users/francois/Library/Java/JRuby/current/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require

8 from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri.rb:10
9 from /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/nokogiri-1.2.3-java/lib/nokogiri.rb:36:in `require’
10 from /Users/francois/Library/Java/JRuby/current/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:36:in
`
require
11 from test.rb:2
12

I have reported this bug to the proper authorities.

Hpricot is another matter entirely. When I tried to use it earlier, I hit a roadblock because JRuby couldn’t install the native extensions. I tried again just now, and if you specify the version to be ~> 0.6.1, it works. Specify any other version, and you’re a sitting duck:


1 $ jruby S gem install -v ~> 0.6 hpricot
2 Building native extensions. This could take a while
3 ERROR: Error installing hpricot:
4 ERROR: Failed to build gem native extension.
5
6 /Users/francois/Library/Java/JRuby/current/bin/jruby extconf.rb install -v ~> 0.6 hpricot
7
8
9 Gem files will remain installed in /Users/francois/Library/Java/JRuby/jruby
1.2.0/lib/ruby/gems/1.8/gems/hpricot-0.8.1 for inspection.
10 Results logged to /Users/francois/Library/Java/JRuby/jruby-1.2.0/lib/ruby/gems/1.8/gems/hpricot-0.8.1/ext/hpricot_scan/gem_make.out
11
12 $ jruby S gem install -v ~> 0.6.1 hpricot
13 Successfully installed hpricot
0.6.164-java
14 1 gem installed
15 Installing ri documentation for hpricot-0.6.164-java…

Search

Your Host

A picture of me

I am François Beausoleil, a Ruby on Rails and Scala developer. During the day, I work on Seevibes, a platform to measure social interactions related to TV shows. At night, I am interested many things. Read my biography.

Top Tags

Books I read and recommend

Links

Projects I work on

Projects I worked on