JRuby, PostgreSQL and Bulk Copy Operations
1 # JRuby 1.6.0 required
2 require "sequel"
3 require "jdbc/postgresql"
4
5 DB = Sequel.connect "jdbc:postgresql://127.0.0.1:5432/mydb"
6
7 table = DB[:mytable]
8 time = Benchmark.measure do
9 (1..10000).each do |n|
10 table.insert(:a => n, :b => 2*n)
11 end
12 end
13
14 puts time
How much time do you think this is going to take? This is the most inefficient way to insert many rows to a database. Remember each call to #insert will do a round trip to the database. Even if your round trip time is 1ms, you’ll still pay for 10 seconds of round trip time, time which your program could be doing something much more useful, such as generating revenue (somehow).
Instead, you should use the bulk copy feature of your database engine. In my case, that’s PostgreSQL. Since I’m using JRuby, I have to turn to the JDBC world, but that’s all right: everything has been implemented already, by someone, somewhere. I’ll refer you to the relevant pages:
And the relevant code would be:
1 # JRuby 1.6.0 required
2 require "sequel"
3 require "jdbc/postgresql"
4 require "java"
5
6 DB = Sequel.connect "jdbc:postgresql://127.0.0.1:5432/mydb"
7
8 time = Benchmark.measure do
9 DB.synchronize do |connection|
10 copy_manager = org.postgresql.copy.CopyManager.new(connection)
11 stream = copy_manager.copy_in("COPY mytable(a, b) FROM STDIN WITH CSV")
12
13 begin
14 (1..10000).each do |n|
15 # Don’t forget we’re streaming CSV data, thus each row/line MUST be terminated with a newline
16 row = "#{n},#{2*n}\n".to_java_bytes
17 stream.write_to_copy(row, 0, row.length)
18 end
19 rescue
20 stream.cancel_copy
21 raise
22 else
23 stream.end_copy
24 end
25 end
26 end
27
28 puts time
This will execute a single round trip to the database server: you’ll pay the latence cost only once.
On an unrelated note, this is the first time ever I use an else clause on a begin/rescue. If an exception is raised, we want to cancel the copy (the rescue clause), but on the other hand, if nothing is raised, we want to end the copy (the else clause). One or the other must happen, but not both.
If you’re curious what difference bulk copying makes, here are the benchmark results:
10000 INSERT statements
7.012000 0.000000 7.012000 ( 7.012000)
1 COPY FROM STDIN statement
0.848000 0.000000 0.848000 ( 0.848000)
The numbers speak for themselves: 8× faster. Not too shabby, and remember this ratio will simply increase as the number of rows increases.