# JRuby 1.6.0 required
require "benchmark"
require "sequel"
require "jdbc/postgresql"

DB = Sequel.connect "jdbc:postgresql://127.0.0.1:5432/mydb"

table = DB[:mytable]
time = Benchmark.measure do
  (1..10000).each do |n|
    table.insert(:a => n, :b => 2*n)
  end
end

puts time

How much time do you think this is going to take? This is the most inefficient way to insert many rows into a database. Remember that each call to #insert does a round trip to the database. Even if your round trip time is only 1 ms, 10,000 inserts still cost 10 seconds of round trip time, time during which your program could be doing something much more useful, such as generating revenue (somehow).

Instead, you should use the bulk copy feature of your database engine. In my case, that's PostgreSQL. Since I'm using JRuby, I have to turn to the JDBC world, but that's all right: everything has already been implemented by someone, somewhere. The relevant classes live in the PostgreSQL JDBC driver, namely CopyManager and CopyIn.

And the relevant code would be:


# JRuby 1.6.0 required
require "benchmark"
require "sequel"
require "jdbc/postgresql"
require "java"

DB = Sequel.connect "jdbc:postgresql://127.0.0.1:5432/mydb"

time = Benchmark.measure do
  DB.synchronize do |connection|
    copy_manager = org.postgresql.copy.CopyManager.new(connection)
    stream = copy_manager.copy_in("COPY mytable(a, b) FROM STDIN WITH CSV")

    begin
      (1..10000).each do |n|
        # Don't forget we're streaming CSV data, thus each row/line MUST be terminated with a newline
        row = "#{n},#{2*n}\n".to_java_bytes
        stream.write_to_copy(row, 0, row.length)
      end
    rescue
      stream.cancel_copy
      raise
    else
      stream.end_copy
    end
  end
end

puts time

This will execute a single round trip to the database server: you'll pay the latency cost only once.

On an unrelated note, this is the first time I have ever used an else clause on a begin/rescue. If an exception is raised, we want to cancel the copy (the rescue clause); if nothing is raised, we want to end the copy (the else clause). One or the other must happen, but not both.
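If you haven't seen that construct before, here's a minimal, generic sketch of how the clauses relate (the method names are placeholders):

begin
  do_work            # the code that may raise
rescue => e
  handle_failure(e)  # runs only if do_work raised
  raise              # re-raise so callers still see the failure
else
  handle_success     # runs only if nothing was raised
ensure
  always_cleanup     # runs in both cases (not needed in the COPY example above)
end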

If you’re curious what difference bulk copying makes, here are the benchmark results:

10000 INSERT statements
  7.012000   0.000000   7.012000 (  7.012000)

1 COPY FROM STDIN statement
  0.848000   0.000000   0.848000 (  0.848000)

The numbers speak for themselves: roughly 8× faster. Not too shabby, and remember that this ratio will only grow as the number of rows increases.

If you’re like me, you installed PostgreSQL (or MySQL or whatever) locally for development purposes. Then you copied data into your instance, and along you went.

Eventually, you wrote a migration that touched nearly all rows in your database, and you waited, and waited, and waited some more (grab some coffee!) for the migration to finish.

In my case, I had a migration using pure Ruby which took 30 minutes. The migration ran along the lines of:


def self.up
  add_column :segments, :pathname, :string

  # There are 200,000 segments in the DB
  Segment.find_each do |segment|
    segment.pathname = segment.full_pathname
    segment.save!
  end

  change_column :segments, :pathname, :string, :null => false
end

class Segment < ActiveRecord::Base
  belongs_to :parent, :class_name => "Segment"

  def full_pathname
    full_name = []
    m = self
    while m
      full_name << m.name
      m = m.parent
    end

    full_name.reverse.join(" / ")
  end
end

The intent was to cache each segment's pathname in the record itself (think of a directory tree). Since we had 200,000 segments, and each of those walked up its parent chain, the migration took around 30 minutes to run. I rewrote it to use a CTE (common table expression), which PostgreSQL supports out of the box, and the migration then ran in 17 minutes or so. I warned my teammates about this migration, and let it be. Eventually, a colleague took my branch and ran the migration. When he reported that it took 30 seconds for him, at first I thought his SSD was the major difference between our machines. It turns out it wasn't…


# PostgreSQL installed through Homebrew
# /usr/local/var/postgres/postgresql.conf
# Your file location will vary
# …
shared_buffers = 2400kB
# …

Changing that single line from 2400kB to 96MB (and restarting PostgreSQL, since shared_buffers is only read at startup) let me run the migration in 30 seconds as well… I was very impressed by my own stupidity. Let it be a lesson learned!
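As for the CTE itself: the original SQL isn't reproduced here, but a recursive query along these lines could replace the Ruby loop and compute every pathname in a single statement. This is only a sketch, assuming the segments table has id, parent_id and name columns, as the model above suggests:

def self.up
  add_column :segments, :pathname, :string

  # Root segments start the recursion; each child then appends its own name
  # to its parent's pathname.
  execute <<-SQL
    WITH RECURSIVE tree (id, pathname) AS (
      SELECT id, name::text
        FROM segments
       WHERE parent_id IS NULL
      UNION ALL
      SELECT segments.id, tree.pathname || ' / ' || segments.name
        FROM segments
        JOIN tree ON segments.parent_id = tree.id
    )
    UPDATE segments
       SET pathname = tree.pathname
      FROM tree
     WHERE segments.id = tree.id
  SQL

  change_column :segments, :pathname, :string, :null => false
end

Note that attaching a WITH clause directly to an UPDATE like this requires PostgreSQL 9.1 or newer; on older servers you'd fold the recursive query into a subquery instead.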

What happens when you mix foreign key constraints, PostgreSQL, ActiveRecord and cached counters (search for :counter_cache)? Deadlocks.

My colleague, Mina Naguib, spent some time away from the office digging into and diagnosing some problems. The underlying issue is probably not related to PostgreSQL itself, but cached counters, which Rails makes very easy to set up, sometimes result in deadlocks. We had seen those backtraces in Hoptoad, but had never really investigated them: they were rare, meaning nobody complained about them. They were just a thorn in our side.
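For reference, a cached counter is a one-line declaration in ActiveRecord, something along these lines (hypothetical Parent/Child models, not our actual schema):

class Parent < ActiveRecord::Base
  has_many :children
end

class Child < ActiveRecord::Base
  # Rails keeps parents.children_count up to date automatically: every
  # create/destroy of a Child also issues an UPDATE on its parent row.
  belongs_to :parent, :counter_cache => true
end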

In a nutshell, this flow is problematic:

  1. Adding or modifying a record in a table, when:
  2. That table is a child (using a foreign key constraint) of another table
  3. and then modifying the parent record in the parent table
  4. Using transactions
  5. Concurrently
  6. Sometimes (race condition)

From Mina's article "PostgreSQL transactions wrapping child+parent modifications, deadlocks, and ActiveRecord"

Read the full text: it’s well worth it. Unfortunately, Mina doesn’t have a resolution, but he wants more people to be aware of the issue, in the hopes that somebody will find a clean solution.
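To make the flow above concrete, here is a hypothetical sketch using the Parent/Child models from earlier; parent is the same Parent record in both processes, and this is an illustration of the pattern, not Mina's actual test case. Two processes running this block concurrently can deadlock:

Parent.transaction do
  # The INSERT into children makes PostgreSQL take a shared lock on the
  # referenced parents row (the foreign key target); both transactions can
  # hold that shared lock at the same time.
  #
  # The :counter_cache then issues an UPDATE on that same parents row, which
  # needs an exclusive lock. If the other transaction already holds its own
  # shared lock from its own INSERT, each UPDATE ends up waiting on the
  # other's lock: deadlock.
  parent.children.create!
end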
