Crash resistance in Ruby/GTK+ applications

Recently I’ve been thinking more about exception handling in Luz in an effort to make it as crash-resistant as possible. (Crashes are just not acceptable in an application that’s used for live performances!)

The challenge in good exception handling is always finding the right granularity:

  • if you wrap code at too high a level (such as the main loop), there are limited options available for recovery: when you get an Exception, the internal program state is more or less unknown, and simply ‘trying again’ is likely to result in the exact same Exception.
  • if you wrap code at too low a level, you are better able to recover from problems, but you end up writing way too much exception handling code, and turning one line of code into 5+ lines of code with exception handling is just an ugly solution.

So the goal was to add exception handling at the right level, without cluttering up the code (and adding more code paths to test!), while still achieving adequate coverage.

It pays to think about what types of code are involved, and how best to protect each one.

Luz essentially has three types of code:

  • The core engine
  • Plugins
  • GUI signal handling

The core engine handles startup, loops once per frame (calling into plugins), and handles shutdown. This code is relatively short and straightforward, so any problems here will most likely be discovered with even the most casual of testing. (Further, startup and shutdown problems don’t happen in the middle of a performance.)

Plugins can be neatly wrapped by putting exception handling anywhere the engine calls into a plugin. To simplify this, Luz uses the following helper method:

def user_object_try(obj)
	begin
		yield unless obj.crashy?
	rescue Exception => e
		obj.crashy = true
		report_user_object_exception(obj, e)
	end
end

(UserObject is the base class of all user-visible objects: Actors, Actor Effects, Directors, Director Effects, Curves, and Themes.)

An example usage:

actors.each { |actor|
	user_object_try(actor) { actor.render }
}

Anything done in the yield block is monitored for exceptions, and any exceptions that do occur are blamed on the given object (even if the exception was raised inside a call into the core, it is still the object that is the troublemaker).

Notice that when an object raises an exception, we assume it’s broken and will continue to malfunction, so we mark it as “crashy” and stop using it. (When a developer reloads a plugin, we remove the “crashy” flag, giving it another shot at life.)
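To make the “crashy” lifecycle concrete, here is a minimal sketch that runs without Luz. FakeUserObject, its reload! method, and the REPORTS array are stand-ins invented for this illustration; only user_object_try follows the helper shown earlier.

```ruby
# Hypothetical stand-in for a Luz UserObject; only the crashy flag is modeled.
class FakeUserObject
  attr_reader :name, :render_count
  attr_writer :crashy

  def initialize(name, broken: false)
    @name = name
    @broken = broken
    @crashy = false
    @render_count = 0
  end

  def crashy?
    @crashy
  end

  # A broken object raises on every call, like a plugin with a typo in it.
  def render
    raise NoMethodError, "simulated bug in #{@name}" if @broken
    @render_count += 1
  end

  # Stand-in for a plugin reload: assume the bug was fixed, clear the flag.
  def reload!
    @broken = false
    @crashy = false
  end
end

# Stand-in for whatever the real application does to notify the user.
REPORTS = []

def report_user_object_exception(obj, e)
  REPORTS << "#{obj.name}: #{e.class}"
end

# Same logic as the helper in the article.
def user_object_try(obj)
  yield unless obj.crashy?
rescue Exception => e
  obj.crashy = true
  report_user_object_exception(obj, e)
end

good = FakeUserObject.new("good")
bad  = FakeUserObject.new("bad", broken: true)
actors = [good, bad]

# Three "frames": the bad actor fails once, then is skipped.
3.times { actors.each { |actor| user_object_try(actor) { actor.render } } }
puts "reports after 3 frames: #{REPORTS.size}"

# After a reload the object gets another chance.
bad.reload!
actors.each { |actor| user_object_try(actor) { actor.render } }
puts "bad actor renders after reload: #{bad.render_count}"
```

The point of the flag is that a broken plugin costs one exception and one report, not one per frame.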

The third source of potential Exceptions is in the GUI signal handlers. These sorts of bugs bite all the time during development, most often in the form of missing methods or variables. And they are hard to find, requiring human or automated testing of every single menu option, button click, slider drag, etc. in every possible program state.

Wouldn’t it be nice if we could prevent all user actions from crashing the application?

We can. The solution that Luz uses is to wrap all GUI signal handlers at the source, in the signal_connect method:

class GLib::Object
	alias :signal_connect_without_exception_handling :signal_connect

	def signal_connect(signal_name)
		signal_connect_without_exception_handling(signal_name) { |*args|
			begin
				yield *args
			rescue Exception => e
				# tell the user that their command failed...
			end
		}
	end	
end

Now, if a signal handler fails, it acts as if the signal handler didn’t exist, and we can tell the user what happened. This is about the best we can do, given a broken signal handler.
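The same alias-and-wrap pattern can be tried without GTK installed. In this sketch, FakeWidget, its emit method, and the FAILED_COMMANDS array are invented for illustration and are not part of GLib or Luz; the handler is also captured as an explicit &handler block rather than reached through yield, which is a stylistic choice rather than a requirement.

```ruby
# Hypothetical stand-in for GLib::Object, so the sketch runs without GTK.
class FakeWidget
  def initialize
    @handlers = Hash.new { |hash, key| hash[key] = [] }
  end

  # Original, unprotected version: just remember the handler block.
  def signal_connect(signal_name, &handler)
    @handlers[signal_name] << handler
  end

  # Call every handler for the signal and collect their return values.
  def emit(signal_name, *args)
    @handlers[signal_name].map { |handler| handler.call(*args) }
  end
end

# Stand-in for "tell the user that their command failed".
FAILED_COMMANDS = []

# Reopen the class and wrap signal_connect, as the article does for GLib::Object.
class FakeWidget
  alias_method :signal_connect_without_exception_handling, :signal_connect

  def signal_connect(signal_name, &handler)
    signal_connect_without_exception_handling(signal_name) do |*args|
      begin
        handler.call(*args)
      rescue Exception => e
        FAILED_COMMANDS << "#{signal_name}: #{e.message}"
        nil # a failed handler behaves as if it returned nothing
      end
    end
  end
end

button = FakeWidget.new
button.signal_connect("clicked") { |label| label.upcase }
button.signal_connect("clicked") { |label| label.no_such_method } # buggy handler
results = button.emit("clicked", "play")

puts results.inspect
puts "failed commands: #{FAILED_COMMANDS.size}"
```

Because the wrapper is installed where handlers are registered, the calling code that connects signals does not change at all.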

The code above is enough to protect all manually added signal handlers. What’s left are the signal handlers hooked up automatically by Glade.

The Glade library asks for a method object for each signal. To satisfy the request, for each handler, we create a new method that calls the original handler wrapped in exception handling, and return that new method:

glade = GladeXML.new(glade_file_name, root_widget_name) { | handler_name |
	# Create a new method to wrap the actual signal handler, with added exception handling
	# This prevents user actions from crashing the application.
	self.class.class_eval <<-end_class_eval
		def #{handler_name}_with_exception_handling(*args)
			begin
				if method(:#{handler_name}).arity == 0
					self.send(:#{handler_name})
				else
					self.send(:#{handler_name}, *args)
				end
			rescue Exception => e
				puts "Glade signal handler '#{handler_name}' caused exception:\\n"
				puts e.report
			end
		end
	end_class_eval

	# return our new method
	method("#{handler_name}_with_exception_handling")
}

(There may be a cleaner way to write this. Feel free to let me know!)
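One possibly cleaner variant builds the wrapper with define_method and a closure instead of evaluating a string. This is only a sketch: FakeWindow, its three handlers, the @log array, and the wrapped_handler helper are all invented for illustration, and the Glade call itself is left out so the code runs without libglade.

```ruby
# Hypothetical window class with a few Glade-style signal handlers.
class FakeWindow
  attr_reader :log

  def initialize
    @log = []
  end

  # Handler that takes no arguments.
  def on_quit_activate
    @log << "quit"
  end

  # Handler that takes the arguments the signal passes along.
  def on_slider_changed(widget, value)
    @log << "slider=#{value}"
  end

  # Handler with a bug: undefined_helper does not exist.
  def on_broken_clicked(widget)
    undefined_helper(widget)
  end

  # Define (once per handler name) a wrapper method that calls the real
  # handler inside a rescue, then return it as a Method object.
  def wrapped_handler(handler_name)
    wrapper_name = "#{handler_name}_with_exception_handling"
    unless self.class.method_defined?(wrapper_name)
      self.class.send(:define_method, wrapper_name) do |*args|
        begin
          if method(handler_name).arity == 0
            send(handler_name)
          else
            send(handler_name, *args)
          end
        rescue Exception => e
          @log << "Glade signal handler '#{handler_name}' caused exception: #{e.class}"
          nil
        end
      end
    end
    method(wrapper_name)
  end
end

window = FakeWindow.new
window.wrapped_handler("on_quit_activate").call(:menu_item)
window.wrapped_handler("on_slider_changed").call(:slider, 0.5)
window.wrapped_handler("on_broken_clicked").call(:button)
puts window.log
```

With a helper like this, the block passed to GladeXML.new would presumably reduce to a single call such as wrapped_handler(handler_name). The closure keeps handler_name as an ordinary variable, so nothing has to be interpolated or escaped inside a heredoc.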

Using these three methods, we are well protected against the most common sources of Exception crashes, and all without adding a single ‘rescue’ block to the core application.

Update September 13, 2007: These techniques have proven extremely effective in preventing application crashes. Since implementing these changes, Luz hasn’t crashed once due to problems in Ruby code. The few crashes I’ve seen have been bugs in the Ruby bindings for the various libraries Luz uses.
