Processing Firefox Crash Reports With Python

  • by Laura Thomson

    • Web tools engineering manager

    • author of two books:

      • PHP and MySQL Web Development
      • The Surrealists
    • Done about 100 talks!

    • Mozilla is hiring like crazy

Overview

  • The basics
  • The numbers
  • Work process and tools

The Basics

  • Socorro: Mozilla’s system for collecting and processing crash reports

  • Lots of companies use it to track their own crash data:

    • Valve’s Steam (game crashes)
    • Various others

How crashy is the browser?

Basic Architecture

  • Database is PostgreSQL
  • HBase for map-reduce; she wants to replace it with something else
  • Lots of components powered by Python
  • Front-end is PHP but will be converted to Django in 2012

Lifetime of a crash

  • Browser crashes

  • Sends data to Mozilla in a big binary dump with a JSON header

  • Mozilla processes the header and tries to generate a signature for the crash

    • The signature needs more than just the function where the crash occurred
    • A single top-of-stack function doesn’t cover all cases
    • Regexes are used to glean additional details out of the binary crash dump (a minimal sketch of the idea follows)
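A minimal sketch of the signature idea in Python: walk the crashing thread’s stack, skip frames that are too generic to identify anything, and join the first few meaningful ones. The skip-list entries and function names below are illustrative assumptions, not Socorro’s actual rules.

    import re

    # Frames too generic to identify a crash; skip them when building the
    # signature (illustrative entries, not Socorro's real skip lists).
    SKIP_LIST = [
        re.compile(r"^KiFastSystemCallRet$"),
        re.compile(r"^NtWaitFor.*"),
        re.compile(r"^RtlpWaitForCriticalSection$"),
        re.compile(r"^__libc_.*"),
    ]

    def generate_signature(stack_frames, max_frames=5):
        """Build a crash signature from the crashing thread's stack.

        The topmost frame alone is often a generic wait or syscall wrapper,
        so walk down the stack, drop uninteresting frames, and join the
        first few meaningful ones.
        """
        meaningful = [
            frame for frame in stack_frames
            if not any(pattern.match(frame) for pattern in SKIP_LIST)
        ]
        return " | ".join(meaningful[:max_frames]) or "EMPTY: no frames"

    # Metadata arrives as JSON alongside the binary dump; processing the
    # dump yields a list of frame names for the crashing thread.
    frames = ["KiFastSystemCallRet", "NtWaitForSingleObject",
              "mozilla::Monitor::Wait", "nsThread::ProcessNextEvent"]
    print(generate_signature(frames))
    # -> mozilla::Monitor::Wait | nsThread::ProcessNextEvent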

Back end processing

Large number of cron jobs, e.g.:

  • Calculate aggregates such as Top Crashers (Farmville, if you want to know); a sketch of such a job follows this list
  • Process incoming builds from the FTP server
  • Match known crashes to Bugzilla bugs
  • Duplicate detection
  • Match up pairs of dumps (OOPP, i.e. out-of-process plugins; content crashes; etc.)
  • Generate CSV extracts for engineers to analyze
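As an illustration of what one of these cron jobs might look like, here is a minimal Top Crashers aggregation against PostgreSQL. The table name, columns, and connection string are assumptions for the sketch, not Socorro’s real schema.

    # Nightly "Top Crashers" aggregation sketch. Assumes a PostgreSQL
    # "reports" table with signature, product, version and date_processed
    # columns -- hypothetical names, not Socorro's actual schema.
    import psycopg2

    TOP_CRASHERS_SQL = """
        SELECT signature, COUNT(*) AS crash_count
        FROM reports
        WHERE product = %s
          AND version = %s
          AND date_processed >= NOW() - INTERVAL '24 hours'
        GROUP BY signature
        ORDER BY crash_count DESC
        LIMIT 100
    """

    def top_crashers(conn, product="Firefox", version="8.0"):
        """Return the 100 most common signatures over the last day."""
        with conn.cursor() as cursor:
            cursor.execute(TOP_CRASHERS_SQL, (product, version))
            return cursor.fetchall()

    if __name__ == "__main__":
        conn = psycopg2.connect("dbname=breakpad")  # connection string is an assumption
        for signature, count in top_crashers(conn):
            print(count, signature)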

Middleware

  • Moving all data access behind a REST API (by end of year); a sketch of the idea follows this list
  • (Some queries still live in the webapp)
  • Enables other front ends to the data, and lets the team rewrite the webapp using Django in 2012
  • Upcoming (2011 or 2012): each component will have its own API
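A minimal sketch of the middleware idea, using Flask purely for illustration (the talk doesn’t say what the middleware is actually built on): a thin REST endpoint in front of crash storage that any front end can consume.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/crashes/top/", methods=["GET"])
    def top_crashers():
        product = request.args.get("product", "Firefox")
        version = request.args.get("version", "8.0")
        # In the real system this would query PostgreSQL/HBase; stubbed here.
        hits = fetch_top_crashers(product, version)
        return jsonify({"product": product, "version": version, "hits": hits})

    def fetch_top_crashers(product, version):
        # Placeholder so the sketch runs; the endpoint shape and this helper
        # are assumptions, not Socorro's actual API.
        return []

    if __name__ == "__main__":
        app.run(port=8000)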

Webapp

  • Hard parts: how to visualize some of this data
  • Example: nightly builds are moving to reporting in build time, not clock time (see the sketch below)
  • The code is crufty (an old version of the Kohana PHP framework)
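A small sketch of the build-time idea: nightly build IDs encode a timestamp (YYYYMMDDHHMMSS), so crashes can be bucketed by the build they came from rather than the day they were reported. The report shape here is an assumption.

    from collections import Counter
    from datetime import datetime

    def parse_build_id(build_id):
        # Build IDs like "20111019030201" encode the build's timestamp.
        return datetime.strptime(build_id, "%Y%m%d%H%M%S")

    def crashes_per_build(crash_reports):
        """crash_reports: iterable of dicts with a 'build' field (assumed shape)."""
        counts = Counter(report["build"] for report in crash_reports)
        return sorted(counts.items(), key=lambda item: parse_build_id(item[0]))

    reports = [{"build": "20111019030201"}, {"build": "20111019030201"},
               {"build": "20111020030155"}]
    print(crashes_per_build(reports))
    # -> [('20111019030201', 2), ('20111020030155', 1)]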

Implementation Details

  • Python 2.6 mostly (PHP is the exception)

  • PostgreSQL 9.1

  • memcache for the webapp

  • Thrift for HBase access (see the sketch after this list)

    • HBase is really meant to be used from Java
    • They could access it from Clojure or Scala, but finding developers would be hard
    • Thought about Jython, then backed off
    • Considering alternatives
  • 100 users

  • 100 Terabytes of data
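For the Thrift point above, a sketch of reading and writing crash rows in HBase from Python via its Thrift gateway. happybase is one commonly used wrapper over that gateway; Socorro’s own code may use generated Thrift bindings directly, and the table, column family, and row key below are assumptions.

    import happybase

    # Connect to the HBase Thrift gateway (host and table name are assumptions).
    connection = happybase.Connection("hbase-thrift.example.com", port=9090)
    table = connection.table("crash_reports")

    # Store the raw dump and its JSON metadata under the crash's UUID.
    table.put(b"some-crash-uuid", {
        b"raw:dump": b"<binary minidump bytes>",
        b"raw:meta_json": b'{"ProductName": "Firefox", "Version": "8.0"}',
    })

    # Fetch just the metadata column back.
    row = table.row(b"some-crash-uuid", columns=[b"raw:meta_json"])
    print(row[b"raw:meta_json"])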

Some Numbers

  • At peak, 2,300 crashes per minute
  • 2.5 million crashes per day
  • Median crash size is 150 KB; maximum accepted size is 20 MB (larger submissions are rejected)
  • ~110 TB stored in HDFS (3x replication, ~40 TB of HBase data); a back-of-envelope check follows
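A quick back-of-envelope check of these figures (treating the median crash size as a rough average, so this overstates precision):

    # Rough arithmetic from the numbers above.
    crashes_per_day = 2.5e6
    median_size_bytes = 150 * 1024      # 150 KB
    replication = 3                     # HDFS replication factor

    raw_per_day_gb = crashes_per_day * median_size_bytes / 1024**3
    print(round(raw_per_day_gb))                          # ~358 GB of dumps per day
    print(round(raw_per_day_gb * replication / 1024, 2))  # ~1.05 TB/day written to HDFS

    peak_per_second = 2300 / 60
    print(round(peak_per_second, 1))                      # ~38.3 crashes/second at peak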

What can they do?

  • Does one version of Firefox crash more than others?
  • Analyze differences between versions of Flash
  • Detect duplicate crashes (an illustrative heuristic is sketched after this list)
  • Detect explosive crashes
  • Find “frankeninstalls” that can happen on Windows
  • Email victims of malware
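For the duplicate-detection bullet, an illustrative heuristic (not Socorro’s actual algorithm): treat two reports as duplicates if they come from the same client with the same signature within a short window. The field names are assumptions.

    from datetime import timedelta

    DUP_WINDOW = timedelta(minutes=5)

    def find_duplicates(reports):
        """reports: dicts with 'uuid', 'client_id', 'signature', 'submitted_at'."""
        ordered = sorted(reports, key=lambda r: (r["client_id"], r["signature"],
                                                 r["submitted_at"]))
        duplicates = []
        for previous, current in zip(ordered, ordered[1:]):
            same_source = (previous["client_id"] == current["client_id"]
                           and previous["signature"] == current["signature"])
            close_in_time = current["submitted_at"] - previous["submitted_at"] <= DUP_WINDOW
            if same_source and close_in_time:
                duplicates.append((previous["uuid"], current["uuid"]))
        return duplicates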

Implementation Scale

  • > 115 boxes (not in the cloud, because that won’t cut it)

  • Now 8 devs + sysadmins + QA + Hadoop ops/analysts

  • Deploy approximately weekly, but could deploy continuously if needed

Development Process

  • Fork
  • Hard to install locally (developers must use a VM)
  • Pull request with bugfix/feature
  • Code review
  • Jenkins polls GitHub master, picks up changes
  • Jenkins runs tests, builds a package (the test-and-package step is sketched after this list)
  • Package picked up and moved to dev
  • Changes wanted in a release are merged to the release branch
  • Jenkins builds the release branch; manual push to stage
  • QA runs acceptance on stage
  • (remaining steps missing from my notes)
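A rough illustration of the “Jenkins runs tests, builds a package” step as a plain script; the real logic lives in Jenkins job configuration, and the commands shown (test runner, packaging) are assumptions.

    import subprocess
    import sys
    from datetime import datetime

    def run(cmd):
        print("+ " + " ".join(cmd))
        return subprocess.call(cmd)

    def main():
        if run(["git", "pull", "origin", "master"]) != 0:
            sys.exit(1)
        if run(["nosetests"]) != 0:        # test runner is an assumption
            sys.exit(1)                    # never package a red build
        tag = datetime.utcnow().strftime("%Y%m%d%H%M%S")
        # Package the working tree so the same artifact moves dev -> stage -> prod.
        run(["tar", "czf", "socorro-%s.tar.gz" % tag, "--exclude=.git", "."])

    if __name__ == "__main__":
        main()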

Absolutely Critical!

Build all the machinery for continuous deployment even if you don’t want to deploy continuously

  • You don’t want to install HBase

Upcoming

  • ElasticSearch implemented for better search (query sketch below)
  • More analytics: automatic detection of explosive crashes, malware, etc.
  • Better queueing
  • Grand Unified Configuration System
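As a hint of what better search looks like once crash reports are indexed, a sketch of querying Elasticsearch from Python; the index name, fields, and endpoint are assumptions (older elasticsearch-py clients take the query as body=, newer ones prefer keyword arguments).

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])   # endpoint is an assumption

    response = es.search(index="crash_reports", body={
        "query": {
            "bool": {
                "must": [{"match": {"signature": "nsThread::ProcessNextEvent"}}],
                "filter": [{"term": {"product": "Firefox"}}],
            }
        },
        "size": 10,
    })
    for hit in response["hits"]["hits"]:
        print(hit["_source"].get("uuid"), hit["_source"].get("signature"))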

Everything is Open Source