Corpus Distillation Techniques for Effective Fuzzing: A Comprehensive Evaluation
Mutation-based fuzzing typically uses an initial set of non-crashing seed inputs (a corpus) from which to generate new inputs by random mutation. A given corpus of potential seeds will often contain thousands of similar inputs. This lack of diversity can lead to wasted fuzzing effort, as the fuzzer will exhaustively explore mutation from all available seeds. To address this, industrial-strength fuzzers such as American Fuzzy Lop (AFL) come with distillation tools (e.g., afl-cmin) that automatically select seeds as the smallest subset of a given corpus that triggers the same range of instrumentation data points as the full corpus. Common practice suggests that minimizing both the number and cumulative size of the seeds may lead to more efficient fuzzing, which we explore systematically here.
We present the results of over 27 CPU-years of fuzzing with eight alternative distillation techniques to understand the impact of corpus distillation on finding bugs in real-world software. Inspired by previous work—in particular, the MINSET technique—we devise a new corpus distillation technique based on a near-optimal solution to the set cover problem. Our technique, MoonLight, delivers smaller corpora—from a factor of three up to two orders of magnitude—compared to afl-cmin, the industrial standard. Furthermore, we show that afl-cmin is comparatively limited in finding bugs.
In contrast to previous work, we conduct a rigorous experimental evaluation of MoonLight, comparing it to state-of-the-art techniques (including afl-cmin and MINSET) on long fuzzing campaigns. We target a diverse set of six common open-source libraries and programs, covering seven different input file formats, and show that distillation is a necessary precursor to any fuzzing campaign when starting with a large initial corpus. Notably, we find that neither MoonLight nor MINSET finds all of the 33 bugs revealed by our extensive fuzzing campaigns. Each technique appears to have its own strengths while also producing smaller corpora than afl-cmin. We find (and report) new bugs with MoonLight that are not found by MINSET, while MINSET also finds some bugs that MoonLight is unable to discover. By comparison, afl-cmin fails to reveal many of these bugs. Of the 33 bugs revealed by our campaigns, seven new bugs have received CVEs.
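To make the set cover framing in the abstract concrete, here is a minimal sketch of corpus distillation using the standard greedy approximation to set cover. It is an illustration only and is not the implementation of MoonLight, MINSET, or afl-cmin; the function name `greedy_distill`, the seed file names, and the integer coverage identifiers are all hypothetical.

```python
from typing import Dict, Hashable, List, Set


def greedy_distill(coverage: Dict[str, Set[Hashable]]) -> List[str]:
    """Greedily pick seeds until they cover everything the full corpus covers.

    `coverage` maps a seed name to the set of coverage points (for example,
    instrumented edges) that the seed triggers. Hypothetical input format.
    """
    # Everything the full corpus covers; the distilled corpus must match it.
    uncovered = set().union(*coverage.values())
    chosen: List[str] = []
    while uncovered:
        # Take the seed that adds the most still-uncovered points.
        # Iterating in sorted order makes ties resolve deterministically.
        best = max(sorted(coverage), key=lambda s: len(coverage[s] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


# Toy corpus: four seeds and the coverage points each one reaches.
corpus = {
    "a.pdf": {1, 2, 3},
    "b.pdf": {2, 3},
    "c.pdf": {4},
    "d.pdf": {1, 4},
}
distilled = greedy_distill(corpus)
print(distilled)  # ['a.pdf', 'c.pdf']
```

In this toy corpus two of the four seeds reproduce the coverage of the whole set. A weighted variant could divide each seed's gain by its file size to also favor a small cumulative corpus size, but that choice is an assumption of this sketch rather than something stated in the abstract.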
I am a professor of computer science at the Australian National University, contributing also as a researcher with Data61 (formerly NICTA). I previously spent 22 years on the faculty at Purdue University. I studied computer science at the University of Adelaide, the University of Waikato, and the University of Massachusetts at Amherst, receiving BSc, MSc, and PhD degrees, respectively. My research interests lie in the area of programming language implementation, and I work on problems arising in object persistence, object databases, distribution, memory management (garbage collection), managed language runtimes, language virtual machines, optimizing compilers, and architectural support for programming languages and applications.
I am a Life Member of the Association for Computing Machinery and a Member of the IEEE. I was named a Distinguished Scientist of the ACM in 2012.
Mon 21 Oct (displayed time zone: Beirut)
16:00 - 17:30
|NAB: Automated Large-scale Multi-language Dynamic Program Analysis in Public Code Repositories|
Andrea Rosà University of Lugano, Switzerland
|Corpus Distillation Techniques for Effective Fuzzing: A Comprehensive Evaluation|
Tony Hosking Australian National University / Data61
|MadMax and Friends: Program Analysis for Smart Contracts|
Neville Grech University of Athens