Lookup Database for Crawling Repositories

Master Thesis - Bachelor Thesis

Thesis description:

A common way of testing software or research prototypes is to use well-known benchmark suites such as the DaCapo suite [1]. However, such large benchmarks are hard to create and maintain, and this work is often done by hand. Moreover, existing benchmark suites target specific properties and are not necessarily adapted to the needs of the software under test.

The Automated Benchmark Management (ABM) methodology [2] was created to address the shortcomings of current benchmark suites. It aims to automate benchmark creation and maintenance and to make the process fully customizable, so that users can create benchmark suites adapted to their needs. We are currently building a website [3] that implements the ABM methodology. The current implementation crawls GitHub for open-source, real-world projects containing user-specified features. It allows users to filter out unsuitable projects and to create and update collections from the remaining projects.

Crawling repositories for specific projects can take a long time. In this thesis, you will research methods to build a lookup database that contains basic information about repositories and can be queried quickly, in order to reduce the initial search time of the collection construction process. You will first survey existing approaches for crawling large code bases such as GitHub. You will then research, design, and implement an approach to generate a database of repositories and their features, and to update this database to reflect changes made in the repositories or in the definitions of the features.
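
To give a rough idea of what such a lookup database could look like, the following is a minimal sketch of an in-memory inverted index that maps features to repositories and supports updates after a re-crawl. It is only an illustration under our own assumptions: the class name RepoLookupIndex, the methods upsert, remove and query, and the use of plain strings for repositories and features are hypothetical and not part of the existing ABM implementation. An actual solution would likely need persistent storage, a crawler, and a strategy for changed feature definitions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory lookup index: feature name -> repositories that contain it. */
class RepoLookupIndex {
    // Forward map: repository -> features last recorded for it (needed for updates).
    private final Map<String, Set<String>> featuresByRepo = new HashMap<>();
    // Inverted map: feature -> repositories, so queries do not require a new crawl.
    private final Map<String, Set<String>> reposByFeature = new HashMap<>();

    /** Records or replaces the feature set of a repository, e.g. after it was re-crawled. */
    void upsert(String repo, Set<String> features) {
        remove(repo); // drop stale entries before inserting the new ones
        featuresByRepo.put(repo, new HashSet<>(features));
        for (String feature : features) {
            reposByFeature.computeIfAbsent(feature, k -> new TreeSet<>()).add(repo);
        }
    }

    /** Removes a repository and all of its entries in the inverted map. */
    void remove(String repo) {
        Set<String> old = featuresByRepo.remove(repo);
        if (old == null) {
            return; // repository was never indexed
        }
        for (String feature : old) {
            Set<String> repos = reposByFeature.get(feature);
            if (repos != null) {
                repos.remove(repo);
                if (repos.isEmpty()) {
                    reposByFeature.remove(feature);
                }
            }
        }
    }

    /** Returns the repositories that have all required features (all repositories if none are required). */
    Set<String> query(Set<String> required) {
        Set<String> result = null;
        for (String feature : required) {
            Set<String> repos = reposByFeature.getOrDefault(feature, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(repos);
            } else {
                result.retainAll(repos); // intersect with repositories seen so far
            }
        }
        return result == null ? new TreeSet<>(featuresByRepo.keySet()) : result;
    }
}
```

In this sketch, calling upsert again for an already indexed repository replaces its old feature set, which is one simple way to reflect changes made in a repository. Handling a changed feature definition would additionally require re-evaluating that feature for the affected repositories, which the sketch leaves open.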

Your work will help future researchers to evaluate their research prototypes safely and correctly. With your work on the project, you will actively help to raise the quality of international research.

Skills required:

  • Good understanding of the Java language.
  • Experience with software design and efficient programming.
  • Prior knowledge of web application development is helpful, but not required.


The thesis will be written in English. It can also be conducted as a Bachelor thesis, with a reduced scope.

Learning outcomes:

  • Assimilate and apply knowledge from relevant literature.
  • Plan, implement and document an independent part of a bigger project.
  • Gain experience in web application development.


Lisa Nguyen, M.Sc. : lisa.nguyen@iem.fraunhofer.de

Dr.-Ing. Ben Hermann : ben.hermann@upb.de


[1] Stephen M. Blackburn et al. The DaCapo benchmarks: Java benchmarking development and analysis. In OOPSLA '06. DOI: http://dx.doi.org/10.1145/1167473.1167488

[2] Lisa Nguyen Quang Do et al. Toward an automated benchmark management system. In SOAP 2016. DOI: http://dx.doi.org/10.1145/2931021.2931023

[3] http://abm.cs.upb.de/abm/index.html#/