### Quick access

# Algorithms for Data Streams

## Goals

The increasing bandwidth of modern computer systems has led to the phenomenon of massive data sets occuring in the form of data streams. Examples for data streams include IP traffic logs, data arising in sensor networks, or webcrawls. A few years ago, the total number of credit card transactions per month was considered as a massive data set. This amount of data is comparable to the number of data packets sent by a single router over a single port in one hour. However, today one would like to analyze the traffic of a network that consists of many routers with many ports.

In many cases, the data sets are too large to be analyzed with traditional algorithms, because one can hardly store the whole data and random access to the stored data is too time-consuming. Hence, what we need are algorithms that are able to process the input in one or just a few passes over the data and that do not store the whole data but only a sketch of the data. These algorithms are called data stream algorithms.

The goal of this project is the development of algorithms to be able to analyze data streams. Our project is divided into three parts:

- Clustering and Facility Location in data streams
- Analysis of huge graphs (webcrawls, etc.)
- Algorithms for geometric data streams

Clustering is the partitioning of a given input into subsets of equal characteristics. These subsets are usually called clusters and ideally consist of similar data records that are dissimilar to data records in other clusters. Such a partioning represents the structure of the input and can be used to reduce the amount of data (e.g. each cluster could be represented by one data record of the cluster). This way, one can use clusters as a coarse representation of the data. As a result, clustering is an important tool for the analysis of data streams.

In the analysis of huge graphs, we want to determine characteristic numbers like the so-called clustering coefficient. The clustering coefficient is closely related to the number of triangles in a graph. Furthermore, we are interested in the structure of cocitation graphs. In the cocitation graph for a directed graph G, there is an undirected edge between two vertices u and v, if a vertex w in G exists that has a directed edge to u and another directed edge to v. We assume that we get the directed graph G as a data stream.

The third part of our project deals with known open problems in dynamic geometric data streams (Matching, bichromatic Matching, etc.). We are also interested in a further analysis of the data stream modell enhanced with a sorting primitive. This model can be usefull for geometric problems.

The project is funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG). The projekt starts with Christian Sohler in Paderborn and is currently managed by him at the University of Dortmund.

## Members

- Christian Sohler (Project Manager, formerly University of Paderborn, currently University of Dortmund)
- Christiane Lammersen (formerly University of Paderborn, currently University of Dortmund)
- Morteza Monemizahdeh (formerly University of Paderborn, currently University of Dortmund)