DBLP FAQ: Which software is behind DBLP?

DBLP has grown from a small-scale server, intended to test Web technology and to serve a small local community, into a Web site used by thousands of people worldwide. Nevertheless, we still run it with a minimal amount of software. There is no database management system behind DBLP; the information is stored in more than 125000 files. The programs used to maintain DBLP are written in C, Perl and Java, and they are glued together by shell scripts. The "production" of DBLP runs on several UNIX machines (SunOS, Solaris, Linux). The CD-ROM version of DBLP, which is distributed with the ACM SIGMOD Anthology, runs under Windows 95/98, Windows NT, MacOS and UNIX.

The initial DBLP server was a small collection of tables of contents (TOCs) of proceedings and journals from the fields of database system research and logic programming. The TOCs were typed directly in HTML and connected to a few introductory pages by handcrafted links.

The next idea was to generate "author pages". An author page lists all publications (co)authored by a person; for an example, look at the page of Stefano Ceri. The generation of these pages works in two steps. In the first step all TOCs are parsed. The TOCs were typed in using a standardized format, which makes parsing very simple. The HTML parser from an early version of xmosaic was combined with a simple finite automaton which identifies volume, number, author, title and page fields within the HTML text. The parser prints all bibliographic information into a single huge text file ("TOC_OUT") using a line-oriented format similar to the refer format. In the second step, after all parsing has been done, a second program (mkauthors) is started. It reads TOC_OUT into a compact main memory data structure and produces the author pages, a list of all author pages, and the file AUTHORS, which contains all author names.
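The second step can be illustrated with a small sketch. It is not the mkauthors program itself: the layout of TOC_OUT is not shown here, so the sketch assumes a refer-like layout with one "%X value" field per line and blank lines between records. The author names, titles and function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical TOC_OUT excerpt. The "%A/%T/%J/%V" tags follow the refer
# convention; the real TOC_OUT layout is an assumption here.
TOC_OUT = """\
%A Alice Example
%A Bob Sample
%T A First Made-Up Title.
%J TOIS
%V 14

%A Alice Example
%T A Second Made-Up Title.
%J TOIS
%V 15
"""

def parse_records(text):
    """Split refer-like text into a list of {tag: [values]} dictionaries."""
    records = []
    for block in text.strip().split("\n\n"):
        record = defaultdict(list)
        for line in block.splitlines():
            tag, _, value = line.partition(" ")
            record[tag].append(value.strip())
        records.append(dict(record))
    return records

def group_by_author(records):
    """Map each author name to the titles of the papers they (co)authored."""
    pages = defaultdict(list)
    for record in records:
        for author in record.get("%A", []):
            pages[author].extend(record.get("%T", []))
    return dict(pages)

records = parse_records(TOC_OUT)
author_pages = group_by_author(records)
authors = sorted(author_pages)  # plays the role of the AUTHORS file
```

Each entry of author_pages holds what one author page would list, and the sorted key list corresponds to the content of AUTHORS.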

The files AUTHORS and TOC_OUT are the inputs of author and title, two CGI programs for searching DBLP. Both are written in C and work by "brute force": they do a sequential search for each query.
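A sequential search of this kind needs no index at all. The following sketch shows the idea in Python rather than C; the case-insensitive substring match and the sample names are assumptions, not a description of the actual CGI programs.

```python
def sequential_search(lines, query):
    """Scan every line once and keep those that contain the query."""
    needle = query.casefold()
    return [line for line in lines if needle in line.casefold()]

# Hypothetical author names standing in for the contents of AUTHORS.
AUTHORS = ["Alice Example", "Bob Sample", "Carol Exampleton"]

matches = sequential_search(AUTHORS, "example")
```

The cost of each query grows linearly with the size of the file, which is acceptable as long as the file fits comfortably in memory.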

The mkauthors program and the search engine produce HTML in which all author names are linked to the corresponding author pages. To get such links into the TOCs, we used a modified version of the parser. This program adds the links to the original TOC page if the corresponding author pages are available. The program was started after each run of mkauthors.
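The conditional linking can be sketched as follows. The page path and the dictionary that records which author pages exist are hypothetical; the sketch only shows that a name is linked when a page is available and left as plain text otherwise.

```python
import html

def link_author(name, author_pages):
    """Wrap the name in a link only if an author page is known for it."""
    url = author_pages.get(name)
    if url is None:
        return html.escape(name)
    return f'<a href="{html.escape(url, quote=True)}">{html.escape(name)}</a>'

# Hypothetical mapping from author name to author page path.
AUTHOR_PAGES = {"Alice Example": "a/Example:Alice.html"}

linked = link_author("Alice Example", AUTHOR_PAGES)
plain = link_author("Bob Sample", AUTHOR_PAGES)
```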

We intended to include annotated bibliographies and reading lists for seminars and courses in DBLP. To make this feasible, a simple HTML preprocessor, mkhtml, was written. It replaces the tag

<cite key="...">

by the bibliographic information from DBLP. The mechanism is very similar to the \cite{...} of LaTeX.
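A minimal sketch of such a substitution, assuming the records are already available in memory, is given below. The rendered citation layout, the dictionary structure and the function names are assumptions for illustration; only the tag syntax and the sample data come from DBLP.

```python
import re

# In-memory stand-in for the bibliographic data (structure is assumed).
RECORDS = {
    "GottlobSR96": {
        "authors": ["Georg Gottlob", "Michael Schrefl", "Brigitte Röck"],
        "title": "Extending Object-Oriented Systems with Roles.",
        "journal": "TOIS",
        "volume": "14",
        "number": "3",
        "pages": "268-296",
        "year": "1996",
    }
}

# Matches <cite key="..."> and captures the key.
CITE_TAG = re.compile(r'<cite\s+key="([^"]*)"\s*>')

def format_citation(record):
    """Render one record as a single line of text (layout is made up)."""
    authors = ", ".join(record["authors"])
    return (f'{authors}: {record["title"]} {record["journal"]} '
            f'{record["volume"]}({record["number"]}): '
            f'{record["pages"]} ({record["year"]})')

def expand_cites(text, records):
    """Replace every known <cite key="..."> tag by its formatted record."""
    def replace(match):
        record = records.get(match.group(1))
        # Unknown keys are left untouched so that the gap stays visible.
        return format_citation(record) if record else match.group(0)
    return CITE_TAG.sub(replace, text)

expanded = expand_cites('<li><cite key="GottlobSR96">', RECORDS)
```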

When implementing mkhtml, we decided to separate the bibliographic records from the TOCs. For each paper, a small file with the essential data is stored in a file system subtree (.../dblp/publ). BibTeX would be an obvious format for these files, but parsing BibTeX is hard and we had no BibTeX parser. Instead, we reused the HTML parser and defined tags for the BibTeX record types and field names. Our bibliographic records look like this:

<article key="GottlobSR96">
<author>Georg Gottlob</author>
<author>Michael Schrefl</author>
<author>Brigitte Röck</author>
<title>Extending Object-Oriented Systems with Roles.</title>
<pages>268-296</pages>
<year>1996</year>
<volume>14</volume>
<journal>TOIS</journal>
<number>3</number>
<url>db/journals/tois/tois14.html#GottlobSR96</url>
</article>

A few years later the XML standard appeared. Our bibliographic records fit into the XML framework. In a software lab, undergraduate students configured a Java XML parser available on the Internet to read in all records. After we corrected some minor typos not caught by our parser, the experiment ran successfully.
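Because the records are well-formed XML, any standard XML parser can read them. The sketch below uses Python's xml.etree.ElementTree instead of the Java parser used in the lab; the shape of the returned dictionary is an assumption chosen for illustration.

```python
import xml.etree.ElementTree as ET

RECORD = """<article key="GottlobSR96">
<author>Georg Gottlob</author>
<author>Michael Schrefl</author>
<author>Brigitte Röck</author>
<title>Extending Object-Oriented Systems with Roles.</title>
<pages>268-296</pages>
<year>1996</year>
<volume>14</volume>
<journal>TOIS</journal>
<number>3</number>
<url>db/journals/tois/tois14.html#GottlobSR96</url>
</article>"""

def parse_record(xml_text):
    """Turn one bibliographic record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "type": root.tag,                # BibTeX record type, e.g. "article"
        "key": root.get("key"),
        "authors": [a.text for a in root.findall("author")],
        # All remaining single-valued fields, keyed by tag name.
        "fields": {c.tag: c.text for c in root if c.tag != "author"},
    }

record = parse_record(RECORD)
```

A record with a missing closing tag or an unescaped ampersand makes ET.fromstring raise a ParseError, which is how a strict XML parser exposes typos that a lenient HTML parser lets through.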

To avoid redundancy (and maintenance problems) we now use the mkhtml preprocessor for the TOCs, too. A table of contents is just another text citing papers. The <cite ...> tag has a style attribute to control the appearance of the citations. The mkhtml program generates links for all author names if the author pages exist. It knows a few additional tags to produce footers and logos. Links inside DBLP are marked by a special tag, and mkhtml checks the availability of the destination URL. The source of a TOC file looks like this:

<html><head><title>IEEE Database Engineering Bulletin,
Volume 4</title></head> <body bgcolor="#ffffff" text="#000000" link="#000000"> <logo>
<h1><ref href="db/journals/debu/index.html">IEEE
Database Engineering Bulletin</ref>,
Volume 4</h1><hr>
In 1981 the IEEE-CS Technical Committee on
Database Engineering decided to turn Database
Engineering from a short newsletter into a
theme-driven magazine.
<h2>Volume 4, Number 2, December 1981</h2>
Special Issue on Database Machines
<ul>
<li><cite key="journals/debu/Kim81">
<li><cite key="journals/debu/Song81">

<li><cite key="journals/debu/Hsiao81">
<li><cite key="journals/debu/BoralD81">
<li><cite key="journals/debu/Ubell81">
<li><cite key="journals/debu/Hawthorn81">
<li><cite key="journals/debu/ShawSIHWA81">
<li><cite key="journals/debu/YaoTS81">
<li><cite key="journals/debu/AroraD81">
</ul>
<footer>

For most TOCs we "reconstructed" the source files using simple Perl scripts. The parser that produced the TOC_OUT file was replaced by another parser, which collects the information from the bibliographic records.

The most productive way to enter bibliographies for journals and proceedings is to type in complete tables of contents and not scattered bibliographic records.

...

(still incomplete)