
Math Web Search System Description

MathWebSearch is a complete system capable of crawling, indexing and searching mathematical data. The components are implemented in POSIX-compliant C/C++ and rely on a few third-party libraries. The main structure of the system is presented below:

Structure of the new system (MWS-0.4)

The crawler system (crawler) indexes MathML-rich websites and produces MWS Harvests, based on the Content-enabled m:math nodes it finds. The MWS Harvests are fed into the core, which parses them and updates two indexes:

  • a fast substitution-based tree for the mathematical structure, and
  • a BTree database for the additional information (such as URIs and XPaths).
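
The split between the two indexes can be illustrated with a minimal C++ sketch. It is not the MWS implementation: ToyIndex, FormulaId and Location are hypothetical names, a std::map keyed by a canonical string stands in for the substitution tree, and a second std::map stands in for the BTree database. The only point is that structure is stored once and maps to an id, while the id maps to every place where the formula occurs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not the MWS data model.
using FormulaId = std::uint32_t;

struct Location {
    std::string uri;    // document the formula was found in
    std::string xpath;  // position of the m:math node inside that document
};

class ToyIndex {
public:
    // Record one occurrence of a formula given as a canonical string.
    // Identical formulas share one id, so structure is stored only once.
    FormulaId add(const std::string& canonical, const Location& where) {
        auto it = structure_.find(canonical);
        if (it == structure_.end()) {
            it = structure_.emplace(canonical, next_id_++).first;
        }
        locations_[it->second].push_back(where);
        return it->second;
    }

    // Return every known location of an exact formula (empty if unknown).
    std::vector<Location> lookup(const std::string& canonical) const {
        auto it = structure_.find(canonical);
        if (it == structure_.end()) {
            return {};
        }
        auto loc = locations_.find(it->second);
        assert(loc != locations_.end());  // every id has at least one location
        return loc->second;
    }

private:
    std::map<std::string, FormulaId> structure_;            // stand-in for the substitution tree
    std::map<FormulaId, std::vector<Location>> locations_;  // stand-in for the key/value database
    FormulaId next_id_ = 0;
};
```

Adding the same canonical formula from two different pages yields a single id with two Location entries, which mirrors the way one tree leaf can point to many URI+XPath records.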

The RESTful interface (restd) accepts HTTP POST requests with MWS Query data, forwards them to the core for processing, and packs the MWS Answer Set received from the core into an HTTP response that is sent back to the user. The HTTP implementation details are left to the MicroHTTPd library.
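
From the client side, such a request could look roughly like the one built by the sketch below. The function name build_post_request, the Content-Type value and the Connection header are assumptions made for illustration; the page does not specify which headers restd expects, and the body is passed in opaquely because the MWS Query schema is not described here.

```cpp
#include <cassert>
#include <string>

// Build a raw HTTP/1.1 POST request carrying an XML payload.
// Hypothetical helper: the header set is an assumption, not restd's documented contract.
std::string build_post_request(const std::string& host,
                               const std::string& path,
                               const std::string& body) {
    assert(!path.empty() && path[0] == '/');  // request target must be an absolute path

    std::string request;
    request += "POST " + path + " HTTP/1.1\r\n";
    request += "Host: " + host + "\r\n";
    request += "Content-Type: application/xml\r\n";  // assumed media type
    request += "Content-Length: " + std::to_string(body.size()) + "\r\n";
    request += "Connection: close\r\n";
    request += "\r\n";  // blank line separates headers from the body
    request += body;
    return request;
}
```

The returned string can be written to a TCP socket connected to restd; the response body would then carry the MWS Answer Set.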

The core system (mwsd) processes the mathematical data and builds the indexes. It accepts the MWS XML input formats (MWS Harvest and MWS Query) and generates the MWS XML output format (MWS Answer Set). There are two main use cases:

  • The crawler sends an MWS Harvest to the MathWebSearch Daemon. The XML is parsed and an internal representation is generated. This is used to update the Substitution Indexing Tree and, consequently, the database.
  • A user sends an MWS Query to the MathWebSearch Daemon. The XML is parsed and an internal query is generated. Using an efficient traversal of the Substitution Indexing Tree, the formulas matching the search term are collected into a result. This is translated into an MWS Answer Set and sent back to the user.
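
What "matching the search term" means can be sketched in C++ as follows, assuming that a query may contain named placeholders (called query variables here) that stand for arbitrary subterms. This is a naive matcher that tests one formula at a time; it is not the Substitution Indexing Tree traversal, which shares work across all indexed formulas. Term, sym, qvar, same_term and match are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical internal representation: a symbol applied to arguments,
// or a named query variable that may stand for any subterm.
struct Term {
    std::string name;
    std::vector<Term> args;
    bool is_qvar;
};

Term sym(std::string name, std::vector<Term> args = {}) {
    return Term{std::move(name), std::move(args), false};
}

Term qvar(std::string name) {
    return Term{std::move(name), {}, true};
}

using Bindings = std::map<std::string, Term>;

// Structural equality of two terms.
bool same_term(const Term& a, const Term& b) {
    if (a.is_qvar != b.is_qvar || a.name != b.name || a.args.size() != b.args.size()) {
        return false;
    }
    for (std::size_t i = 0; i < a.args.size(); ++i) {
        if (!same_term(a.args[i], b.args[i])) return false;
    }
    return true;
}

// True if `formula` is an instance of `query`. Each query variable is bound
// on first use and must match the same subterm on every later use.
// On failure, `bindings` may hold partial results and should be discarded.
bool match(const Term& query, const Term& formula, Bindings& bindings) {
    assert(!formula.is_qvar);  // indexed formulas contain no query variables
    if (query.is_qvar) {
        auto it = bindings.find(query.name);
        if (it == bindings.end()) {
            bindings.emplace(query.name, formula);
            return true;
        }
        return same_term(it->second, formula);
    }
    if (query.name != formula.name || query.args.size() != formula.args.size()) {
        return false;
    }
    for (std::size_t i = 0; i < query.args.size(); ++i) {
        if (!match(query.args[i], formula.args[i], bindings)) return false;
    }
    return true;
}
```

With the query plus(a, a), where a is a query variable, the formula plus(x, x) matches and binds a to x, while plus(x, y) does not match because a cannot stand for both x and y.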

Note that XML parsing and writing are implemented using the LibXML2 API, while the database library of choice is Berkeley DB, as we are dealing mainly with key/value data pairs.

The RESTful interface as well as the MathWebSearch Daemon are multi-threaded applications. The daemon uses a simple one-thread-per-connection policy, while the REST interface can be configured to use either threads or poll/select. Communication between the MWS components is done via TCP/IP.
