This page references "BlueBerry", an IBM-internal project I created in 2007. Designed to provide a search interface across multiple databases using commodity hardware, BlueBerry made unique use of over 100 surplus IBM ThinkPads. Consult the links below for more information.
| How it Works - Software - BlueBerry. |
|
|
This page describes the software systems that make up the BlueBerry application. Please consult the Design section for the rationale behind the current hardware and software architecture.

General Application Architecture, Software Components

At its core, BlueBerry is a simple text search engine that scores the results and displays them on a web page. Searching more than 1.5GB of data spread across 80 laptops, scoring the results, and doing it fast enough to be useful is a more complex task. To accomplish this, BlueBerry uses only open source software (Fedora Core Linux, Apache, Grep and Perl), which keeps the software stack easy to administer and delivers high performance. The BlueBerry software system is most easily understood by treating each function as part of one of three components: Query and Record Processing, Failure Management, or Data Extraction.

Query and Record Processing

Let's follow a typical query through the BlueBerry system for a brief tour of the programs that make up this software component. In the search box below, enter the query 'harrington'. After the search button has been pressed, the Apache web server passes the search parameters to the Perl module associated with a web search. The Perl module looks for cached results for the query and, if none exist, passes the query to the BlueBerryQueryController Perl program. The BlueBerryQueryController program is the single interface to the processing node cluster. Every query and result passes through this program to ensure processing priority and completeness. After sending the query out to each node, the program waits for each node to respond with serial numbers and associated scores. The BlueBerryQueryController program then sorts the returned list and sends requests for record summaries (with keyword highlighting) to each of the nodes. The record summaries are retrieved and then displayed to the user on the web page.
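The query flow above (cache lookup, fan-out to every node, merge and sort by score, then fetch highlighted summaries) can be sketched in a few lines. This is only an illustration in Python, not the BlueBerry Perl code: the `NODES` dictionary stands in for the laptop cluster, the names `search_node`, `summarize` and `run_query` are hypothetical, and scoring a record by its number of case-insensitive matches is an assumption, since this page does not say how scores are computed.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the processing nodes: each node holds its own
# shard of records, keyed by serial number.
NODES = {
    "node01": {"1001": "Harrington, J. Sales, Austin",
               "1002": "Smith, A. Research"},
    "node02": {"2001": "Lee, K. reports to Harrington; Harrington team lead"},
}

_cache = {}  # lower-cased query -> finished result list


def search_node(node, query):
    """Return (serial, score) pairs for one node's shard.

    Assumed scoring: the number of case-insensitive matches in the record.
    """
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for serial, text in NODES[node].items():
        count = len(pattern.findall(text))
        if count:
            hits.append((serial, count))
    return hits


def summarize(node, serial, query):
    """Return the record text with every match wrapped in ** markers."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return pattern.sub(lambda m: f"**{m.group(0)}**", NODES[node][serial])


def run_query(query, limit=100):
    """Single entry point to the cluster, in the role of the query controller."""
    key = query.lower()
    if key in _cache:                      # repeated queries skip the cluster
        return _cache[key]
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        # Fan the query out to every node in parallel.
        futures = {node: pool.submit(search_node, node, query) for node in NODES}
        scored = [(score, serial, node)
                  for node, fut in futures.items()
                  for serial, score in fut.result()]
    # Highest score first; ties broken by serial number for a stable order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    results = [(serial, score, summarize(node, serial, query))
               for score, serial, node in scored[:limit]]
    _cache[key] = results
    return results
```

Calling `run_query("harrington")` searches both shards, ranks the records, and returns `(serial, score, summary)` tuples. A second call with the same query is answered from `_cache` without touching the nodes.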
Each node that processes a query has its own portion of the BlueBerry database to search. Each portion is typically around 20MB in size, and the node's operating system automatically caches the file in RAM after the Grep program first accesses it. Subsequent searches are therefore very rapid, as no data is read from disk after the initial load. For a simple query like 'harrington', the entire process described above takes approximately 1.5 seconds. More complex queries take longer, up to 45 seconds for a query like 'IBM'. Note that once a query has been performed, repeated queries will be very fast because the results have been cached. At any time during this process, a node may suffer a hardware failure such as a bad disk drive or a failed Ethernet connection. Node operating system failures are also common when the hardware fails. The Failure Management section below describes how these malfunctions are handled.

Failure Management

The machines in the BlueBerry processing cluster have a mean age of approximately 5.7 years, so the cluster is highly prone to hardware failure. These failures are managed primarily by detecting a malfunctioning node and replacing it with a 'hot spare'. The central BlueBerry machine runs a heartbeat client, and each of the processing nodes runs a heartbeat server. Once every few seconds, the heartbeat client attempts a connection to a processing node. If the connection fails, or the node reports a hardware error, the node is replaced by a 'hot spare' and the appropriate data files are copied to the new node. The master node list is then updated, and commands are sent to the BlueBerryQueryController and other programs to reload their list of available processing nodes. These failures can happen at any time during the query process, so the BlueBerryQueryController and other programs are designed to time out and fail gracefully if a node cannot be reached.
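The heartbeat and hot-spare procedure can be illustrated with a short sketch. It is written in Python for illustration and is not taken from the BlueBerry code: `tcp_probe`, `check_cluster`, the port number 9999 and the `copy_data` callback are all hypothetical, and the probe only tests whether a connection succeeds (it does not model a node reporting a hardware error).

```python
import socket


def tcp_probe(host, port=9999, timeout=2.0):
    """Return True if a TCP connection to the node's heartbeat server succeeds.

    The port number is an assumed placeholder.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def check_cluster(active, spares, probe, copy_data):
    """Run one heartbeat pass over the active node list.

    active    -- list of node names currently serving data (updated in place)
    spares    -- list of idle hot spares (consumed from the front)
    probe     -- callable(node) -> bool, True when the node is healthy
    copy_data -- callable(slot, spare) that copies the failed node's data
                 files to the spare taking over that slot
    Returns a list of (failed_node, replacement) pairs.
    """
    replaced = []
    for slot, node in enumerate(active):
        if probe(node):
            continue
        if not spares:
            break  # assumed policy: with no spares left, leave the list as-is
        spare = spares.pop(0)
        copy_data(slot, spare)   # move the data shard to the new node
        active[slot] = spare     # update the master node list
        replaced.append((node, spare))
    return replaced
```

In a real deployment the function would be called every few seconds with `probe=tcp_probe`, and a non-empty return value would be the trigger for telling the query controller to reload its node list. Passing the probe as a parameter keeps the replacement logic testable without a network.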
Data Extraction

Besides the code that processes it, BlueBerry's primary unique resource is the data itself. The BlueBerry datastore is a conglomeration of many distinct databases, linked by a single unique identifier: the employee id (UID). The data extraction portion of the BlueBerry code uses shell scripts, link grabbers, and other automated tools to extract data from web sites, spreadsheets, Lotus Notes databases, and other sources.

Beta Limitations

To ensure optimal performance for all users, this Beta release (20070726) of BlueBerry limits results to 100 records. All of the records are searched, but only the top 100 results are displayed. |