User:Conseo-Polyc0l0rNet/Harvester/Development/Design
Server-side
The server side consists mainly of four services. These services are responsible for giving the user on the web a snappy experience with minimal delay, while keeping us from getting banned for abusive web scraping.
Parallel access to the API
- Allow instances to run in different JavaVMs in parallel (servlet, utility command line tools, daemon harvesters with live detectors). This is needed because harvesting can be triggered from different instances (e.g. the servlet, by a user).
- Synchronize state between these running instances. Our main concern is the outside impact of our fetches, and we already use Apache's asynchronously designed HTTP client (reactor pattern), which is optimized for this; what remains is to keep the data (cache state) consistent while keeping the outside impact small.
- The experience with former prototypes has guided us to a backwards harvesting process, which should be possible in any archive given a stable entry point and a progress marker in the archive to stop this linear walk. Currently harvesters are modelled on the assumption that such a stable path is available.
The services then allow the harvester to focus on the harvesting process according to the kicks it receives, while state is tracked in the services. If a harvester needs additional state, it can implement that itself in the db (for quick synchronization between different processes).
A previous concept assumed state saving in each harvester, which cannot be standardized in the scheduler because each harvester is free to use the services as it wishes (including its programming language). Once we run this in several instances, though, it needs to be implemented in a safe way that prevents double crawls. Instead of saving the state in the harvester, I have therefore come to the conclusion to save the state along the backward path (the core state of each forum in the harvester) in the scheduler (HarvestRunner) itself, and to make that state queryable even for millions of urls.
What happens now is that the harvester decides whether a job-url is static or not, and the scheduler then remembers it. Since this is assumed to be true for the directory-browsing urls, e.g. browsing the archive on the timeline by month (as in pipermail), those months are not rescheduled once the scheduler remembers them. Every harvest then stops automatically once it reaches covered terrain, and this terrain extends in small steps into the present (by each post of the current time frame, in pipermail a month).
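A minimal sketch of this bookkeeping follows, assuming illustrative names (HarvestRunner, submit, markFetched) rather than the actual API; the real scheduler persists this state in the database instead of in memory.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class HarvestRunner {

        // static urls that have already been fetched and committed
        private final Set<String> coveredStaticUrls = ConcurrentHashMap.newKeySet();

        /** Returns true if the job was scheduled, false if it hit covered terrain. */
        public boolean submit(String url, boolean isStatic) {
            if (isStatic && coveredStaticUrls.contains(url)) {
                return false; // already harvested, so the backward walk stops here
            }
            enqueue(url, isStatic);
            return true;
        }

        /** Called by the harvester once the fetched url is committed to HarvestCache. */
        public void markFetched(String url, boolean isStatic) {
            if (isStatic) coveredStaticUrls.add(url);
        }

        private void enqueue(String url, boolean isStatic) {
            // hand the job to the per-forum queue (omitted)
        }
    }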
HarvestCache
To cache the data locally we need to store it. This data is then queried from the web (HarvestWAP), so we use Votorola's database (from a running voteserver). A double insertion into the database should be impossible and should be logged, as it is a hint to misbehaving harvesters.
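A sketch of that duplicate guard, assuming a unique key on the message url and a PostgreSQL-style ON CONFLICT clause; the table and method names are illustrative, not the actual HarvestCache schema.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.logging.Logger;

    public class HarvestCache {

        private static final Logger LOG = Logger.getLogger(HarvestCache.class.getName());

        /** Stores a harvested message; a second insert of the same url is rejected and logged. */
        public void put(Connection db, String messageUrl, String content) throws SQLException {
            final String sql = "INSERT INTO harvest_message (url, content) VALUES (?, ?) "
                             + "ON CONFLICT (url) DO NOTHING"; // relies on a unique key on url
            try (PreparedStatement st = db.prepareStatement(sql)) {
                st.setString(1, messageUrl);
                st.setString(2, content);
                if (st.executeUpdate() == 0) {
                    LOG.warning("duplicate insert attempted for " + messageUrl
                        + " - possibly a misbehaving harvester");
                }
            }
        }
    }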
HarvestRunner
The scheduler needs to save state consistently between different parallel instances, avoiding double scraping whenever possible. Because we need to decide quickly whether to harvest or not, we store all harvested urls that the harvester declares static in an optimized table (a hash map in the db).
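That "hash map in the db" could look like the following sketch: a compact table keyed by an 8-byte url hash (see the hashing sketch under URL-history state saving below). Table and column names are assumptions for illustration.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public final class StaticUrlIndex {

        /** One-time setup: a compact table holding only the 8-byte hash and the full url. */
        public static void createTable(Connection db) throws SQLException {
            try (Statement st = db.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS harvest_url ("
                         + "  url_hash BIGINT PRIMARY KEY,"   // 8-byte SHA prefix of the url
                         + "  url      TEXT NOT NULL)");      // full url kept in its own column
            }
        }

        /** Quick decision: has this static url already been harvested? */
        public static boolean isCovered(Connection db, long urlHash) throws SQLException {
            try (PreparedStatement st = db.prepareStatement(
                    "SELECT 1 FROM harvest_url WHERE url_hash = ?")) {
                st.setLong(1, urlHash);
                try (ResultSet rs = st.executeQuery()) {
                    return rs.next();
                }
            }
        }
    }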
Scheduling
Jobs are sorted by the base-url of their forum. For regularly scheduled jobs, one url is fetched per second per forum. We don't do any load balancing on a per-host basis, as we assume that servers hosting more archives already take more load from public crawlers.
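A minimal sketch of this per-forum pacing, with illustrative names: jobs are grouped by the forum's base-url and released at most once per second per forum, while a burst flag (see Quick updates below) simply skips the pause.

    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class ForumScheduler {

        // job queues grouped by the forum's base-url
        private final Map<String, Queue<String>> queuesByForum = new ConcurrentHashMap<>();

        public void enqueue(String forumBaseUrl, String jobUrl) {
            queuesByForum.computeIfAbsent(forumBaseUrl, k -> new ConcurrentLinkedQueue<>()).add(jobUrl);
        }

        /** Drains one forum's queue, fetching one url per second unless bursting. */
        public void drain(String forumBaseUrl, boolean burst) throws InterruptedException {
            Queue<String> queue = queuesByForum.get(forumBaseUrl);
            if (queue == null) return;
            String url;
            while ((url = queue.poll()) != null) {
                fetch(url);                       // hand off to the async HTTP client
                if (!burst) Thread.sleep(1000);   // regular schedule: one url per second per forum
            }
        }

        private void fetch(String url) {
            // delegate to the Apache async HTTP client (omitted)
        }
    }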
Quick updates
In bursting mode a forum is updated as fast as possible, without delays. In bad circumstances this can stress a single host serving many forums. At the moment we accept this problem, because we need quick updates. We can always reduce it later with background updates that lower the gap a burst has to cover.
URL-history state saving in Hash-Map
We calculate a hash for each url with SHA and take 8 bytes (2^64 possible values) of it. By operating only on this hash column (bigint) of all urls (which are still saved to the db in a separate column), we can store the existence of each url in 8 bytes. Even with 10 million harvested urls this makes about 80 megabytes of table data, which is still bearable in main memory and reasonably fast to query. Note that we don't even need all historical urls in main memory for this to work, as we can later allow the admin to mark forums as closed or "historical data" and take them off the scheduler's index that way. We might be able to automate that as well by reading the wiki definitions of the forums during startup (marking the forum as finished there).
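A sketch of the hashing step, assuming SHA-1 (any SHA variant truncates the same way): the first 8 bytes of the digest are packed into a long, which maps directly onto the bigint column.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class UrlHash {

        /** Returns the first 8 bytes of SHA-1(url) packed into a signed 64-bit value. */
        public static long of(String url) {
            try {
                MessageDigest sha = MessageDigest.getInstance("SHA-1");
                byte[] digest = sha.digest(url.getBytes(StandardCharsets.UTF_8));
                long hash = 0L;
                for (int i = 0; i < 8; i++) {
                    hash = (hash << 8) | (digest[i] & 0xFFL); // pack bytes big-endian
                }
                return hash;
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError(e); // SHA-1 is guaranteed on every JVM
            }
        }
    }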
Harvesters can also mark the indexes of whole parts of archives as finished if they access them as historical data, so that large numbers of urls are only generated during long-running url updates. This means that a command line call to the harvester to reharvest an archive can reduce this data at roughly the same rate as it arises, at least for hierarchical forums.
Authenticator
The Authenticator is a rather small part. It offers an interface for harvesters to map a message's sender to one of the two drafters compared in the difference. It potentially allows any kind of lookup; for now it lets the user configure different nicknames on the wiki.
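A hypothetical shape of that interface; the method and parameter names are assumptions, not the actual Authenticator API.

    import java.util.Optional;

    public interface Authenticator {

        /**
         * Resolves a message sender (mail address or nickname) to one of the two
         * drafters compared in the difference, or to neither.
         *
         * @param sender   the sender as it appears in the forum message
         * @param drafterA username of the first compared drafter
         * @param drafterB username of the second compared drafter
         * @return the matching drafter's username, or empty if the sender is neither
         */
        Optional<String> resolveDrafter(String sender, String drafterA, String drafterB);
    }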
Kicker
The Kicker is mostly a messaging interface to all harvesters. It broadcasts kicks (e.g. config-kicks) to all harvesters, while for special update-kicks we assume a common attribute carrying the base-url of the forum. Harvesters subscribe to each forum they learn about in config-kicks, and each forum is supposed to be subscribed to by only one harvester. That way harvesters don't have to deal with the volume of unrelated events, and the kicker can dispatch the data quickly via a hash map.
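A sketch of this dispatch, assuming a simple in-memory registry and illustrative names: config-kicks are broadcast to every harvester, while update-kicks are routed by the forum's base-url to its single subscriber.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;

    public class Kicker {

        public interface Harvester {
            void kick(Map<String, String> kick); // kick payload as key/value attributes
        }

        private final List<Harvester> all = new CopyOnWriteArrayList<>();
        private final Map<String, Harvester> byForum = new ConcurrentHashMap<>();

        public void register(Harvester h) { all.add(h); }

        /** Called when a harvester learns about a forum from a config-kick. */
        public void subscribe(String forumBaseUrl, Harvester h) {
            byForum.put(forumBaseUrl, h); // one harvester per forum
        }

        /** Config-kicks are broadcast to every harvester. */
        public void broadcastConfigKick(Map<String, String> kick) {
            for (Harvester h : all) h.kick(kick);
        }

        /** Update-kicks carry the forum's base-url and go to its single subscriber. */
        public void updateKick(Map<String, String> kick) {
            Harvester h = byForum.get(kick.get("baseUrl"));
            if (h != null) h.kick(kick);
        }
    }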
Client-side
The client side interface is a track in the trackstack. It is supposed to show minimalistic icons for the discussions going on.
Risk evaluation
Unsynchronized access and external load
I assume that in a worst case two jobs in different JavaVMs can have a race condition against the db, where each schedules the next job (not yet committed to the db, because not yet run). Even in this case, the state of the url is checked before the fetch and again after it, before the job (which may cause further jobs) is run. This synchronizes against the db and shrinks the racy time frame to the gap between the fetch and the commit, i.e. the computation the harvester performs on the message. Since this time frame is usually much smaller than network delays, the race condition should die out after a while. It only occurs if an update is triggered at the same time in both VMs, and only for new messages in the archive, so I assume the risk of race conditions causing serious load is very low. A parallel burst for a very outdated cache could scale up quickly, though. Internally the race condition has no effect, because the client data is available as soon as the first job commits to the db.
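A sketch of the double check, with all names assumed for illustration: the url's state is read immediately before the fetch and re-read after it, so the racy window shrinks to the harvester's own processing time on the message.

    public class RaceGuardedJob {

        /** Stands in for the scheduler's db-backed coverage state. */
        public interface CoverageState {
            boolean isCovered(String url);  // has some VM already committed this url?
            void markCovered(String url);   // commit the url as harvested
        }

        private final CoverageState state;

        public RaceGuardedJob(CoverageState state) { this.state = state; }

        public void run(String url) {
            if (state.isCovered(url)) return;   // check before fetching
            String message = fetch(url);
            if (state.isCovered(url)) return;   // re-check: another VM won the race
            process(message);                   // may schedule further jobs
            state.markCovered(url);             // commit, closing the window
        }

        private String fetch(String url) { return ""; }   // async HTTP client call (omitted)
        private void process(String message) { }          // parse and put to HarvestCache (omitted)
    }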
Corrupting DB data
Since all messages are stored only once by DB design, and a repeated put to HarvestCache has no effect, each instance should always be in a consistent, even if not up-to-date, state. We achieve this by choosing a stable path for the harvesting process (archives are static, while some forums may allow re-edits) and by storing and syncing state directly at the very place it changes. This is done by the harvester itself committing finished urls to HarvestCache.