--- Log opened Sat Sep 01 00:00:44 2012
02:29 < conseo> yes
02:31 < conseo> i have summed up the design rationale, besides the javadocs on the wiki. maybe you could have a look? http://zelea.com/w/User:4consensus_WebDe/Harvester/Development/Design
02:32 < conseo> i would like to know if you agree with the harvest-runner state saving design on a per url basis.
02:32 < conseo> i am off for today, cu tomorrow
--- Log opened Sat Sep 01 09:10:23 2012
09:36 < mcallan> refreshing my memory.  kicks are to be manual: http://mail.zelea.com/list/votorola/2012-June/001388.html
09:37 < mcallan> my understanding of the arch is still this: http://zelea.com/project/votorola/_/javadoc/votorola/a/diff/harvest/package-summary.html#package_description
09:37 < mcallan> so you are talking about the state of a harvester
09:38 < mcallan> my last understanding of that was http://mail.zelea.com/list/votorola/2012-May/001375.html
09:39 < mcallan> (sorry, that is no longer relevant.  it was for an auto-kicker, not harvester)
09:45 < mcallan> this is all for the pipermail harvester.  when it runs, it needs to recall state from the last run, which might have been in a different vm
09:50 < mcallan> http://zelea.com/project/votorola/_/javadoc/votorola/a/diff/harvest/PipermailHarvester.html
09:51 < mcallan> it's an iteration, walking (backwards i think) through the archive.  the state should be fairly simple, a few markers.  nothing more
09:52 < mcallan> conseo: i don't understand why it needs a massive database of message urls.  i think it just needs a few markers
10:03 < mcallan> a few markers per forum, so maybe a table keyed by forum is best.  pipermail harvester (1) starts fetch for forum f, (2) reads row f from state table, (3) does fetch, (4) writes row f, (5) suspends awaiting next fetch request from scheduler
10:05 < mcallan> and the markers are things like (a) earliest message harvested, (b) latest message harvested.  understanding is that all messages between (a) and (b) are harvested, and all others are not.  that's all
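A minimal sketch of the marker scheme described above, assuming messages can be numbered 0, 1, 2, ... in archive order and that a fetch only walks forward from marker (b). Every name here (MarkerStateSketch, Markers, stateTable, fetch) is hypothetical and not part of votorola; the in-memory map only stands in for a persistent table.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names come from votorola. */
class MarkerStateSketch {

    /** One row of the state table: messages earliest..latest (inclusive) are harvested. */
    static final class Markers {
        final int earliest; // (a) earliest message harvested
        final int latest;   // (b) latest message harvested

        Markers(int earliest, int latest) {
            this.earliest = earliest;
            this.latest = latest;
        }

        boolean covers(int message) {
            return message >= earliest && message <= latest;
        }
    }

    /** Stand-in for a persistent table keyed by forum; a real one would outlive the VM. */
    static final Map<String, Markers> stateTable = new HashMap<>();

    /**
     * One fetch for a forum whose archive currently ends at message newestInArchive.
     * Returns how many messages were newly harvested.
     */
    static int fetch(String forum, int newestInArchive) {
        Markers row = stateTable.get(forum);            // (2) read row f
        int from = (row == null) ? 0 : row.latest + 1;  // resume just past marker (b)
        if (newestInArchive < from) {
            return 0;                                   // nothing new since the last run
        }
        int harvested = newestInArchive - from + 1;     // (3) do the fetch (stubbed as a count)
        int earliest = (row == null) ? 0 : row.earliest;
        stateTable.put(forum, new Markers(earliest, newestInArchive)); // (4) write row f
        return harvested;                               // (5) caller suspends until next request
    }
}
```

The state per forum stays at two integers no matter how large the archive grows, which is the property mcallan is arguing for.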
11:16 < conseo> the point is that markers are an unnecessary abstraction. they mean urls anyway. the harvester knows the timeframe covered by a url and can therefore (instead of adding two markers for the timeframe covered) simply take the url for this timeframe
11:16 < conseo> mcallan: also the 10 million urls are ridiculous and i can even keep the db clean, as i just pictured
11:17 < conseo> for example the url state saving scales with time whenever a fixed time frame (like a month for pipermail, a day for irc...) is accessible to the harvester
11:19 < conseo> for a 5 year archive this means 12 * 5 = 60 urls in total + the posts from the current month (let's say, in a busy case, 200 posts on average)
11:19 < conseo> makes ~250 urls per forum, for 1000 forums = 250,000 urls, but it only grows by one url per month per archive in this fixed time frame case
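The estimate above can be spelled out as a short calculation. This is only a sketch: the class and method names are hypothetical, and the inputs are the figures from the chat (5 years, 200 posts in the current month, 1000 forums). Unrounded, it gives 260 urls per forum and 260,000 in total, which the chat rounds to ~250 and 250,000.

```java
/** Hypothetical sketch of the url-count estimate; names are illustrative only. */
class UrlCountSketch {

    /** One url per archived month, plus one url per post in the current month. */
    static long urlsPerForum(int years, int postsThisMonth) {
        return 12L * years + postsThisMonth;
    }

    /** Total stored urls when every forum has the same age and activity. */
    static long totalUrls(int forums, int years, int postsThisMonth) {
        return forums * urlsPerForum(years, postsThisMonth);
    }

    public static void main(String[] args) {
        System.out.println(urlsPerForum(5, 200));    // 260
        System.out.println(totalUrls(1000, 5, 200)); // 260000
    }
}
```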
11:20 < conseo> if the archive only allows us to go backwards (e.g. phpbb forum or some kind of feed), then we scale with these fixed backward intervals (which have to be static and have to have a static url)
11:21 < conseo> the problem with markers is that we can't just say this is the start and the end of the currently covered archive part, because harvesting can be fragmented
11:22 < conseo> we then need many markers, one for each month in case of pipermail and have to close gaps once the harvesting is really finished
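To illustrate the fragmentation concern, here is a sketch of what marker bookkeeping might look like once harvesting is no longer one contiguous run: each harvested stretch needs its own (start, end) pair, and pairs are merged when a gap closes. It assumes messages are numbered by integer index; the class and method names are hypothetical and not taken from votorola.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of marker pairs under fragmented harvesting. */
class FragmentedMarkersSketch {

    /** start -> end (inclusive) of each contiguous harvested range. */
    final TreeMap<Integer, Integer> ranges = new TreeMap<>();

    /** Records start..end as harvested, merging overlapping or adjacent ranges. */
    void addHarvested(int start, int end) {
        // Merge with the range beginning at or before start, if it touches.
        Map.Entry<Integer, Integer> e = ranges.floorEntry(start);
        if (e != null && e.getValue() >= start - 1) {
            start = e.getKey();
            end = Math.max(end, e.getValue());
            ranges.remove(e.getKey());
        }
        // Merge with every later range that the new one now reaches.
        e = ranges.ceilingEntry(start);
        while (e != null && e.getKey() <= end + 1) {
            end = Math.max(end, e.getValue());
            ranges.remove(e.getKey());
            e = ranges.ceilingEntry(start);
        }
        ranges.put(start, end);
    }

    /** Number of (a)/(b) marker pairs currently needed. */
    int markerPairs() {
        return ranges.size();
    }
}
```

Harvesting 0..9 and 20..29 leaves two marker pairs; harvesting 10..19 afterwards closes the gap and collapses them into one.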
11:22 < conseo> mcallan: i am ready to skype if this is helpful
11:22 < conseo> otherwise i can try to clarify in the wiki or on the list, if this is first needed
11:23 < conseo> the motivation to remove the marker abstraction is to keep state-saving out of the harvester.
11:24 < conseo> the markers might be a better abstraction and save space, but what a harvester in fact does when it schedules further urls in the archive is exactly to check whether this url has already been harvested
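A sketch of the url-based alternative conseo describes, assuming the state kept outside the harvester is just a set of already-harvested urls. The names (UrlStateSketch, schedule, markHarvested) are hypothetical, and the in-memory set stands in for whatever store the harvest-runner would actually use.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of per-url state saving; not votorola's API. */
class UrlStateSketch {

    /** Urls that have been completely harvested (stand-in for a database table). */
    private final Set<String> harvested = new HashSet<>();

    /** Returns only those candidate urls that still need a harvest job. */
    List<String> schedule(List<String> candidates) {
        List<String> jobs = new ArrayList<>();
        for (String url : candidates) {
            if (!harvested.contains(url)) {
                jobs.add(url);
            }
        }
        return jobs;
    }

    /** Called once a url (e.g. one archived month) has been fully processed. */
    void markHarvested(String url) {
        harvested.add(url);
    }
}
```

The harvester itself holds no markers here; it proposes urls and the runner filters them, which is the separation conseo is after.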
11:25 < conseo> the only problem i can see is non-static archives, which change the pov of the harvester, which then has to adjust urls dynamically
11:25 < conseo> i haven't seen such an archive yet, because it would not be linkable from the outside (somewhat pointless)
11:29 < conseo> we can change the whole design of a harvester and try to really process sequentially, where each job only directly schedules the next job, but this would not match how an archive is generally presented
11:33 < conseo> (next meaning the directly previous message in the archive or the way to it)
11:34 < mcallan> turning on my skype box...
--- Log closed Sun Sep 02 00:00:33 2012