--- Log opened Sat Sep 01 00:00:44 2012
02:29 < conseo> yes
02:31 < conseo> i have summed up the design rationale, besides the javadocs on the wiki. maybe you could have a look? http://zelea.com/w/User:4consensus_WebDe/Harvester/Development/Design
02:32 < conseo> i would like to know if you agree with the harvest-runner state saving design on a per url basis.
02:32 < conseo> i am off for today, cu tomorrow
--- Log opened Sat Sep 01 09:10:23 2012
09:36 < mcallan> refreshing my memory. kicks are to be manual: http://mail.zelea.com/list/votorola/2012-June/001388.html
09:37 < mcallan> my understanding of the arch is still this: http://zelea.com/project/votorola/_/javadoc/votorola/a/diff/harvest/package-summary.html#package_description
09:37 < mcallan> so you are talking about the state of a harvester
09:38 < mcallan> my last understanding of that was http://mail.zelea.com/list/votorola/2012-May/001375.html
09:39 < mcallan> (sorry, that is no longer relevant. it was for an auto-kicker, not harvester)
09:45 < mcallan> this is all for the pipermail harvester. when it runs, it needs to recall state from the last run, which might have been in a different vm
09:50 < mcallan> http://zelea.com/project/votorola/_/javadoc/votorola/a/diff/harvest/PipermailHarvester.html
09:51 < mcallan> it's an iteration, walking (backwards i think) through the archive. the state should be fairly simple, a few markers. nothing more
09:52 < mcallan> conseo: i don't understand why it needs a massive database of message urls. i think it just needs a few markers
10:03 < mcallan> a few markers per forum, so maybe a table keyed by forum is best. pipermail harvester (1) starts fetch for forum f, (2) reads row f from state table, (3) does fetch, (4) writes row f, (5) suspends awaiting next fetch request from scheduler
10:05 < mcallan> and the markers are things like (a) earliest message harvested, (b) latest message harvested.
understanding is that all messages between (a) and (b) are harvested, and all others are not. that's all
11:16 < conseo> the point is that markers are an unnecessary abstraction. they mean urls anyway. either the harvester knows the timeframe covered by a url and therefore can (instead of adding two markers for the timeframe covered) simply take the url for this timeframe
11:16 < conseo> mcallan: also the 10 million urls are ridiculous and i can even keep the db clean, as i just pictured
11:17 < conseo> for example the url state saving scales with time whenever a fixed time frame (like month for pipermail, day for irc...) is accessible to the harvester
11:19 < conseo> for a 5 year archive this means 12 * 5 = 60 urls in total + the posts from the current month (let's say it is in a busy case on average 200 posts)
11:19 < conseo> makes ~250 posts for 1000 forums = 250.000 urls, but it only grows by one url per month per archive in this fixed time frame case
11:20 < conseo> if the archive only allows us to go backwards (e.g. phpbb forum or some kind of feed), then we scale with these fixed backward intervals (which have to be static and have to have a static url)
11:21 < conseo> the problem with markers is that we can't just say this is the start and the end of the currently covered archive part, because harvesting can be fragmented
11:22 < conseo> we then need many markers, one for each month in case of pipermail and have to close gaps once the harvesting is really finished
11:22 < conseo> mcallan: i am ready to skype if this is helpful
11:22 < conseo> otherwise i can try to clarify in the wiki or on the list, if this is first needed
11:23 < conseo> the motivation to remove the marker abstraction is to keep state-saving out of the harvester.
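
The two state-saving designs argued over above can be compared in a small sketch. It is only an illustration under assumptions, not code from votorola: the class names `MarkerState` and `UrlState`, the helper `month_url`, and the example.org url layout are all hypothetical. It shows the fragmentation problem conseo raises (a single earliest/latest marker pair reports a skipped month as harvested) and reproduces the url estimate, where 60 + 200 gives 260 urls per forum, which the log rounds to ~250.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class MarkerRow:
    """One state-table row per forum: all months between the markers count as harvested."""
    earliest: Optional[str] = None  # "YYYY-MM" strings sort chronologically
    latest: Optional[str] = None


class MarkerState:
    """Marker design (hypothetical sketch): a table keyed by forum, two markers per row."""

    def __init__(self) -> None:
        self.rows: Dict[str, MarkerRow] = {}

    def record(self, forum: str, month: str) -> None:
        row = self.rows.setdefault(forum, MarkerRow())
        row.earliest = month if row.earliest is None else min(row.earliest, month)
        row.latest = month if row.latest is None else max(row.latest, month)

    def is_harvested(self, forum: str, month: str) -> bool:
        row = self.rows.get(forum)
        if row is None or row.earliest is None or row.latest is None:
            return False
        return row.earliest <= month <= row.latest


class UrlState:
    """Per-url design (hypothetical sketch): remember every static url harvested."""

    def __init__(self) -> None:
        self.done: Set[str] = set()

    def record(self, url: str) -> None:
        self.done.add(url)

    def is_harvested(self, url: str) -> bool:
        return url in self.done


def month_url(forum: str, month: str) -> str:
    """Assumed url layout for a monthly archive page; not the real pipermail scheme."""
    return f"http://example.org/{forum}/{month}.html"


def estimated_urls(years: int, posts_this_month: int, forums: int) -> int:
    """One url per archived month plus one per post of the current month, per forum."""
    return (12 * years + posts_this_month) * forums


def fragmented_demo() -> tuple:
    """Harvest August and May only, then ask both designs about June."""
    markers, urls = MarkerState(), UrlState()
    for month in ("2012-08", "2012-05"):  # June and July are skipped
        markers.record("f", month)
        urls.record(month_url("f", month))
    return (markers.is_harvested("f", "2012-06"),
            urls.is_harvested(month_url("f", "2012-06")))
```

With a single marker pair, June wrongly counts as harvested; avoiding that needs one marker pair per fragment, which is the "many markers" case mentioned at 11:22.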
11:24 < conseo> the markers might be a better abstraction and save space, but what a harvester in fact does when it schedules further urls in the archive, is exactly to check whether this url has already been harvested
11:25 < conseo> the only problem i can see is non static archives which change the pov of the harvester which has to adjust urls then dynamically
11:25 < conseo> i haven't seen such an archive yet, because it would not be linkable from the outside (somewhat pointless)
11:29 < conseo> we can change the whole design of a harvester and try to really process sequentially, where each job only directly schedules the next job, but this would not be like an archive is presented in general
11:33 < conseo> (next meaning the directly previous message in the archive or the way to it)
11:34 < mcallan> turning on my skype box...
--- Log closed Sun Sep 02 00:00:33 2012
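
The idea of keeping state-saving out of the harvester, with the check "has this url already been harvested" made at scheduling time, might look roughly like the sketch below. It is an assumption-laden illustration: `HarvestRunner`, the `Job` signature, and the toy `ARCHIVE` are hypothetical names, and an in-memory set stands in for whatever database table the real runner would use.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Set

# A job fetches one url and returns the further urls it found (hypothetical signature).
Job = Callable[[str], Iterable[str]]


class HarvestRunner:
    """Hypothetical runner that owns all state; the harvester job itself stays stateless."""

    def __init__(self, job: Job) -> None:
        self.job = job
        self.harvested: Set[str] = set()  # stand-in for a persistent table
        self.queue: Deque[str] = deque()

    def schedule(self, url: str) -> bool:
        """Queue the url unless it is already harvested or already queued."""
        if url in self.harvested or url in self.queue:
            return False
        self.queue.append(url)
        return True

    def run(self) -> List[str]:
        """Process the queue and return the urls fetched, in order."""
        fetched: List[str] = []
        while self.queue:
            url = self.queue.popleft()
            found = list(self.job(url))
            self.harvested.add(url)  # state is saved by the runner, per url
            fetched.append(url)
            for next_url in found:
                self.schedule(next_url)
        return fetched


# Toy archive: an index page linking to monthly pages, newest first.
ARCHIVE: Dict[str, List[str]] = {
    "index": ["2012-08", "2012-07"],
    "2012-08": ["2012-07"],
    "2012-07": [],
}


def toy_job(url: str) -> List[str]:
    """Pretend to harvest a page and report the urls it links to."""
    return ARCHIVE.get(url, [])
```

In this sketch a second request for an already harvested url is simply refused by `schedule`, so a later run, possibly in a different vm, would only need the stored set of urls to pick up where the last one stopped.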