--- Log opened Fri Mar 23 00:00:59 2012
04:07 < mcallan> conseo: looking now at your code. (1) DiffKick looks dangerous, because a harvest based on a kick should be no different than any other harvest, otherwise it might not be possible to regenerate the archive by a crawl harvest. to be sure of this, it would be best not to rely on contextual information from the kicker
04:08 < mcallan> (regenerate the cache of the archive)
04:16 < mcallan> (2) you receive a kick. you ignore its forum property, i guess because this is just test code with a hard-coded forum (ok). then it looks like you start a crawl consisting of many scheduled jobs of different types. this seems overcomplicated... or at least i don't see the design yet.
04:44 < mcallan> I think you need a solid design before you get too far into the code. I would start with a simple napkin sketch. Here's my rough attempt:
04:44 < mcallan> (a) Receive kick
04:44 < mcallan> (b) Schedule update job
04:44 < mcallan> (c) Let update job run and schedule further update jobs as needed
04:44 < mcallan> Let's look at the detail of (c), because that's obviously the heart of it.
04:44 < mcallan> (c1) Read local marker recording the last message M0 cached.
04:44 < mcallan> (c2) Find M0 in the remote archive.
04:44 < mcallan> (c3) If M0 is the latest message (no more to read), then quit.
04:44 < mcallan> (c4) Try incrementing local marker to next message M1, or goto (c1) if another job has since incremented it.
04:44 < mcallan> (c5) Read M1 from the remote archive.
04:44 < mcallan> (c6) If M1 contains a diff URL, then cache it.
04:44 < mcallan> (c7) If M1 is the latest message (no more to read), then quit.
04:44 < mcallan> (c8) Schedule another update job.
04:47 < mcallan> conseo: i'll be up in 10 hours or so, and we can discuss
05:06 < mcallan> this is what i meant by sketching the algorithm of a single job. note this design does not depend on the structure of the archive, and includes very few implementation details. the details do not matter a whole lot because they can always be changed after the fact. the design cannot be changed so easily once the code is written, so it's crucial to get it right. not sure this is right, but it's a first stab
17:37 < mcallan> conseo: ping me if you want to skype about this, or scheduling etc
19:54 < conseo> mcallan: i propose: a forward job is not a logical consequence of a kick event. no matter how high the latency for the event is, it tells us something about the newest jobs in the archive not the closest ones to our last harvest marker
19:59 < conseo> except you expect us to background update all archives every 5 minutes or so (?)
20:01 < conseo> i rather thought about an hour, but if i am too conservative here, then it all depends on how high the latency for the kick events is. i assume that 90% of the mails will reach the detector in <5min after they are in the online archive, is that wrong?
20:03 < mcallan> i think you misunderstand the algorithm, forward *always* starts with recent posts, even when harvester is run for first time. i can skype in 5 if u like
20:03 < conseo> ok
20:04 < conseo> ok, then we both mean forward jobs (going from now back to our last marker)?
20:04 < conseo> then we agree and i can do it btw.
20:08 < mcallan> no, update goes from near now to now. want to skype?
20:08 < conseo> why?
20:08 < conseo> which problem does that solve?
20:10 < mcallan> so we can talk and understand each other
20:10 < conseo> hopefully :-)
20:11 < mcallan> turning on the skypilizer...
--- Log closed Sat Mar 24 00:00:14 2012
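
The napkin sketch in steps (a) to (c8) above can be tried out in a few lines of code. What follows is only one possible reading of it, not the harvester's actual code. Every name is hypothetical (Marker, update_job, run_on_kick, DIFF_URL), the remote archive is modeled as a plain in-memory list of message strings, the marker is assumed to be the list index of the last cached message, and "contains a diff URL" is approximated by a made-up regular expression.

```python
import re
import threading
from collections import deque

# Hypothetical pattern: any http(s) URL containing the word "diff".
DIFF_URL = re.compile(r"https?://\S*diff\S*")


class Marker:
    """Local marker: index of the last message cached (M0). Assumed design."""

    def __init__(self, index=0):
        self._index = index
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._index

    def compare_and_set(self, expected, new):
        """Advance only if no other job has moved the marker since it was read."""
        with self._lock:
            if self._index != expected:
                return False
            self._index = new
            return True


def update_job(archive, marker, cache, schedule):
    """A single update job, following steps (c1) to (c8)."""
    while True:
        m0 = marker.read()                    # (c1) read local marker
        latest = len(archive) - 1             # (c2) locate M0 relative to the archive
        if m0 >= latest:                      # (c3) M0 is the latest message: quit
            return
        m1 = m0 + 1
        if marker.compare_and_set(m0, m1):    # (c4) claim M1, else go back to (c1)
            break
    message = archive[m1]                     # (c5) read M1
    match = DIFF_URL.search(message)
    if match:                                 # (c6) cache the diff URL, if any
        cache.append(match.group(0))
    if m1 >= latest:                          # (c7) M1 is the latest message: quit
        return
    schedule(update_job)                      # (c8) schedule another update job


def run_on_kick(archive, marker, cache):
    """(a) receive kick, (b) schedule an update job, (c) run jobs until none remain."""
    queue = deque([update_job])
    while queue:
        job = queue.popleft()
        job(archive, marker, cache, queue.append)
```

Each job handles at most one message and hands the rest of the work to the next job, so the marker is the only shared state. The compare-and-set in (c4) is what lets several jobs run against the same marker without caching a message twice. The job itself makes no use of the kick's contents, which matches the concern raised in point (1).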