--- Log opened Thu Jul 28 00:00:07 2011
22:10 < conseo> mcallan: are you there?
22:39 < conseo> i have thought about how i should approach scraping (or, more generally, url calculation) in a way that is future-proof and adjustable, and have thought about doing it roughly like this: http://wstaw.org/m/2011/07/29/Harvester.png
22:41 < conseo> at the moment we store the data for mail in the maildir/imap, but we listen live to irc. this means we cannot recalculate the database data if we corrupt it. it also means that we cannot extend the difference feed functionality in a backwards compatible way if we need something more than is stored in the db
22:42 < conseo> e.g. there might be some information in the message which will be needed to generate an update url for it
22:42 < conseo> if we miss it then we cannot link to the resource anymore
22:42 < conseo> s/resource/original post/
22:43 < conseo> this also means we can safely scrape/generate the url as we can easily replay data-to-update anytime and fix the db
22:44 < conseo> i can also post that to the list, just thought you would be here and i could talk to you about it
22:48 < conseo> alternative is to not store the raw data and just try to store the data as well as we can in the db directly, and then regenerate urls from the servlet if they are missing in the db, via the scriptable (scraping or non-scraping) config routine
22:49 < conseo> but if we generate the url on request from the servlet we might get problems there if scraping takes too long
23:11 < conseo> we can also write a url update utility which checks the urls' validity and then calls the scriptable scraper separately from the rest
--- Log closed Fri Jul 29 00:00:23 2011
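
The design discussed in the log above (keep the raw messages, derive the db from them, replay when the db or the scraper changes, and re-check urls in a separate pass) could be sketched roughly as follows. This is only an illustration under assumptions: the names RawMessage, Harvester, receive, replay and refresh_urls are hypothetical and do not come from the project, the in-memory list and dict stand in for the real raw store and database, and the scraper is any callable supplied by the configuration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class RawMessage:
    """Hypothetical raw record kept for every harvested message."""
    source: str   # e.g. "irc" or "mail"
    msg_id: str
    body: str


# A scraper maps a raw message to the url of the original post, or None.
UrlScraper = Callable[[RawMessage], Optional[str]]


class Harvester:
    """Sketch: the raw log is the source of truth, the db is derived from it."""

    def __init__(self, scraper: UrlScraper) -> None:
        self.raw_log: List[RawMessage] = []      # stand-in for the raw store
        self.db: Dict[str, Optional[str]] = {}   # stand-in for the database
        self.scraper = scraper

    def receive(self, msg: RawMessage) -> None:
        # Store the raw message first, so nothing is lost if scraping fails.
        self.raw_log.append(msg)
        self.db[msg.msg_id] = self.scraper(msg)

    def replay(self, scraper: Optional[UrlScraper] = None) -> None:
        # Rebuild the whole db from the raw log, optionally with a new scraper.
        if scraper is not None:
            self.scraper = scraper
        self.db = {m.msg_id: self.scraper(m) for m in self.raw_log}

    def refresh_urls(self, is_valid: Callable[[str], bool]) -> List[str]:
        # Separate update pass: re-scrape only missing or invalid urls.
        by_id = {m.msg_id: m for m in self.raw_log}
        updated = []
        for msg_id, url in list(self.db.items()):
            if url is None or not is_valid(url):
                self.db[msg_id] = self.scraper(by_id[msg_id])
                updated.append(msg_id)
        return updated
```

Because refresh_urls runs outside the request path, a slow scraper would not block the servlet; the servlet would only ever read what is already in the db.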