--- Log opened Thu Jul 28 00:00:07 2011
22:10 < conseo> mcallan: are you there?
22:39 < conseo> i have thought about how i should approach scraping (or better, url calculation in general) in a way that is future proof and adjustable, and have thought about doing it roughly like this: http://wstaw.org/m/2011/07/29/Harvester.png
22:41 < conseo> at the moment we store the data for mail in the maildir/imap, but we listen live to irc. this means we cannot recalculate the database data if we corrupt it. it also means that we cannot extend the difference feed functionality in a backwards compatible way if we need something more than is stored in the db
22:42 < conseo> e.g. there might be some information in the message which will be needed to generate an update url for it
22:42 < conseo> if we miss it then we cannot link to the resource anymore
22:42 < conseo> s/resource/original post/
22:43 < conseo> storing the raw data also means we can safely scrape/generate the url, as we can easily replay data-to-update anytime and fix the db
22:44 < conseo> i can also post that to the list, just thought you would be here and i could talk to you about it
22:48 < conseo> the alternative is to not store the raw data, just store the data as well as we can in the db directly, and then regenerate urls from the servlet, if they are missing in the db, via the scriptable (scraping or non scraping) config routine
22:49 < conseo> but if we generate the url on request from the servlet we might get problems there if scraping takes too long
23:11 < conseo> we can also write a url update utility which checks the urls' validity and then calls the scriptable scraper separately from the rest
--- Log closed Fri Jul 29 00:00:23 2011
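
A minimal sketch of the replay idea from the log: archive each raw irc message before deriving anything from it, so that the db rows (including urls) can be rebuilt later with a different url calculator. Everything here is an assumption made for illustration, not the project's actual code: the JSON-lines archive file, the message fields (`channel`, `sender`, `time`, `text`), the function names `append_raw`, `read_raw`, `replay` and `demo`, and the `logs.example.org` url scheme are all hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterable, List

# Hypothetical shapes; none of these names come from the log.
RawMessage = Dict[str, str]
UrlCalculator = Callable[[RawMessage], str]


def append_raw(path: str, message: RawMessage) -> None:
    """Append one raw message to the archive before anything is derived from it."""
    with open(path, "a", encoding="utf-8") as archive:
        archive.write(json.dumps(message) + "\n")


def read_raw(path: str) -> Iterable[RawMessage]:
    """Yield archived messages in the order they were written."""
    with open(path, encoding="utf-8") as archive:
        for line in archive:
            if line.strip():
                yield json.loads(line)


def replay(path: str, calculate_url: UrlCalculator) -> List[Dict[str, str]]:
    """Rebuild the derived rows from scratch with the given url calculator."""
    rows = []
    for message in read_raw(path):
        rows.append({
            "sender": message["sender"],
            "time": message["time"],
            "url": calculate_url(message),
        })
    return rows


def demo() -> List[Dict[str, str]]:
    """Archive two messages, then derive rows from the stored raw data."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "irc-raw.jsonl")
        append_raw(path, {"channel": "#example", "sender": "alice",
                          "time": "2011-07-28T22:39", "text": "first post"})
        append_raw(path, {"channel": "#example", "sender": "bob",
                          "time": "2011-07-28T22:41", "text": "a reply"})

        # Assumed url scheme: one page per channel and day.
        def by_day(message: RawMessage) -> str:
            day = message["time"][:10]
            channel = message["channel"].lstrip("#")
            return "http://logs.example.org/" + channel + "/" + day

        return replay(path, by_day)
```

Because `replay` reads only the archive, a corrupted db or an improved calculator (scraping or non scraping) can be handled by running it again offline, which keeps slow scraping out of the servlet's request path.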