--- Log opened Thu Mar 22 00:00:43 2012
14:29 < mcallan> conseo: i saw some of your crawls hit the list. how goes the battle?
14:40 < conseo> mcallan: had a weird bug, but i have now found a mashup of the httpcore-nio examples which actually issues an HTTP GET asynchronously
14:41 < conseo> it took me the last week (with some interruptions) to get the idea of how to build the reactor pattern with httpcore, i hope i can progress faster now
14:48 < mcallan> i've had some learning to do myself. mine is a smaller job though and it looks like i'm on the home stretch. if it tests okay today, then i'll clean up the wiki pages and release it.
14:53 < conseo> mcallan: ok. what have you done?
15:03 < mcallan> http://zelea.com/project/votorola/_/javadoc/votorola/s/gwt/stage/Stage.html
15:05 < mcallan> what i'm testing now isn't documented there: the relay of state across active links. so if you navigate from the bridge (say) to a draft, then any track (or other component like the footings) may optionally initialize the stage such that it automatically selects the same difference as shown on the referring page (bridge or whatever). that diff is then shown on all tracks etc. sort of like here: http://zelea.com/w/User:Test-bg-ZeleaCom/G/p/sandbox#4176-4168
15:06 < mcallan> but it is done without a fragment (#), because that won't work when we have dozens of state variables, aside from diff key
15:12 < conseo> hmm, so you basically make the javascript environment stateful over all pages?... i need to have a look
15:12 < conseo> i only remember sessionid from very old php times (was php 4 i think... :-) )
15:13 < mcallan> yes, so it's like the stage was part of the browser, and not a single page (but it's all done on the client side, where possible - no storage on server)
15:15 < conseo> ok, sounds nifty, i need to dive in a bit to understand what you do
15:15 < mcallan> i will post a demo, hopefully tonight or tomorrow
15:19 < conseo> ok
15:31 < conseo> mcallan: hmm WP_D looks interesting, what does setCacheable(boolean) refer to?
16:19 < mcallan> http://zelea.com/project/votorola/_/javadoc/votorola/a/web/wic/VPageHTML.html#setCacheable(boolean)
16:19 < mcallan> the bridge is a stateless page, so it's cacheable
16:20 < mcallan> (most of our wicket pages are cacheable)
17:05 < conseo> ok. what is the concept? stateless on the client? do you use html5 storage functionality?
17:10 < mcallan> yes, session store to persist state of single page for back and forth nav. but cookie for passing state across links
17:11 < conseo> so you can access the session store with the help of the cookie beyond cross-domain restrictions?
17:12 < conseo> (i mean single origin policy)
17:12 < conseo> mcallan: it schedules and runs :-)
17:13 < mcallan> no, session store and cookie are used for 2 different purposes. what session store is used for is documented here: http://zelea.com/project/votorola/_/javadoc/votorola/s/gwt/stage/Stage.html
17:13 < mcallan> cookie store is for relaying state *between* pages in a link relation, and i will doc that today
17:15 < mcallan> neither will work in all cases, and we'll have to resort to server side storage. i will code that for next release, and we'll add the stage to metagov's wiki
17:15 < mcallan> (that will require server side store, i think)
17:15 < conseo> ok
17:16 < mcallan> all of this complexity is hidden from tracks and other props on stage. they just have to obey the api
17:16 < conseo> mcallan: will you remove the old navigation from the pollwiki (the nav on the right side)?
17:16 < mcallan> yes, tonight
17:17 < conseo> huuh, tonight is the night :-D
17:17 < mcallan> gonna be alright :-)
17:18 < conseo> i have to figure out how to properly scatter future crawls so they don't lump
17:18 < mcallan> you mean how to ensure the minimum gap (5 s or whatever) is respected
17:19 < conseo> well, it doesn't have to be respected absolutely, only on average
17:20 < mcallan> you sure?
17:20 < mcallan> methought that wuz whole purpose of scheduling
17:21 < conseo> the purpose is to be gentle, but we can do two crawls in a second and then wait for 10 or more imo
17:21 < conseo> we will have bursts anyway, when we get a DiffKick event
17:21 < mcallan> you sure that's ok?
17:21 < conseo> because waiting 5 secs for 2-3 calls will break your 10 sec maximum delay
17:22 < conseo> don't know
17:22 < conseo> i don't know if even 5 secs is ok
17:22 < conseo> this depends on the admin
17:22 < mcallan> i think we'll get our asses blacklisted if we don't obey robots.txt
17:22 < conseo> i would even do it slower in the background and only do bursts when absolutely necessary
17:23 < mcallan> bursts should never be necessary
17:23 < conseo> they are, at least two requests are necessary for pipermail
17:23 < conseo> or we can't predict how long it takes...
17:25 < conseo> well, can you restrict that in robots.txt? looking...
17:25 < mcallan> this is why i wanted to see your algorithm for a single scheduled job: what it does 1, 2, 3 - because it *can* be scheduled to avoid hitting server in ways that would disobey its robots.txt
17:26 < conseo> ok, i will show you. the work now was necessary to understand how an asynchronous http reactor pattern works. i should be able to document it now
17:26 < conseo> ok
17:26 < conseo> the parameter is called "Crawl-delay"
17:28 < mcallan> hmmm it looks non-standard. i think if you don't hit more than once a second, nobody will complain. later you can obey special robots.txt directives
17:28 < conseo> yes, non-standard. ok
17:28 < mcallan> but you are right, and very short bursts (2 or 3 req's) should be ok
17:30 < mcallan> so this will greatly simplify the structure of your jobs. but when you know you have 2-3 fetches per job, you should probably give that harvester a 2.5 second min gap, right?
17:31 < mcallan> or better yet, since each job must probably schedule the next job, it will calculate the gap based on how many fetches it made
17:32 < conseo> well, you gave me 10 seconds, so i can divide that in any way i like, right :-D
17:32 < conseo> i can also decide on running the jobs to reschedule them later if already one job for this second and this url exists
17:32 < conseo> (hostname)
17:33 < mcallan> (but 10 is prob too extreme, that's one thing. go with 1 s)
17:33 < conseo> i have a stepping of 1s
17:33 < conseo> and then the delay is set in steps
17:34 < mcallan> yes, forget what i said about one job scheduling another...
17:34 < conseo> they do
17:35 < mcallan> in any case, if you need to schedule a job, schedule it for lastJob+gap
17:35 < conseo> at least atm. i will try to present the design to you and then you can criticize it in any way you see necessary
17:36 < mcallan> (but *before* scheduling, you must set lastJob+=gap) ok, again i need to see 1,2,3 for a single job. that is all
17:36 < conseo> yes
17:37 < conseo> will the gap be configurable by the wiki or should i assume a default value of let's say 10s (for background, not burst)
17:38 < mcallan> average of a fetch per second should be okay, i imagine
17:39 < mcallan> no need to config, in future obey robots.txt, that's all
17:40 < conseo> one fetch per second? ok
17:40 < mcallan> sure, that's gentle enough
17:50 < conseo> ok, running a test that way on metagov now
17:52 < conseo> i will document it tonight :-) and post to the list
17:52 < mcallan> ok
20:52 < conseo> mcallan: i have the javadocs in my example pipermail harvester. is this enough? it creates three types of jobs. you can directly read PipermailHarvester to see how they trigger each other to sequentially crawl the archive backwards.
20:53 < conseo> i can also commit this for you
20:59 < mcallan> not sure till i see. more likely too much, rather than too little ;-)
21:01 < mcallan> (crawling backwards is a good idea, no need to wait for the newest)
21:02 < conseo> yes, with irc and pipermail we can simply crawl backwards (although post ids are not completely consistent, due to date distortions)
21:39 < conseo> mcallan: i am commenting all system.out.printlns out now. i would like to commit then. javadoc is uptodate and code is running, although i get no messages, because:
21:39 < conseo> votorola.g.MediaWiki$IDException: No such page revision(s): 2962 ...
21:40 < conseo> i have to work on the Cache still though, i likely have to update something
21:40 < conseo> votorola.a.position.PointerRevision$MalformedPageException: not a proper draft pointer: rev 4026
21:43 < mcallan> conseo: i think there are one or two malformed URLs in the metagov list, from a temporary version of the bridge. maybe that's what you hit?
21:44 < mcallan> i think it was over a year ago, maybe two years
21:44 < conseo> maybe, but the cache is not too critical to document and demo the scheduler. i will have a look at it tomorrow
21:44 < conseo> ok
21:46 < mcallan> sure, as long as nothing is broken for any of the code we actually run on the server, then it's ok to commit if need be. i guess your server is still down, and you can't post stuff
21:48 < conseo> mcallan: my server is up and you can pull from there. i don't know what caused the trouble though, still have to check two pci cards
21:49 < conseo> i can push to you directly if you prefer that.
21:50 < conseo> i can also expose just the javadocs, but then pulling and looking at both the docs and the example code is more reasonable imo
21:51 < conseo> is that ok for you or am i causing you more work than necessary? the PipermailHarvester code is really straightforward imo
21:56 < mcallan> you know, i just want 1,2,3 design. if i have to parse code to read it, then i can do that to certain extent. it's probably too much work for you to serve the code and repos the way i do, e.g. here http://zelea.com/project/votorola/
21:56 < mcallan> and here: http://zelea.com/var/db/repo/
21:57 < mcallan> i will do whatever you tell me. no need to ask always, just say here it is!
21:58 < conseo> ok. here it is. gn8 :-)
21:58 < mcallan> wait a minute... where?
21:58 < conseo> i'll try to set something similar up, i have a new domain, but it takes some time, because i have dynip and i nee
21:59 < conseo> sry
21:59 < conseo> it is here: sftp://mike@whiletaker.homeip.net//opt/WORK
22:00 < mcallan> don't do extra work for nothing, i don't mind looking on obsidian for your stuff, or whatever
22:00 < conseo> s/sftp/ssh/
22:00 < conseo> it ran there, but atm. the server is debian stable with java 1.6 and i haven't updated it yet
22:00 < mcallan> i seem to need a password
22:01 < conseo> oh, it is port 2222
22:01 < conseo> (as it has been before my server failure)
22:01 < conseo> same place
22:01 < mcallan> ah, i missed your subst there, and put ftp in my browser :-)
22:02 < mcallan> sure, i will pull and look. you are crashing for the night?
22:02 < conseo> sooner or later, yes. it is already late again :-)
22:03 < conseo> although i am quite happy that it finally works. it gave me some headaches :-)
22:04 < conseo> i have watched a lot of the documentaries from thoughtmaybe and they have all been very good. it was time well spent, thx!
22:06 < mcallan> welcome. i've been watching some too, which is unusual for me. what classes did you say i should look at, or were you gonna post a brief note about that?
22:07 < conseo> nobody except you cares for it atm. just look at PipermailHarvester, from there you automatically get linked to HarvestJob and HarvestRunner as well as HarvestHistory which cover all the necessary functionality
22:08 < conseo> (i hope) i still have to move HarvestHistory directly in HarvestJob, I think now, but i have left that for now
22:08 < conseo> (i mean the usage of HarvestHistory)
22:08 < conseo> i can post though, once we agree that this is reasonable
22:09 < conseo> which ones did you watch?
22:11 < mcallan> many of curtis's. you are crashing right? i can look at the code after supper, and we can hook up again in the am. i am still trying to finish my own work
22:12 < mcallan> or i can look now if you are waiting
22:14 < conseo> nope, i'll go to bed now. we can discuss it whenever you find the time to have a look
22:14 < conseo> tomorrow sounds fine
22:14 < mcallan> ok, that will be in about an hour
22:14 < conseo> yes the ones from curtis are impressive. i wish we would have watched these at school
22:15 < conseo> ok, but i am wasted. i'd still prefer to go to bed now, if you don't mind
22:16 < mcallan> i meant i will have time to look at it in one hour, and then we talk later
22:16 < mcallan> go to sleep... :-)
22:16 < mcallan> i like this one, next best: http://thoughtmaybe.com/video/the-power-of-nightmares
22:27 < conseo> ok, that one is still on my todo. this is one of my favourites so far: http://thoughtmaybe.com/video/the-century-of-the-self
22:27 < conseo> gn8 and cu tomorrow :-)
22:27 < mcallan> yes, i liked that one too. n8 c
--- Log closed Fri Mar 23 00:00:59 2012