public class PipermailHarvester extends Object
A harvester implementation for Pipermail archives, which are generated by Mailman. Harvested messages are stored in HarvestCache, and fetchers are scheduled with HarvestRunner.
The harvester performs a depth-first search over the archive, visiting pages in date order. The remote archive, as seen through its web view, looks like this:
    Tree of linked pages.          Level     Example remote archive URLs

              r1                   root      http://mail.zelea.com/list/votorola/
             /  \
            /    \
           /      \
          /        \
        i2          i3             index     2010-Jan/date.html
       /  \        /|\
      /    \      / | \
    m4      m5  m6 m7 m8           message   2010-Jan/003321.html
Each fetcher depends on a single HTTP request. Runtime steps for these fetchers, with the result after each step:

    after r1: i2-i3
    after i2: m4-m5-i3
    after i3: m6-m7-m8
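The runtime steps above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the class name `PipermailTraversalSketch`, the `LINKS` map, and the use of a plain `ArrayDeque` are illustrative and are not taken from the harvester, which schedules its fetchers through HarvestRunner.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

/** Illustrative simulation of the depth-first, date-ordered walk (not the real harvester). */
final class PipermailTraversalSketch {

    // Hypothetical link structure mirroring the example tree:
    // each index page maps to the pages it links to, oldest first.
    static final Map<String, List<String>> LINKS = Map.of(
            "r1", List.of("i2", "i3"),
            "i2", List.of("m4", "m5"),
            "i3", List.of("m6", "m7", "m8"));

    /** Returns the pending fetchers as they look after each index page is processed. */
    static List<String> trace() {
        Deque<String> pending = new ArrayDeque<>();
        List<String> states = new ArrayList<>();
        pending.addFirst("r1");
        while (!pending.isEmpty()) {
            String page = pending.removeFirst(); // stands in for one HTTP request
            List<String> children = LINKS.getOrDefault(page, List.of());
            if (children.isEmpty()) {
                continue; // message page: nothing further to schedule
            }
            // Push children to the front in reverse, so the oldest child is fetched next
            // and the remaining siblings of the parent stay behind them (depth first).
            for (int i = children.size() - 1; i >= 0; i--) {
                pending.addFirst(children.get(i));
            }
            states.add("after " + page + ": " + String.join("-", pending));
        }
        return states;
    }

    public static void main(String[] args) {
        trace().forEach(System.out::println);
    }
}
```

Pushing the children to the front of the queue, rather than appending them, is what makes the walk depth-first: the messages of i2 are handled before the index i3 is opened.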
The state is saved by anonymously extending the last MessageFetcher, such as m5 or m8, as appropriate.
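The idea of saving state through an anonymous extension of the last message fetcher might look roughly like the following. Everything here is hypothetical: the stand-in `MessageFetcher` class, its `fetch` and `onFinished` methods, and the log used to record the result are assumptions made for illustration and do not reflect the actual signatures of PipermailHarvester.MessageFetcher.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of saving state via an anonymous subclass (hypothetical API). */
final class StateSavingSketch {

    /** Hypothetical stand-in for a message fetcher: one page, one request. */
    static class MessageFetcher {
        final String page;

        MessageFetcher(String page) {
            this.page = page;
        }

        /** Simulates the fetch; real code would issue the HTTP request here. */
        void fetch(List<String> log) {
            log.add("fetched " + page);
            onFinished(log);
        }

        /** Hook called once the page has been processed; does nothing by default. */
        void onFinished(List<String> log) {
        }
    }

    /** Fetches m4 and m5; only the last fetcher is extended to save the state. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        List<MessageFetcher> batch = new ArrayList<>();
        batch.add(new MessageFetcher("m4"));
        batch.add(new MessageFetcher("m5") {
            @Override
            void onFinished(List<String> log) {
                // Assumption: persisting the archive state is reduced to a log entry.
                log.add("state saved after " + page);
            }
        });
        for (MessageFetcher fetcher : batch) {
            fetcher.fetch(log);
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Attaching the save step to the last fetcher of a batch means the state is only recorded once every earlier message in that batch has been handled.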
Note: This is only an example. You can submit your fetchers differently, and you can also save state differently. Using HarvestRunner is nevertheless recommended for graceful I/O handling.
Modifier and Type | Class and Description |
---|---|
(package private) class | PipermailHarvester.Archive: Structure to save state of a single archive. |
(package private) class | PipermailHarvester.ContextFetcher: Provides update context. |
(package private) class | PipermailHarvester.MessageFetcher: A fetcher for a single page, like "2010-Jan/003882.html". |
(package private) class | PipermailHarvester.MyKickReceiver |
(package private) class | PipermailHarvester.PeriodFetcher: An index fetcher for one period, like "2010-January/date.html" or "2012/date.html". |
(package private) class | PipermailHarvester.RootFetcher: An index fetcher for the whole pipermail archive, like "http://mail.zelea.com/list/votorola/". |
(package private) class | PipermailHarvester.UpdateContext: Context to run fetchers and synchronize state during an update. |
Modifier and Type | Field and Description |
---|---|
(package private) Map<String,PipermailHarvester.Archive> | archives: Tracks all configured archives. |
static Pattern | PAT_AUTHOR: Pattern to find author email in PipermailHarvester.MessageFetcher. |
static Pattern | PAT_AUTHOR2: Pattern to parse author email in PipermailHarvester.MessageFetcher. |
static Pattern | PAT_CONTENT_END: Pattern to find the end of the body of the message in PipermailHarvester.MessageFetcher. |
static Pattern | PAT_CONTENT_START: Pattern to find the start of the body of the message in PipermailHarvester.MessageFetcher. |
static Pattern | PAT_INPUTENC: Find input encoding. |
static Pattern | PAT_LANG: Parse language. |
static Pattern | PAT_LISTINFO: Pattern for list-info page. |
static Pattern | PAT_PERIOD: Pattern to parse sub-list of posts for each month in PipermailHarvester.RootFetcher. |
static Pattern | PAT_POST: Pattern to parse list of posts in this month list in PipermailHarvester.PeriodFetcher. |
static Pattern | PAT_SENTDATE: Pattern to parse sent date in PipermailHarvester.MessageFetcher. |
Constructor and Description |
---|
PipermailHarvester(): Construct a new harvester for pipermail. |
public static final Pattern PAT_LISTINFO
public static final Pattern PAT_INPUTENC
public static final Pattern PAT_PERIOD
public static final Pattern PAT_POST
public static final Pattern PAT_AUTHOR
public static final Pattern PAT_AUTHOR2
public static final Pattern PAT_SENTDATE
public static final Pattern PAT_CONTENT_START
public static final Pattern PAT_CONTENT_END
final Map<String,PipermailHarvester.Archive> archives
public PipermailHarvester()
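The values of the PAT_* fields are not shown on this page. To illustrate how a pattern such as PAT_PERIOD could be applied to the root index, here is a minimal sketch with a made-up regular expression. The expression, the class name `PeriodPatternSketch`, and the sample HTML are all assumptions; only the period names "2010-January/date.html" and "2012/date.html" come from the descriptions above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative use of a period pattern on a root index page (hypothetical regex). */
final class PeriodPatternSketch {

    // Hypothetical pattern, NOT the actual value of PipermailHarvester.PAT_PERIOD.
    // Matches links such as href="2010-January/date.html" or href="2012/date.html".
    static final Pattern HYPOTHETICAL_PERIOD = Pattern.compile(
            "href=\"(\\d{4}(?:-[A-Za-z]+)?)/date\\.html\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns the period names found in the given root index HTML, in page order. */
    static List<String> periods(String rootHtml) {
        List<String> found = new ArrayList<>();
        Matcher matcher = HYPOTHETICAL_PERIOD.matcher(rootHtml);
        while (matcher.find()) {
            found.add(matcher.group(1)); // group 1 holds the period directory name
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "<a href=\"2010-January/thread.html\">[ Thread ]</a> "
                + "<a href=\"2010-January/date.html\">[ Date ]</a> "
                + "<a href=\"2012/date.html\">[ Date ]</a>";
        System.out.println(periods(sample));
    }
}
```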