Review (NS-97): FR 1691 (supersedes NS-95)

Author

BNF user

Moderator

Søren

State

Review

Objectives

1017:
[FROM NICOLAS @ BNF]
------------------------------------------------------------------------
FR 1691 Configure which Heritrix reports to include in metadata ARC file
------------------------------------------------------------------------
Three new setting properties have been added:
- settings.harvester.harvesting.metadata.heritrixFilePattern is a Java pattern that selects which files in the crawl dir (not recursively) to include in the metadata ARC.
- settings.harvester.harvesting.metadata.reportFilePattern is also a Java pattern; it controls which subset of the files selected by heritrixFilePattern are to be considered report files. All the other files will be considered setup files.
- settings.harvester.harvesting.metadata.logFilePattern is a third Java pattern; it controls which files in the logs subdirectory of the crawl dir are to be added as log files to the metadata ARC.
NOTE: FR 1691 also addresses Bug 808 (Should store crawl-manifest.txt)
Bug 808 also suggests storing conf/heritrix_properties, but why not take all files in conf and modules?
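
As an illustration of how the three patterns described above interact, here is a minimal sketch. It is not the code from revision 1017: the class name MetadataFileClassifier, the Kind enum, the example regular expressions, and the use of full-match semantics (Matcher.matches) are all assumptions made for the example.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical sketch of the file classification described in FR 1691.
 * Not the NetarchiveSuite implementation; names and regexes are assumed.
 */
class MetadataFileClassifier {

    /** Possible outcomes for a file found in the crawl dir or its logs subdirectory. */
    enum Kind { SETUP, REPORT, LOG, EXCLUDED }

    private final Pattern heritrixFilePattern;
    private final Pattern reportFilePattern;
    private final Pattern logFilePattern;

    MetadataFileClassifier(String heritrixRegex, String reportRegex, String logRegex) {
        this.heritrixFilePattern = Pattern.compile(heritrixRegex);
        this.reportFilePattern = Pattern.compile(reportRegex);
        this.logFilePattern = Pattern.compile(logRegex);
    }

    /** Classifies a file located directly in the crawl dir (no recursion). */
    Kind classifyCrawlDirFile(String fileName) {
        // Assumption: the whole file name must match the pattern.
        if (!heritrixFilePattern.matcher(fileName).matches()) {
            return Kind.EXCLUDED;
        }
        // Report files are a subset of the selected files; the rest are setup files.
        return reportFilePattern.matcher(fileName).matches() ? Kind.REPORT : Kind.SETUP;
    }

    /** Classifies a file located in the logs subdirectory of the crawl dir. */
    Kind classifyLogDirFile(String fileName) {
        return logFilePattern.matcher(fileName).matches() ? Kind.LOG : Kind.EXCLUDED;
    }

    public static void main(String[] args) {
        // Example patterns only; the real defaults live in the settings files.
        MetadataFileClassifier classifier = new MetadataFileClassifier(
                ".*\\.(xml|txt)", ".*-report\\.txt", ".*\\.log");
        for (String name : new String[] {"order.xml", "crawl-report.txt", "crawl-manifest.txt"}) {
            System.out.println(name + " -> " + classifier.classifyCrawlDirFile(name));
        }
        System.out.println("crawl.log -> " + classifier.classifyLogDirFile("crawl.log"));
    }
}
```

With these example patterns, crawl-manifest.txt (Bug 808) is selected by the first pattern but not by the report pattern, so it would be stored as a setup file.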

Total Time Used (Coding,Documentation,Review):

Time use (Coding,Documentation,Review)
Nicolas: 2 MD
SVC: 0.5 MD

General comments:

Description

Classification

Status

Comments on file 'trunk/src/dk/netarkivet/harvester/HarvesterSettings.java', revision 1017

Lines

Description

Classification

Status

Comments on file 'trunk/src/dk/netarkivet/harvester/harvesting/HarvestDocumentation.java', revision 1017

Lines

Description

Classification

Status

439

Maybe log here which log files were found.

Cosmetic

OK

Comments on file 'trunk/src/dk/netarkivet/harvester/harvesting/MetadataFile.java', revision 1019

Lines

Description

Classification

Status

General

We have two patterns for entries in the metadata.arc file: one for the cdx entry in dk.netarkivet.archive.indexserver.CDXDataCache.java, and one for the crawl-log entry in CrawlLogDataCache.java. I suggest they be moved to the MetadataFile class.

Cosmetic

OK

36, 43

Delete the @author tags; they are not used in the NetarchiveSuite code style.

Cosmetic

OK

45

Probably need a new MetadataType called 'index' to fit in the crawl/index/cdx entry.

Cosmetic

OK

72-80

Should/could it be verified that these patterns are mutually exclusive? Otherwise, add to the javadoc that the name of a Heritrix file is first tested against the reportfile pattern, then against the logfile pattern. If the name matches neither of these, it is considered a setup file.

Cosmetic

OK

126

Add javadoc

Cosmetic

OK

130

Add javadoc

Cosmetic

OK

134

Replace "URLS" with "URLs"

Cosmetic

OK

146

Add javadoc

Cosmetic

OK

IssuesFromNs97 (last edited 2010-08-16 10:24:39 by localhost)