= Review (NS-97): FR 1691 (supersedes NS-95) =
|| Author || BNF user ||
|| Moderator || Søren ||
|| State || Review ||

== Objectives ==
{{{
1017: [FROM NICOLAS @ BNF]
------------------------------------------------------------------------
FR 1691 Configure which Heritrix reports to include in metadata ARC file
------------------------------------------------------------------------
Three new setting properties have been added:
- settings.harvester.harvesting.metadata.heritrixFilePattern is a java pattern that allows to filter which files in the crawl dir (not recursively) to include in the metadata ARC.
- settings.harvester.harvesting.metadata.reportFilePattern is also a java pattern that controls which subset of the files selected by heritrixFilePattern are to be considered as report files. All the other files will be considered as setup files.
- settings.harvester.harvesting.metadata.logFilePattern is a third java pattern that controls which files in the logs subdirectory of the crawldir are to be added as log files to the metadata ARC.

NOTE: FR 1691 also addresses Bug 808 (Should store crawl-manifest.txt)
This bug also suggests storing the conf/heritrix_properties, but why not take all files in conf and modules.
}}}

'''Total Time Used (Coding,Documentation,Review)''':
{{{
Time use (Coding,Documentation,Review)
Nicolas: 2 MD
SVC: 0.5 MD
}}}

'''General comments''':
|| '''Description''' || '''Classification''' || '''Status''' ||

=== Comments on file 'trunk/src/dk/netarkivet/harvester/HarvesterSettings.java', revision 1017 ===
|| '''Lines''' || '''Description''' || '''Classification''' || '''Status''' ||

=== Comments on file 'trunk/src/dk/netarkivet/harvester/harvesting/HarvestDocumentation.java', revision 1017 ===
|| '''Lines''' || '''Description''' || '''Classification''' || '''Status''' ||
|| 439 || Maybe log here which log files were found. || Cosmetic || OK ||

=== Comments on file 'trunk/src/dk/netarkivet/harvester/harvesting/MetadataFile.java', revision 1019 ===
|| '''Lines''' || '''Description''' || '''Classification''' || '''Status''' ||
|| General || There are two patterns for entries in the metadata.arc file: one for the cdx entry in dk.netarkivet.archive.indexserver.CDXDataCache.java and one for the crawl-log entry in CrawlLogDataCache.java. I suggest moving them to the MetadataFile class. || Cosmetic || OK ||
|| 36, 43 || Delete the @author tag; it is not used in the NetarchiveSuite code style. || Cosmetic || OK ||
|| 45 || Probably need a new MetadataType called 'index' to fit the crawl/index/cdx entry. || Cosmetic || OK ||
|| 72-80 || Should/could it be verified that these patterns are mutually exclusive? Or add to the javadoc that the name of a Heritrix file is first tested against the reportfile pattern, then against the logfile pattern. If the name matches neither of these, it is considered a setup file. || Cosmetic || OK ||
|| 126 || Add javadoc || Cosmetic || OK ||
|| 130 || Add javadoc || Cosmetic || OK ||
|| 134 || Replace "URLS" with "URLs" || Cosmetic || OK ||
|| 146 || Add javadoc || Cosmetic || OK ||
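
The matching order described in the comment on lines 72-80 of MetadataFile.java (report pattern first, then log pattern, otherwise setup file) can be sketched as below. This is only an illustration of that order, not the code under review: the class name `MetadataKindSketch`, the `Kind` enum, and the two example patterns are assumptions made for this sketch, and the real patterns would come from the reportFilePattern and logFilePattern settings.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the matching order; not the actual MetadataFile class. */
final class MetadataKindSketch {

    /** Illustrative stand-in for the metadata type of a file. */
    enum Kind { SETUP, REPORT, LOG }

    // Example values only (assumed); the real values are read from settings.
    static final Pattern REPORT_FILES = Pattern.compile(".*-report\\.txt");
    static final Pattern LOG_FILES = Pattern.compile(".*\\.log");

    /** Tests the report pattern first, then the log pattern; anything else is a setup file. */
    static Kind classify(String fileName) {
        if (REPORT_FILES.matcher(fileName).matches()) {
            return Kind.REPORT;
        }
        if (LOG_FILES.matcher(fileName).matches()) {
            return Kind.LOG;
        }
        return Kind.SETUP;
    }

    /** True if a name matches both patterns, i.e. the patterns are not mutually exclusive for it. */
    static boolean isAmbiguous(String fileName) {
        return REPORT_FILES.matcher(fileName).matches()
                && LOG_FILES.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"crawl-report.txt", "crawl.log", "order.xml"}) {
            System.out.println(name + " -> " + classify(name));
        }
    }
}
```

Because the report pattern is tested first, a name that matches both patterns is silently treated as a report file. A check such as `isAmbiguous` over the files actually found is one way to surface overlapping patterns, if verifying mutual exclusivity is preferred over documenting the order in the javadoc.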