Review (NS-87): FR1678: Improved indexing for wayback

Author: Colin
Moderator: Colin
State: Closed

Objectives

See https://gforge.statsbiblioteket.dk/tracker/index.php?func=detail&aid=1678
and http://netarchive.dk/suite/ImprovedIndexing
The implemented code includes:
A batch job to extract wayback cdx indexes from archive arc files
A batch job to extract wayback cdx indexes from deduplication metadata records
An application to extract wayback cdx indexes from deduplication crawl logs
 + associated helper methods

Summary

follow-up: csr

Total Time Used (Coding,Documentation,Review):

CSR: 4 MD
SVC: 0.5 MD

General comments:

Description | Classification | Status
Several of the files need the SVN "svn:keywords" property set to "URL Revision Author Date Id" | Cosmetic | OK

Comments on file 'trunk/src/dk/netarkivet/wayback/DeduplicateToCDXApplication.java', revision 995

Lines | Description | Classification | Status
28 | Blank line | NA | NOTOK
50 | Missing argument validation | Cosmetic | NOTOK
55 | Explicitly close stream | Minor | NOTOK
70 | [spelling] wyaback => wayback | Cosmetic | NOTOK
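
The comments on lines 50 and 55 (argument validation, explicit stream closing) could be addressed along the following lines. This is only a sketch: the class name CrawlLogReaderSketch and the method countLines are hypothetical and are not taken from DeduplicateToCDXApplication.java, and the project's own validation helper, if one exists, would replace the plain IllegalArgumentException used here.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical illustration; not the reviewed class.
class CrawlLogReaderSketch {

    /**
     * Counts the lines of a crawl log.
     * The argument is validated up front and the stream is closed in a
     * finally block, so it is released even if reading fails.
     */
    static int countLines(File crawlLog) throws IOException {
        if (crawlLog == null) {
            throw new IllegalArgumentException("crawlLog must not be null");
        }
        if (!crawlLog.isFile()) {
            throw new IllegalArgumentException(
                    "Not a readable file: " + crawlLog.getPath());
        }
        BufferedReader reader = new BufferedReader(new FileReader(crawlLog));
        try {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } finally {
            reader.close(); // explicit close, as requested for line 55
        }
    }
}
```

The same pattern applies to whatever the application actually does with each line: validate first, then do all reading inside the try block.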

Comments on file 'trunk/src/dk/netarkivet/wayback/batch/ExtractWaybackCDXBatchJob.java', revision 995

Lines | Description | Classification | Status
43-44 | We don't include that kind of information in the NetarchiveSuite javadoc | Cosmetic | OK
52-53 | Missing javadoc | Cosmetic | OK
55 | Missing javadoc | Cosmetic | OK
64 | Missing javadoc | Cosmetic | OK
84 | javadoc | Cosmetic | OK

Comments on file 'trunk/src/dk/netarkivet/wayback/batch/ExtractDeduplicateCDXBatchJob.java', revision 995

Lines | Description | Classification | Status
General | Remove underscores from variable names; they violate our coding style | Cosmetic | OK
30 | Unnecessary blank lines in import block | NA | OK
42 | Remove unused line | Cosmetic | OK
48 | Missing javadoc | Cosmetic | OK
53 | Missing javadoc | NA | OK
63 | Missing javadoc | Cosmetic | OK

Comments on file 'trunk/src/dk/netarkivet/wayback/batch/UrlCanonicalizerFactory.java', revision 995

Lines | Description | Classification | Status
31 | Missing period in first sentence of javadoc | Cosmetic | OK
47 | Use try/catch on SecurityException | Cosmetic | OK
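
For the comment on line 47, a minimal sketch of catching SecurityException is given here. It assumes that the factory creates its object reflectively from a configured class name; that is an assumption, not something this review states. ReflectiveFactorySketch and newInstance are hypothetical names.

```java
// Hypothetical illustration; not the reviewed factory.
class ReflectiveFactorySketch {

    /**
     * Creates an instance of the named class through its no-argument
     * constructor. SecurityException is caught separately so that a
     * permission problem is reported distinctly from a missing or
     * unusable class.
     */
    static Object newInstance(String className) {
        if (className == null) {
            throw new IllegalArgumentException("className must not be null");
        }
        try {
            return Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (SecurityException e) {
            throw new IllegalStateException(
                    "Not permitted to instantiate " + className, e);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                    "Could not instantiate " + className, e);
        }
    }
}
```

Wrapping both failure modes in an unchecked exception is a design choice of the sketch; the real code would presumably use the project's own exception types.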

Comments on file 'trunk/src/dk/netarkivet/wayback/batch/DeduplicateToCDXAdapter.java', revision 995

Lines | Description | Classification | Status
General | Remove underscores from variable names; they violate our coding style | Cosmetic | OK
General | Divide lines longer than 80 characters into two lines | Cosmetic | OK
41 | Missing class javadoc | NA | OK
48-58 | Missing javadoc | Cosmetic | OK
60 | Missing javadoc | Cosmetic | OK
62 | Missing javadoc | Cosmetic | OK
66 | Missing javadoc, and missing validation of argument 'line' | Cosmetic | OK
67 | Make a constant for the "duplicate:" string | Cosmetic | OK
108 | Missing javadoc and argument validation | Cosmetic | OK
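
The comments on lines 66 and 67 (validate the argument 'line', make a constant for the "duplicate:" string) might look roughly like this. The names DuplicateMarkerSketch, DUPLICATE_MARKER and isDuplicateLine are hypothetical; only the "duplicate:" literal comes from the review, and how the adapter really uses it is not shown here.

```java
// Hypothetical illustration; not the reviewed adapter.
class DuplicateMarkerSketch {

    /** Marker identifying deduplication information in a crawl-log line. */
    private static final String DUPLICATE_MARKER = "duplicate:";

    /**
     * Returns true if the given crawl-log line carries deduplication
     * information, judged by the presence of the marker string.
     */
    static boolean isDuplicateLine(String line) {
        if (line == null) {
            throw new IllegalArgumentException("line must not be null");
        }
        return line.contains(DUPLICATE_MARKER);
    }
}
```

With the literal defined once, a later change to the marker only has to be made in one place.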

Comments on file 'trunk/src/dk/netarkivet/wayback/batch/DeduplicateToCDXAdapterInterface.java', revision 995

Lines | Description | Classification | Status
30 | Missing period in first sentence of javadoc | Cosmetic | OK
37 | What type of canonicalization is done on the target URL? | Cosmetic | OK
46 | Replace "dedup lines" with "lines containing deduplication information" or similar | Cosmetic | TOK
47 | Missing period in first sentence of javadoc | Cosmetic | OK

Comments on file 'trunk/src/dk/netarkivet/wayback/WaybackSettings.java', revision 995

Lines | Description | Classification | Status
General | Missing SVN "svn:keywords" property with value "URL Revision Author Date Id" | Cosmetic | OK
1 | File headers/copyright missing | Cosmetic | OK
25 | Missing javadoc | Cosmetic | OK

IssuesFromNs87 (last edited 2010-08-16 10:25:07 by localhost)