Review (NS-87): FR1678: Improved indexing for wayback

Author: Colin

Moderator: Colin

State: Closed

Objectives

See https://gforge.statsbiblioteket.dk/tracker/index.php?func=detail&aid=1678
and http://netarchive.dk/suite/ImprovedIndexing

The implemented code includes:
A batch job to extract wayback cdx indexes from archive arc files
A batch job to extract wayback cdx indexes from deduplication metadata records
An application to extract wayback cdx indexes from deduplication crawl logs
 + associated helper methods

Summary

follow-up: csr

Total Time Used (Coding,Documentation,Review):

CSR:4 MD
SVC:0.5 MD

General comments:

| Description | Classification | Status |
|-------------|----------------|--------|
| Several files are missing the SVN "svn:keywords" property; set it with value=URL Revision Author Date Id | Cosmetic | NOTOK |

Comments on file 'trunk/src/dk/netarkivet/wayback/DeduplicateToCDXApplication.java', revision 995

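| Lines | Description | Classification | Status |
|-------|-------------|----------------|--------|
| 28 | Blank line | NA | NOTOK |
| 50 | Missing argument validation | Cosmetic | NOTOK |
| 55 | Explicitly close stream | Minor | NOTOK |
| 70 | [spelling] wyaback => wayback | Cosmetic | NOTOK |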
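The comments on lines 50 and 55 could be addressed along the following lines. This is only a sketch under assumptions: the review does not show the application's code, so the class name `CrawlLogReaderSketch`, the methods `validateArgs` and `countLines`, and the idea that the arguments name crawl-log files are hypothetical stand-ins rather than the actual contents of DeduplicateToCDXApplication.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual DeduplicateToCDXApplication. */
final class CrawlLogReaderSketch {

    /**
     * Validates the command-line arguments and returns the files they name.
     * @param args the command-line arguments, assumed here to be file paths
     * @return the readable files named by the arguments
     * @throws IllegalArgumentException if no arguments are given or an
     * argument does not name a readable file
     */
    static List<File> validateArgs(String[] args) {
        if (args == null || args.length == 0) {
            throw new IllegalArgumentException("No crawl log files specified");
        }
        List<File> files = new ArrayList<File>();
        for (String arg : args) {
            File file = new File(arg);
            if (!file.isFile() || !file.canRead()) {
                throw new IllegalArgumentException(
                        "Not a readable file: '" + arg + "'");
            }
            files.add(file);
        }
        return files;
    }

    /**
     * Counts the lines in a file, closing the reader explicitly.
     * @param file the file to read
     * @return the number of lines in the file
     * @throws IOException if the file cannot be read
     */
    static int countLines(File file) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(file));
        try {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } finally {
            // Runs on both normal return and exception, so the stream
            // is never left open.
            reader.close();
        }
    }
}
```

Validating before any file is opened means a bad argument fails fast with a clear message, and the `finally` block guarantees the stream is closed whichever way the method exits.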

Comments on file 'trunk/src/dk/netarkivet/wayback/batch/ExtractWaybackCDXBatchJob.java', revision 995

| Lines | Description | Classification | Status |
|-------|-------------|----------------|--------|
| 43-44 | We don't include that kind of information in the NetarchiveSuite javadoc | Cosmetic | NOTOK |
| 52-53 | Missing javadoc | Cosmetic | NOTOK |
| 55 | Missing javadoc | Cosmetic | NOTOK |
| 64 | Missing javadoc | Cosmetic | NOTOK |
| 84 | javadoc | Cosmetic | NOTOK |

Comments on file 'trunk/src/dk/netarkivet/wayback/batch/ExtractDeduplicateCDXBatchJob.java', revision 995

| Lines | Description | Classification | Status |
|-------|-------------|----------------|--------|
| General | Remove underscores in variable names; they violate our coding style | Cosmetic | NOTOK |
| 30 | Unnecessary blank lines in import block | NA | NOTOK |
| 42 | Remove unused line | Cosmetic | NOTOK |
| 48 | Missing javadoc | Cosmetic | NOTOK |
| 53 | Missing javadoc | NA | NOTOK |
| 63 | Missing javadoc | Cosmetic | NOTOK |

Comments on file 'trunk/src/dk/netarkivet/wayback/batch/UrlCanonicalizerFactory.java', revision 995

| Lines | Description | Classification | Status |
|-------|-------------|----------------|--------|
| 31 | Missing period in first sentence of javadoc | Cosmetic | NOTOK |
| 47 | Use try/catch on SecurityException | Cosmetic | NOTOK |
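One way to handle the SecurityException comment on line 47 is sketched below. The review does not show the factory's code, so this assumes the factory instantiates a class by name through reflection; `ReflectiveFactorySketch` and `newInstance` are hypothetical names, and the real UrlCanonicalizerFactory may work differently.

```java
/** Hypothetical sketch; the real UrlCanonicalizerFactory may differ. */
final class ReflectiveFactorySketch {

    /**
     * Instantiates the named class through its no-argument constructor.
     * @param className fully qualified name of the class to instantiate
     * @return a new instance of the named class
     * @throws IllegalArgumentException if className is null or empty
     * @throws IllegalStateException if the class cannot be loaded or
     * instantiated, or if a security manager forbids the access
     */
    static Object newInstance(String className) {
        if (className == null || className.isEmpty()) {
            throw new IllegalArgumentException(
                    "className must not be null or empty");
        }
        try {
            Class<?> clazz = Class.forName(className);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (SecurityException e) {
            // Thrown when a security manager denies reflective access.
            throw new IllegalStateException(
                    "Not permitted to instantiate '" + className + "'", e);
        } catch (ReflectiveOperationException e) {
            // Covers a missing class or constructor and failed instantiation.
            throw new IllegalStateException(
                    "Unable to instantiate '" + className + "'", e);
        }
    }
}
```

Catching SecurityException separately keeps a permissions problem distinguishable from a misconfigured class name in the resulting error message.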

Comments on file 'trunk/src/dk/netarkivet/wayback/batch/DeduplicateToCDXAdapter.java', revision 995

| Lines | Description | Classification | Status |
|-------|-------------|----------------|--------|
| General | Remove underscores in variable names; they violate our coding style | Cosmetic | NOTOK |
| General | Divide lines longer than 80 characters into two lines | Cosmetic | NOTOK |
| 41 | Missing class javadoc | NA | NOTOK |
| 48-58 | Missing javadoc | Cosmetic | NOTOK |
| 60 | Missing javadoc | Cosmetic | NOTOK |
| 62 | Missing javadoc | Cosmetic | NOTOK |
| 66 | Missing javadoc, and missing validation of argument 'line' | Cosmetic | NOTOK |
| 67 | Make a constant for the "duplicate:" string | Cosmetic | NOTOK |
| 108 | Missing javadoc and argument validation | Cosmetic | NOTOK |
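The comments on lines 66 and 67 might be resolved roughly as follows. This is a sketch, not the adapter's real logic: `DuplicateLineSketch`, `DUPLICATE_MARKER` and `extractDuplicateInfo` are hypothetical names, and returning everything after the marker is an assumption made only to keep the example small. Only the "duplicate:" string and the argument name 'line' come from the review.

```java
/** Hypothetical sketch; not the actual DeduplicateToCDXAdapter. */
final class DuplicateLineSketch {

    /** Marker that introduces deduplication information in a line. */
    private static final String DUPLICATE_MARKER = "duplicate:";

    /**
     * Returns the text that follows the duplicate marker in a line.
     * @param line a line that may contain deduplication information
     * @return the trimmed text after the marker, or null if the line
     * does not contain the marker
     * @throws IllegalArgumentException if line is null or empty
     */
    static String extractDuplicateInfo(String line) {
        if (line == null || line.isEmpty()) {
            throw new IllegalArgumentException(
                    "Argument 'line' must not be null or empty");
        }
        int start = line.indexOf(DUPLICATE_MARKER);
        if (start < 0) {
            return null;
        }
        // Assumption for illustration: everything after the marker is
        // the deduplication information.
        return line.substring(start + DUPLICATE_MARKER.length()).trim();
    }
}
```

With the marker in a single constant, the search and the offset calculation cannot drift apart if the string ever changes.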

Comments on file 'trunk/src/dk/netarkivet/wayback/batch/DeduplicateToCDXAdapterInterface.java', revision 995

| Lines | Description | Classification | Status |
|-------|-------------|----------------|--------|
| 30 | Missing period in first sentence of javadoc | Cosmetic | NOTOK |
| 37 | What type of canonicalization is done on the target URL? | Cosmetic | NOTOK |
| 46 | Replace "dedup lines" with "lines containing deduplication information" or similar | Cosmetic | NOTOK |
| 47 | Missing period in first sentence of javadoc | Cosmetic | NOTOK |

Comments on file 'trunk/src/dk/netarkivet/wayback/WaybackSettings.java', revision 995

| Lines | Description | Classification | Status |
|-------|-------------|----------------|--------|
| General | Missing SVN "svn:keywords" property with value=URL Revision Author Date Id | Cosmetic | NOTOK |
| 1 | File headers/copyright missing | Cosmetic | NOTOK |
| 25 | Missing javadoc | Cosmetic | NOTOK |