Revision 1 as of 2010-12-01 15:22:22
== Estimate time for Improving deduplication: 14 MD ==
== Degree of uncertainty: B ==
== Responsible: SVC ==
== Improving deduplication ==
=== Introduction ===
Generating deduplication indices is currently error-prone and time-consuming. However, I believe that resolving the OOM bugs in the batch monitoring software may reduce the errors seen so far while generating indices.
Task 1: Try to retrieve data from another bitarchive replica if the local bitarchive replica is unresponsive.
Refer to FR XXXXX.
Task 2: Optimize the fetching process.
To generate the deduplication indices, the software needs to fetch a large number of metadata records, specifically the records containing the crawl logs and the CDX entries.
Currently, for each job ID included in the dedup index, our software launches two batch jobs: one for the CDX entries and one for the crawl log. This could perhaps be reduced to a single job, which might also do some of the processing locally on the storage machines.
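As a rough illustration of combining the two batch jobs, the sketch below makes a single pass over a job's metadata records and collects both kinds of input at once. The record structure and the type tags are assumptions for the example, not the actual NetarchiveSuite metadata format or batch API:

```python
# Illustrative sketch only: one traversal of a job's metadata records
# that collects both CDX entries and crawl-log lines, instead of
# launching one batch job per record type. The dict-based record
# shape and the "cdx"/"crawllog" tags are hypothetical.

def collect_dedup_inputs(metadata_records):
    """Partition a job's metadata records into CDX entries and
    crawl-log lines in a single pass."""
    cdx_entries, crawllog_lines = [], []
    for record in metadata_records:
        if record["type"] == "cdx":
            cdx_entries.append(record["payload"])
        elif record["type"] == "crawllog":
            crawllog_lines.append(record["payload"])
    return cdx_entries, crawllog_lines

# Tiny example with fabricated records for one job:
records = [
    {"type": "cdx", "payload": "http://example.org/ 200 ..."},
    {"type": "crawllog", "payload": "2010-12-01 http://example.org/ ..."},
]
cdx, logs = collect_dedup_inputs(records)
print(len(cdx), len(logs))  # → 1 1
```

A single pass like this halves the number of batch jobs per job ID; doing the partitioning on the storage machines would also cut the amount of raw record data shipped back to the indexer.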
Task 3: Optimize the indexing process.
Currently, part of the indexing process is merging the information found in the CDX file with the information found in the crawl log. Investigate whether this merging could be done as a post-processing step, or as a pre-processing step if this information is absent from the metadata records.
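To make the merging step concrete, the sketch below joins crawl-log lines with CDX entries on URL. The field layouts are deliberately simplified assumptions; real Heritrix crawl logs and CDX files carry more fields:

```python
# Illustrative sketch: merge crawl-log information with CDX
# information, keyed on URL. Field positions are simplified
# assumptions, not the real crawl-log/CDX formats.

def merge_crawllog_with_cdx(crawllog_lines, cdx_entries):
    """Annotate each crawl-log URL with its (arcfile, offset) from the CDX."""
    # Index CDX entries by URL; each entry here is "url arcfile offset".
    cdx_by_url = {}
    for entry in cdx_entries:
        url, arcfile, offset = entry.split()
        cdx_by_url[url] = (arcfile, int(offset))
    merged = []
    # Each crawl-log line here is "url digest".
    for line in crawllog_lines:
        url, digest = line.split()
        arcfile, offset = cdx_by_url.get(url, (None, None))
        merged.append({"url": url, "digest": digest,
                       "arcfile": arcfile, "offset": offset})
    return merged

merged = merge_crawllog_with_cdx(
    ["http://example.org/ sha1:ABC"],
    ["http://example.org/ job-1.arc 1234"],
)
print(merged[0]["offset"])  # → 1234
```

Since the join only needs the CDX lookup table and the crawl-log lines, it could run equally well before indexing (pre-processing) or after (post-processing), which is exactly the question Task 3 raises.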