Revision 1 as of 2010-12-01 15:22:22
== Estimate time for Improving deduplication: 14 MD ==
== Degree of uncertainty: B ==
== Responsible: SVC ==
== Improving deduplication ==
=== Introduction ===
Generating deduplication indices is currently error-prone and time-consuming. However, I believe that resolving the OOM bugs in the batch monitoring software may reduce the errors seen so far while generating indices.
Task 1: Try to retrieve data from another bitarchive replica if the local bitarchive replica is unresponsive.
Refer to FR XXXXX.
Task 2: Optimize the fetching process.
To generate the deduplication indices, the software needs to fetch a large number of metadata records, specifically the records containing the crawl logs and the CDX entries.
Currently, for each job ID included in the dedup index, our software launches two batch jobs: one for the CDX entries and one for the crawl log. This could perhaps be reduced to a single job, which might also do some of the processing locally on the storage machines.
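As a rough illustration of combining the two batch jobs, the sketch below makes a single pass over a job's metadata records and collects both kinds of input at once. The record structure and the type tags are assumptions for the example, not the actual NetarchiveSuite metadata format or batch API:

```python
# Illustrative sketch only: one traversal of a job's metadata records
# that collects both CDX entries and crawl-log lines, instead of
# launching one batch job per record type. The dict-based record
# shape and the "cdx"/"crawllog" tags are hypothetical.

def collect_dedup_inputs(metadata_records):
    """Partition a job's metadata records into CDX entries and
    crawl-log lines in a single pass."""
    cdx_entries, crawllog_lines = [], []
    for record in metadata_records:
        if record["type"] == "cdx":
            cdx_entries.append(record["payload"])
        elif record["type"] == "crawllog":
            crawllog_lines.append(record["payload"])
    return cdx_entries, crawllog_lines

# Tiny example with fabricated records for one job:
records = [
    {"type": "cdx", "payload": "http://example.org/ 200 ..."},
    {"type": "crawllog", "payload": "2010-12-01 http://example.org/ ..."},
]
cdx, logs = collect_dedup_inputs(records)
print(len(cdx), len(logs))  # → 1 1
```

A single pass like this halves the number of batch jobs per job ID; doing the partitioning on the storage machines would also cut the amount of raw record data shipped back to the indexer.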
Task 3: Optimize the indexing process.
Currently, part of the indexing process is merging the information found in the CDX file with the information found in the crawl log. Investigate whether this merging could be done as a post-processing step, or as a pre-processing step if this information is absent from the metadata records.
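To make the merging step concrete, the sketch below joins crawl-log lines with CDX entries on URL. The field layouts are deliberately simplified assumptions; real Heritrix crawl logs and CDX files carry more fields:

```python
# Illustrative sketch: merge crawl-log information with CDX
# information, keyed on URL. Field positions are simplified
# assumptions, not the real crawl-log/CDX formats.

def merge_crawllog_with_cdx(crawllog_lines, cdx_entries):
    """Annotate each crawl-log URL with its (arcfile, offset) from the CDX."""
    # Index CDX entries by URL; each entry here is "url arcfile offset".
    cdx_by_url = {}
    for entry in cdx_entries:
        url, arcfile, offset = entry.split()
        cdx_by_url[url] = (arcfile, int(offset))
    merged = []
    # Each crawl-log line here is "url digest".
    for line in crawllog_lines:
        url, digest = line.split()
        arcfile, offset = cdx_by_url.get(url, (None, None))
        merged.append({"url": url, "digest": digest,
                       "arcfile": arcfile, "offset": offset})
    return merged

merged = merge_crawllog_with_cdx(
    ["http://example.org/ sha1:ABC"],
    ["http://example.org/ job-1.arc 1234"],
)
print(merged[0]["offset"])  # → 1234
```

Since the join only needs the CDX lookup table and the crawl-log lines, it could run equally well before indexing (pre-processing) or after (post-processing), which is exactly the question Task 3 raises.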