untangle format identification/validation/MDextraction
In the course of the discussion on the "File Extension Mismatch - Change Notice" posted on BaseCamp by Opher Kutner (ExL), I noticed that some aspects of file identification/validation/MD extraction in Rosetta need improvement. What got me thinking was the statement that format identification was done by the extractor, which isn't the case. The actual behaviour is that the MdExtractionPlugin also does the format validation, while the format identification is run separately in DROID. However, the job of an extractor is neither format identification nor format validation; it is metadata extraction only. Usually, a preservation repository can't even know which extractor to apply without having successfully completed identification and validation beforehand. Running format validation and metadata extraction in the same step introduces problems that could be avoided by dividing them into separate steps.
From my point of view, the correct workflow would be as follows:
1. Run format identification against FL/PRONOM/YourFormatRegistryHere, get list of recognized formats (sorted descending by certainty of identification)
2. Run the format validation routine for elements from the list from step 1; only consider format IDs that were identified with an identification certainty above a certain percentage.
2.a. If two elements from the list have the same identification probability, then both validators need to be executed. Human intervention is necessary to decide which format should be used. However, this should never occur, because PRONOM maintains a distinct prioritisation of format signatures.
2.b. If validation is successful, continue with 3.
2.c. If validation fails, go back to 2. and try next element in list.
2.d. If validation fails for all elements of the list, move file to TA.
3. Choose the correct MD extractor, extract MD.
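The three steps above can be sketched in a few lines of Python. This is only an illustration of the proposed control flow, not Rosetta code: `Candidate`, `MoveToTA`, `NeedsHumanDecision`, `process_file` and the default threshold of 0.8 are all hypothetical names and values, and the identifier, validators and extractors are passed in as plain callables. For step 2.a, the sketch assumes that human intervention is only needed when two equally certain candidates both validate.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    format_id: str    # hypothetical: a registry ID for the format
    certainty: float  # hypothetical: 0.0 .. 1.0 as reported by the identifier


class MoveToTA(Exception):
    """The file could not be processed and goes to TA (step 2.d)."""


class NeedsHumanDecision(Exception):
    """Two equally certain formats both validated (step 2.a)."""


def process_file(path, identify, validators, extractors, threshold=0.8):
    # Step 1: identify, sorted descending by certainty.
    candidates = sorted(identify(path), key=lambda c: c.certainty, reverse=True)
    # Step 2: only consider candidates above the certainty threshold.
    eligible = [c for c in candidates if c.certainty > threshold]
    for cand in eligible:
        validate = validators.get(cand.format_id)
        if validate is None or not validate(path):
            continue  # Step 2.c: try the next element in the list.
        # Step 2.a: run the validators of all equally certain candidates.
        ties = [
            c for c in eligible
            if c is not cand
            and c.certainty == cand.certainty
            and c.format_id in validators
            and validators[c.format_id](path)
        ]
        if ties:
            raise NeedsHumanDecision(path)
        # Step 3: choose the MD extractor for the confirmed format.
        extract = extractors.get(cand.format_id)
        if extract is None:
            raise MoveToTA(f"no MD extractor for {cand.format_id}")
        return cand.format_id, extract(path)
    # Step 2.d: validation failed for every element of the list.
    raise MoveToTA(f"no candidate format validated for {path}")
```

The point of the sketch is that the extractor is only looked up after a format has been both identified and validated, so it never has to do either job itself.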
This leaves us with the following classes of errors:
1. Format identification fails, because:
1.a. No recognized formats at all (empty list). This might be caused by a missing signature in the format registry.
1.b. No formats with sufficient certainty identified.
1.c. One format identified with sufficient certainty, but file extension doesn't match. This might be caused by either:
1.c.1. An extension that is actually wrong for that file type, and needs to be corrected.
1.c.2. A wrong signature that misidentifies a file. (We had that problem in an older FL version, where TIFFs were misidentified as .NEFs due to an overly general signature for the NEF format.)
1.d. Multiple formats identified with sufficient certainty, but no file extension matches.
1.e. Internal error / Format identification plugin crashes.
2. Format validation fails, because:
2.a. If exactly one format was identified with sufficient certainty:
2.a.1. One validator called, file validation doesn't return successfully.
2.b. If multiple formats were identified with sufficient certainty:
2.b.1. Validators for all detected formats were called, none returned successfully. This should never occur.
2.c. Internal error / Validator crashes.
3. MD extraction fails, because:
3.a. No MD extractor plugin for that format available.
3.b. No significant properties / DNX mapping configured for that plugin/format combination at all.
3.c. File doesn't contain any of the significant properties that are configured for that plugin/format combination.
3.d. Internal error / Format extractor plugin crashes.
ELSE One format identified with sufficient certainty, no file extension mismatch, NO ERROR.
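To make the identification error classes 1.a to 1.d concrete, here is a small Python sketch that maps an identification result to one of them. All names (`IdentificationError`, `classify_identification`) and the threshold value are hypothetical. The input is assumed to be a list of `(certainty, extensions)` pairs, where `extensions` is the set of file extensions the registry lists for that format. One case is not covered by the list above and is an assumption of the sketch: if several formats are identified with sufficient certainty and one of them matches the file extension, no error is reported.

```python
from enum import Enum


class IdentificationError(Enum):
    NO_FORMAT_RECOGNIZED = "1.a"
    INSUFFICIENT_CERTAINTY = "1.b"
    EXTENSION_MISMATCH = "1.c"
    MULTIPLE_FORMATS_NO_EXTENSION_MATCH = "1.d"


def classify_identification(candidates, file_extension, threshold=0.8):
    """Return the error class for an identification result, or None if OK."""
    if not candidates:
        return IdentificationError.NO_FORMAT_RECOGNIZED
    # Keep the registry extensions of all sufficiently certain candidates.
    confident = [exts for certainty, exts in candidates if certainty > threshold]
    if not confident:
        return IdentificationError.INSUFFICIENT_CERTAINTY
    ext = file_extension.lower().lstrip(".")
    if any(ext in exts for exts in confident):
        return None  # assumption: an extension match means no identification error
    if len(confident) == 1:
        return IdentificationError.EXTENSION_MISMATCH
    return IdentificationError.MULTIPLE_FORMATS_NO_EXTENSION_MATCH
```

Classes 1.e, 2.c and 3.d (crashing plugins) are left out; they would be caught as exceptions around the respective plugin call.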
The most important part here is that a format is only identified with certainty if a) the signature, MIME type and file extension given in the format registry match what is found in the actual file and b) the validation has finished successfully. Only if both conditions are met can we be sure about the result of our format identification. I want to illustrate this with the example of the TIFF format. In cases where one or both of the aforementioned conditions are not met, we DON'T have an invalid TIFF. What we DO have is something that looks suspiciously like a TIFF, but doesn't comply with the format specification, and therefore is NOT a TIFF. In other words: a format is ONLY successfully identified if the identification result is confirmed by the validation result.
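Expressed as code, the rule is a plain conjunction. The following Python sketch uses a hypothetical `IdentificationEvidence` record and `is_confirmed` function; neither exists in Rosetta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentificationEvidence:
    signature_matches: bool
    mime_type_matches: bool
    extension_matches: bool
    validation_passed: bool


def is_confirmed(evidence):
    """Condition a) registry data matches the file, and b) validation passed."""
    registry_match = (
        evidence.signature_matches
        and evidence.mime_type_matches
        and evidence.extension_matches
    )
    return registry_match and evidence.validation_passed
```

A file with a matching TIFF signature, MIME type and extension that fails validation yields `False` here: it is treated as "not a TIFF", not as "an invalid TIFF".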
So, if we assume that the changes suggested above are implemented, the whole File Extension Mismatch Handling issue would become a lot easier. Every file that fails one of the stages ends up in TA and, ideally, will be returned to the producer to be repaired (or at least the producer will be notified about 1. the file being rejected, 2. the reason for that, including the error message, and 3. the necessity to repair the file and re-ingest it). Alternatively, the TA user could run a repair operation themselves, but in the majority of cases, the file that caused the error will have to be modified until the mismatch no longer occurs. This usually requires human intervention and analysis; just treating the error with a rule and ignoring it doesn't solve the underlying problem. As these underlying problems are usually a risk for the data in the repository, ignoring them is not an option.
The consequences are that:
- unknown formats can't be moved to permanent (and, as outlined, they shouldn't, because knowledge of the format is a strict requirement for risk analysis).
- format IDs can't be manually assigned to files at all, because that amounts to ignoring the issue and postponing the solution to the future, hoping that "someone will probably find time to do it then" (hint: they won't).
- the structure of plugin types in Rosetta needs to be refined, now strictly separating plugins for identification, validation and MD extraction, and also separating plugins by their intended format use. This probably means that existing plugins will need to be modified.
- for cases where the files can't be returned to the producer and the repair operations have to be done by Rosetta staff:
- a new plugin type "RepairPlugin" could be introduced to run very specific repair operations from within the TA workbench. The current manual approach is extremely cumbersome and doesn't scale at all.
- repair operations in the TA workbench will need to support a higher level of automation to a) classify/group files not only by their IE but also by TypeOfError/numberOfErrors/ingestDate/anythingElseThatMightBeUseful, b) run repair operations on all files from the same group, or c) initiate repairs via an API (get errorMessages, get broken files from same group, repair outside of Rosetta, replace broken files, NOT using insecure ImportDescriptor.csv files). Also see these public SupportCases in SalesForce:
- 00128163 bulk file replace in TA [Rosetta 220.127.116.11] (https://exlibrisgroup--c.na62.visual.force.com/apex/VF_Case_WithoutJira?id=5006000000kHhL1AAK)
- 00340712 enhance TA file replace assistant [Rosetta 18.104.22.168] (https://exlibrisgroup--c.na62.visual.force.com/apex/VF_Case_WithoutJira?id=500320000151LRhAAM)
- MD extractors are no longer assigned to Classification Groups (at least not exclusively). Instead, MD extractors should also be assignable to single format IDs. For practical reasons, formats could inherit their MD extractor setting from their Classification Group to set sensible defaults for all formats as long as no format-specific MD extractor is configured.
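The inheritance described in the last point could look roughly like this in Python. The function name `resolve_extractor` and the three dictionaries are hypothetical stand-ins for whatever configuration Rosetta would actually hold; the format IDs and extractor names in the usage example are made up.

```python
def resolve_extractor(format_id, format_extractors, format_groups, group_extractors):
    """Return the MD extractor for a format ID, or None if none is configured.

    format_extractors: format ID -> extractor (format-specific setting)
    format_groups:     format ID -> Classification Group
    group_extractors:  Classification Group -> extractor (default setting)
    """
    # A format-specific extractor always wins.
    if format_id in format_extractors:
        return format_extractors[format_id]
    # Otherwise inherit the default from the format's Classification Group.
    group = format_groups.get(format_id)
    return group_extractors.get(group)
```

For example, with `format_groups = {"fmt/x": "Image", "fmt/y": "Image"}`, `group_extractors = {"Image": "GenericImageExtractor"}` and `format_extractors = {"fmt/y": "SpecialExtractor"}`, format `fmt/x` inherits the group default while `fmt/y` uses its own extractor.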
Jörg Sachse commented
@Stuart: Thanks, I think that's a great idea, especially considering how weak of a file type indicator file extensions actually are.
stuart yeates commented
I think that this would definitely be an improvement. I suggest, however, that:
Mime-type and/or file extension be used to add formats to the low-priority end of the list of candidate file formats. This catches cases where the validator has better coverage than "FL/PRONOM/YourFormatRegistryHere" AND either mime-type or file extension are correct.
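As a rough illustration of this suggestion (a Python sketch with a hypothetical function name; the candidate list is assumed to be a list of format IDs already sorted by priority):

```python
def add_hinted_candidates(candidates, hinted_format_ids):
    """Append formats hinted at by MIME type or extension at the low-priority end."""
    result = list(candidates)
    seen = set(candidates)
    for format_id in hinted_format_ids:
        # Formats already found by signature keep their original priority.
        if format_id not in seen:
            result.append(format_id)
            seen.add(format_id)
    return result
```

The hinted formats are only tried by the validators after every signature-based candidate has failed.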
Jörg Sachse commented
Oh, and if the IdeaExchange software would support tab indentation, that would be really helpful. Thx!