untangle format identification/validation/MDextraction
In the course of the discussion on the "File Extension Mismatch - Change Notice" posted on BaseCamp by Opher Kutner (ExL), I noticed that some aspects of file identification/validation/MDextraction in Rosetta need improvement. What got me thinking was the statement that format identification is done by the extractor, which isn't the case. The actual behaviour is that the MdExtractionPlugin also does the format validation, while the format identification is run separately in DROID. However, the job of an extractor is neither format identification nor format validation; it is metadata extraction only. Usually, a preservation repository can't even know which extractor to apply before identification and validation have completed successfully. Running format validation and metadata extraction in the same step introduces problems that could be avoided by splitting them into separate steps.
From my point of view, the correct workflow would be as follows:
1. Run format identification against FL/PRONOM/YourFormatRegistryHere, get list of recognized formats (sorted descending by certainty of identification)
2. Run the format validation routine for the elements of the list from step 1, but only consider format IDs that were identified with an identification certainty above a certain percentage.
2.a. If two elements from the list have the same identification probability, then both validators need to be executed. Human intervention is necessary to decide which format should be used. However, this should never occur, because PRONOM maintains a distinct prioritisation of format signatures.
2.b. If validation successful, continue with 3.
2.c. If validation fails, go back to 2. and try next element in list.
2.d. If validation fails for all elements of the list, move file to TA.
3. Choose the correct MD extractor, extract MD.
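The workflow above can be sketched in a few lines of Python. This is only an illustration under assumptions, not how Rosetta works: `Candidate`, `MoveToTA`, `process`, the plugin dictionaries and the threshold of 0.8 are all hypothetical names and values. It also assumes that for tied candidates (2.a) a human decision is only needed when more than one of them validates, and it treats a missing validator like a failed validation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    format_id: str    # placeholder ID, e.g. a PUID from the format registry
    certainty: float  # 0.0 to 1.0, as reported by the identification step


class MoveToTA(Exception):
    """The file failed a stage and has to be handled in the TA."""


def process(path: str,
            candidates: List[Candidate],
            validators: Dict[str, Callable[[str], bool]],
            extractors: Dict[str, Callable[[str], dict]],
            threshold: float = 0.8) -> Tuple[str, dict]:
    # Steps 1 and 2: keep sufficiently certain candidates, best first.
    ranked = sorted((c for c in candidates if c.certainty >= threshold),
                    key=lambda c: c.certainty, reverse=True)
    if not ranked:
        raise MoveToTA("no format identified with sufficient certainty")

    i = 0
    while i < len(ranked):
        # 2.a: candidates with equal certainty are validated together.
        tied = [c for c in ranked[i:] if c.certainty == ranked[i].certainty]
        valid = [c.format_id for c in tied
                 if c.format_id in validators and validators[c.format_id](path)]
        if len(valid) > 1:
            # Assumption: a human decides only when several tied formats validate.
            raise MoveToTA("ambiguous formats: " + ", ".join(valid))
        if valid:
            # 2.b and 3: validation succeeded, pick the matching extractor.
            fmt = valid[0]
            if fmt not in extractors:
                raise MoveToTA("no MD extractor for " + fmt)
            return fmt, extractors[fmt](path)
        i += len(tied)  # 2.c: try the next element(s) of the list
    # 2.d: validation failed for every candidate.
    raise MoveToTA("validation failed for all candidates")
```

The point of the sketch is that the extractor is only ever chosen after a validator has confirmed the identification result.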
This leaves us with the following classes of errors:
1. Format identification fails, because:
1.a. No recognized formats at all (empty list). This might be caused by a missing signature in the format registry.
1.b. No formats with sufficient certainty identified.
1.c. One format identified with sufficient certainty, but file extension doesn't match. This might be caused by either:
1.c.1. An extension that is actually wrong for that file type, and needs to be corrected.
1.c.2. A wrong signature that misidentifies a file. (We had that problem in an older FL version, where TIFFs were misidentified as .NEFs due to an overly general signature for the NEF format.)
1.d. Multiple formats identified with sufficient certainty, but no file extension matches.
1.e. Internal error / Format identification plugin crashes.
2. Format validation fails, because:
2.a. If exactly one format was identified with sufficient certainty:
2.a.1. One validator called, file validation doesn't return successfully.
2.b. If multiple formats were identified with sufficient certainty:
2.b.1. Validators for all detected formats were called, none returned successfully. This should never occur.
2.c. Internal error / Validator crashes.
3. MD extraction fails, because:
3.a. No MD extractor plugin for that format available.
3.b. No significant properties / DNX mapping configured for that plugin/format combination at all.
3.c. File doesn't contain any of the significant properties that are configured for that plugin/format combination.
3.d. Internal error / Format extractor plugin crashes.
ELSE One format identified with sufficient certainty, no file extension mismatch, NO ERROR.
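To make the identification error classes 1.a to 1.d a bit more tangible, here is a small Python sketch. The names `IngestError` and `classify_identification`, the shape of the inputs and the default threshold are assumptions made for illustration only; internal errors (1.e) and the validation and extraction classes are listed but not detected here. Where several formats are identified and at least one extension matches, the sketch reports no identification error and leaves the decision to validation, which is also an assumption.

```python
from enum import Enum
from typing import Dict, Optional, Set


class IngestError(Enum):
    # Values are the item numbers used in the list above.
    ID_NO_FORMAT = "1.a"
    ID_LOW_CERTAINTY = "1.b"
    ID_EXTENSION_MISMATCH = "1.c"
    ID_MULTIPLE_NO_EXTENSION_MATCH = "1.d"
    ID_INTERNAL = "1.e"
    VAL_SINGLE_FAILED = "2.a.1"
    VAL_ALL_FAILED = "2.b.1"
    VAL_INTERNAL = "2.c"
    MDX_NO_PLUGIN = "3.a"
    MDX_NO_MAPPING = "3.b"
    MDX_NO_PROPERTIES = "3.c"
    MDX_INTERNAL = "3.d"

    @property
    def stage(self) -> str:
        stages = {"1": "identification", "2": "validation", "3": "extraction"}
        return stages[self.value[0]]


def classify_identification(certainties: Dict[str, float],
                            registry_extensions: Dict[str, Set[str]],
                            file_extension: str,
                            threshold: float = 0.8) -> Optional[IngestError]:
    """Return the identification error class, or None if there is none."""
    if not certainties:
        return IngestError.ID_NO_FORMAT
    good = [f for f, c in certainties.items() if c >= threshold]
    if not good:
        return IngestError.ID_LOW_CERTAINTY
    ext = file_extension.lower()
    if any(ext in registry_extensions.get(f, set()) for f in good):
        return None  # at least one sufficiently certain format matches the extension
    if len(good) == 1:
        return IngestError.ID_EXTENSION_MISMATCH
    return IngestError.ID_MULTIPLE_NO_EXTENSION_MATCH
```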
The most important part here is that a format is only identified with certainty if a) the signature, MIME type and file extension given in the format registry match what is found in the actual file and b) the validation has finished successfully. Only if both conditions are met can we be sure about the result of our format identification. I want to illustrate this with the example of the TIFF format. In cases where one or both of the aforementioned conditions are not met, we DON'T have an invalid TIFF. What we DO have is something that looks suspiciously like a TIFF, but doesn't comply with the format specification, and therefore is NOT a TIFF. In other words: a format is ONLY successfully identified if the identification result is confirmed by the validation result.
So, if we assume that the changes suggested above are implemented, the whole File Extension Mismatch Handling issue would become a lot easier. Every file that fails one of the stages ends up in TA and, ideally, will be returned to the producer to be repaired (or at least the producer will be notified about 1. the file being rejected 2. the reason for that including error message 3. the necessity to repair the file and re-ingest it). Alternatively, the TA user could run a repair operation on their own, but in the majority of the cases, the file that caused the error will have to be modified until the mismatch doesn't occur any more. This usually requires human intervention and analysis; just treating the error with a rule and ignoring it doesn't solve the underlying problem. As these underlying problems are usually a risk for the data in the repository, ignoring them is not an option.
The consequences are that:
- unknown formats can't be moved to permanent (and, as outlined, they shouldn't, because knowledge of the format is a strict requirement for risk analysis).
- format IDs can't be manually assigned to files at all, because that equals postponing the solution to the future, hoping that "someone will probably find time to do it then" (hint: they won't), thus effectively ignoring the issue.
- the structure of plugin types in Rosetta needs to be refined, now strictly separating plugins for identification, validation and MD extraction, and also separating plugins by their intended format use. This probably means that existing plugins will need to be modified.
- for cases where the files can't be returned to the producer and the repair operations have to be done by Rosetta staff:
- a new plugin type "RepairPlugin" could be introduced to run very specific repair operations from within the TA workbench. The current manual approach is extremely cumbersome and doesn't scale at all.
- repair operations in the TA workbench will need to support a higher level of automation to a) classify/group files not only by their IE but also by TypeOfError/numberOfErrors/ingestDate/anythingElseThatMightBeUseful b) run repair operations on all files from the same group or c) initiate repairs via an API (get errorMessages, get broken files from same group, repair outside of Rosetta, replace broken files, NOT using insecure ImportDescriptor.csv files). Also see these public SupportCases in SalesForce:
- 00128163 bulk file replace in TA Rosetta 188.8.131.52
- 00340712 enhance TA file replace assistant Rosetta 184.108.40.206
- MD extractors are no longer assigned to Classification Groups (at least not exclusively). Instead, MD extractors should also be assignable to single format IDs. For practical reasons, formats could inherit their MD extractor setting from their Classification Group to set sensible defaults for all formats as long as no format-specific MD extractor is configured.
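The inheritance rule from the last point could look roughly like this in Python. The function name `resolve_extractor` and the three lookup tables are hypothetical and only meant to show the order of precedence: a format-specific setting wins, otherwise the default of the Classification Group applies.

```python
from typing import Dict, Optional


def resolve_extractor(format_id: str,
                      format_extractors: Dict[str, str],
                      format_groups: Dict[str, str],
                      group_extractors: Dict[str, str]) -> Optional[str]:
    """Return the name of the MD extractor to use, or None if none is configured."""
    # A format-specific extractor overrides everything else.
    if format_id in format_extractors:
        return format_extractors[format_id]
    # Otherwise inherit the default from the format's Classification Group.
    group = format_groups.get(format_id)
    if group is None:
        return None
    return group_extractors.get(group)
```

A return value of None would then correspond to error class 3.a.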
Franziska Geisser commented
In general, I would support the idea of separating more strictly the mechanisms of format identification, validation and metadata extraction. To the third paragraph listing the classes of errors, two elements might be added:
1.c.3: An extension that is not yet recorded in FL/PRONOM/YourFormatRegistryHere for the identified format
2.d: No Validator for that format available (equivalent to 3.a)
From the second addition follows a general concern:
If the absence of a Validator or MD extractor plugin is regarded as an error and makes the file end up in TA, we (speaking from the perspective of an archive that has to handle a great variety of file formats) would face a lot of extra work in the TA workbench. It seems to me that Jörg’s idea is written from the point of view of an archive dealing with a limited number of well-established formats regarded as suitable for long-term preservation. For this almost ideal situation, it seems appropriate to set strict standards and voice the consequences as rigidly as is done in the last paragraph. But in sordid reality, we need to be able to let unknown formats pass into the permanent repository – think of research data in specialized formats that are not yet recorded in PRONOM and cannot be converted into standard formats without losing vital information. While the decision whether or not to accept unknown formats should be guided by an institution’s policy, the digital preservation system itself should not in any way impose a specific policy, but should grant users the full scope of action.
Kris Dekeyser commented
"tab indentation" ... make that Markdown syntax and I would be happy too ;)
Jörg Sachse commented
@Stuart: Thanks, I think that's a great idea, especially considering how weak of a file type indicator file extensions actually are.
stuart yeates commented
I think that this would definitely be an improvement. I suggest, however, that:
MIME type and/or file extension be used to add formats to the low-priority end of the list of candidate file formats. This catches cases where the validator has better coverage than "FL/PRONOM/YourFormatRegistryHere" AND either MIME type or file extension is correct.
Jörg Sachse commented
Oh, and if the IdeaExchange software would support tab indentation, that would be really helpful. Thx!