In the course of the discussion on the "File Extension Mismatch - Change Notice" posted on BaseCamp by Opher Kutner (ExL), I noticed that some aspects of file identification, validation and MD extraction in Rosetta need improvement. What got me thinking was the statement that format identification is done by the extractor, which isn't the case. In actual fact, the MdExtractionPlugin also performs the format validation, while the format identification is run separately in DROID. However, the job of an extractor is neither format identification nor format validation; it is metadata extraction only. Usually, a preservation repository can't even know which extractor to apply before identification and validation have completed successfully. Running format validation and metadata extraction in the same step introduces problems that could be avoided by splitting them into separate steps.
From my point of view, the correct workflow would be as follows:
1. Run format identification against FL/PRONOM/YourFormatRegistryHere, get list of recognized formats (sorted descending by certainty of identification)
2. Run the format validation routine for the elements of the list from step 1, considering only format IDs that were identified with an identification certainty above a certain percentage.
2.a. If two elements from the list have the same identification probability, then both validators need to be executed. Human intervention is necessary to decide which format should be used. However, this should never occur, because PRONOM maintains a distinct prioritisation of format signatures.
2.b. If validation successful, continue with 3.
2.c. If validation fails, go back to 2. and try next element in list.
2.d. If validation fails for all elements of the list, move file to TA.
3. Choose the correct MD extractor, extract MD.
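The three steps above can be sketched in Python as follows. This is only an illustration of the proposed control flow, not Rosetta code: the `Candidate` type, the validator/extractor dictionaries, the `example/...` format IDs, the numeric certainty and the 0.8 threshold are all assumptions made for the example. For ties (2.a) the sketch runs all tied validators and asks for a human decision only if more than one of them succeeds, which is one possible reading of that step.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Candidate:
    format_id: str    # hypothetical registry ID, e.g. "example/tiff"
    certainty: float  # assumed 0.0 .. 1.0 score from the identification step


class NeedsTechnicalAnalysis(Exception):
    """The file cannot be processed automatically and goes to TA."""


def process_file(data, candidates, validators, extractors, threshold=0.8):
    """Return (format_id, metadata) or raise NeedsTechnicalAnalysis."""
    # Steps 1 and 2: keep sufficiently certain candidates, best first.
    eligible = sorted(
        (c for c in candidates if c.certainty >= threshold),
        key=lambda c: c.certainty,
        reverse=True,
    )
    if not eligible:
        raise NeedsTechnicalAnalysis("no format identified with sufficient certainty")

    # Walk through the list; candidates with equal certainty form one group (2.a).
    for _, group in groupby(eligible, key=lambda c: c.certainty):
        valid = [
            c.format_id
            for c in group
            if c.format_id in validators and validators[c.format_id](data)
        ]
        if len(valid) > 1:
            raise NeedsTechnicalAnalysis(f"ambiguous result, human decision needed: {valid}")
        if valid:
            # 2.b: validation confirmed the identification, continue with step 3.
            format_id = valid[0]
            extractor = extractors.get(format_id)
            if extractor is None:
                raise NeedsTechnicalAnalysis(f"no MD extractor for {format_id}")
            return format_id, extractor(data)
        # 2.c: validation failed, try the next element of the list.

    # 2.d: validation failed for all elements of the list.
    raise NeedsTechnicalAnalysis("validation failed for all candidate formats")


# Toy plugins: the "TIFF validator" only checks the magic bytes.
validators = {
    "example/nef": lambda d: False,
    "example/tiff": lambda d: d[:4] in (b"II*\x00", b"MM\x00*"),
}
extractors = {
    "example/tiff": lambda d: {"byteOrder": "little" if d[:2] == b"II" else "big"},
}
candidates = [
    Candidate("example/nef", 0.95),
    Candidate("example/tiff", 0.90),
    Candidate("example/other", 0.20),
]
result = process_file(b"II*\x00payload", candidates, validators, extractors)
```

The point of the sketch is that the extractor is chosen only after a validator has confirmed the identification, so the extractor never has to validate anything itself.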
This leaves us with the following classes of errors:
1. Format identification fails, because:
1.a. No recognized formats at all (empty list). This might be caused by a missing signature in the format registry.
1.b. No formats with sufficient certainty identified.
1.c. One format identified with sufficient certainty, but file extension doesn't match. This might be caused by either:
1.c.1. An extension that is actually wrong for that file type, and needs to be corrected.
1.c.2. A wrong signature that misidentifies a file. (We had that problem in an older FL version, where TIFFs were misidentified as .NEFs due to an overly general signature for the NEF format.)
1.d. Multiple formats identified with sufficient certainty, but no file extension matches.
1.e. Internal error / Format identification plugin crashes.
2. Format validation fails, because:
2.a. If exactly one format was identified with sufficient certainty:
2.a.1. One validator called, file validation doesn't return successfully.
2.b. If multiple formats were identified with sufficient certainty:
2.b.1. Validators for all detected formats were called, none returned successfully. This should never occur.
2.c. Internal error / Validator crashes.
3. MD extraction fails, because:
3.a. No MD extractor plugin for that format available.
3.b. No significant properties / DNX mapping configured for that plugin/format combination at all.
3.c. File doesn't contain any of the significant properties that are configured for that plugin/format combination.
3.d. Internal error / Format extractor plugin crashes.
ELSE One format identified with sufficient certainty, no file extension mismatch, NO ERROR.
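To make the identification error classes (1.a to 1.d and the ELSE case) a bit more concrete, here is a small Python sketch that maps an identification result to one of them. The tuple layout, the threshold and the enum names are assumptions for illustration only. Crashes (1.e) would surface as exceptions and are not modelled, and the case "several certain formats, one of which matches the extension" is not named in the list above, so the sketch gives it its own value and leaves the decision to validation.

```python
from enum import Enum


class IdentificationOutcome(Enum):
    NO_FORMAT = "1.a"
    INSUFFICIENT_CERTAINTY = "1.b"
    EXTENSION_MISMATCH = "1.c"
    AMBIGUOUS_NO_EXTENSION_MATCH = "1.d"
    MULTIPLE_WITH_EXTENSION_MATCH = "not listed, left to validation"
    OK = "ELSE"


def classify_identification(candidates, file_extension, threshold=0.8):
    """Classify an identification result.

    candidates: list of (format_id, certainty, known_extensions) tuples,
    where known_extensions is a set of lower-case extensions without dot.
    """
    if not candidates:
        return IdentificationOutcome.NO_FORMAT
    certain = [c for c in candidates if c[1] >= threshold]
    if not certain:
        return IdentificationOutcome.INSUFFICIENT_CERTAINTY

    ext = file_extension.lower().lstrip(".")
    matching = [c for c in certain if ext in c[2]]
    if len(certain) == 1:
        if matching:
            return IdentificationOutcome.OK
        return IdentificationOutcome.EXTENSION_MISMATCH
    if not matching:
        return IdentificationOutcome.AMBIGUOUS_NO_EXTENSION_MATCH
    return IdentificationOutcome.MULTIPLE_WITH_EXTENSION_MATCH


tiff = ("example/tiff", 0.95, {"tif", "tiff"})
nef = ("example/nef", 0.90, {"nef"})
outcome = classify_identification([tiff], ".TIF")
```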
The most important part here is that a format is only identified with certainty if a) the signature, MIME type and file extension given in the format registry match what is found in the actual file and b) the validation has finished successfully. Only if both conditions are met can we be sure about the result of our format identification. I want to illustrate this with the example of the TIFF format. In cases where one or both of the aforementioned conditions are not met, we DON'T have an invalid TIFF. What we DO have is something that looks suspiciously like a TIFF, but doesn't comply with the format specification, and therefore is NOT a TIFF. In other words: a format is ONLY successfully identified if the identification result is confirmed by the validation result.
So, if we assume that the changes suggested above are implemented, the whole File Extension Mismatch Handling issue would become a lot easier. Every file that fails one of the stages ends up in TA and, ideally, will be returned to the producer to be repaired (or at least the producer will be notified of 1. the file being rejected, 2. the reason for that, including the error message, and 3. the necessity to repair the file and re-ingest it). Alternatively, the TA user could run a repair operation themselves, but in the majority of cases, the file that caused the error will have to be modified until the mismatch doesn't occur any more. This usually requires human intervention and analysis; just treating the error with a rule and ignoring it doesn't solve the underlying problem. As these underlying problems are usually a risk for the data in the repository, ignoring them is not an option.
The consequences are that:
- unknown formats can't be moved to permanent (and, as outlined, they shouldn't be, because knowledge of the format is a strict requirement for risk analysis).
- format IDs can't be manually assigned to files at all, because that amounts to postponing the solution to the future, hoping that "someone will probably find time to do it then" (hint: they won't), and thus effectively ignoring the issue.
- the structure of plugin types in Rosetta needs to be refined, now strictly separating plugins for identification, validation and MD extraction, and also separating plugins by their intended format use. This probably means that existing plugins will need to be modified.
- for cases where the files can't be returned to the producer and the repair operations have to be done by Rosetta staff:
  - a new plugin type "RepairPlugin" could be introduced to run very specific repair operations from within the TA workbench. The current manual approach is extremely cumbersome and doesn't scale at all.
  - repair operations in the TA workbench will need to support a higher level of automation to a) classify/group files not only by their IE but also by TypeOfError/numberOfErrors/ingestDate/anythingElseThatMightBeUseful b) run repair operations on all files from the same group or c) initiate repairs via an API (get errorMessages, get broken files from same group, repair outside of Rosetta, replace broken files, NOT using insecure ImportDescriptor.csv files). Also see these public SupportCases in SalesForce:
    - 00128163 bulk file replace in TA [Rosetta 18.104.22.168] (https://exlibrisgroup--c.na62.visual.force.com/apex/VF_Case_WithoutJira?id=5006000000kHhL1AAK)
    - 00340712 enhance TA file replace assistant [Rosetta 22.214.171.124] (https://exlibrisgroup--c.na62.visual.force.com/apex/VF_Case_WithoutJira?id=500320000151LRhAAM)
- MD extractors are no longer assigned to Classification Groups (at least not exclusively). Instead, MD extractors should also be assignable to single format IDs. For practical reasons, formats could inherit their MD extractor setting from their Classification Group to set sensible defaults for all formats as long as no format-specific MD extractor is configured.
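The inheritance idea from the last point could look roughly like this in Python. It is only a sketch of the lookup order (format-specific setting first, Classification Group default second); the dictionaries, format IDs and extractor names are made up for the example and don't correspond to actual Rosetta configuration.

```python
from typing import Optional


def resolve_extractor(
    format_id: str,
    format_to_group: dict,
    format_extractors: dict,
    group_extractors: dict,
) -> Optional[str]:
    """Pick the MD extractor for a format ID.

    A format-specific assignment wins; otherwise the format inherits the
    default of its Classification Group. Returns None if neither exists.
    """
    if format_id in format_extractors:
        return format_extractors[format_id]
    group = format_to_group.get(format_id)
    return group_extractors.get(group)


# Hypothetical configuration.
format_to_group = {"example/tiff": "Image", "example/nef": "Image"}
group_extractors = {"Image": "GenericImageExtractor"}
format_extractors = {"example/nef": "RawImageExtractor"}

tiff_extractor = resolve_extractor(
    "example/tiff", format_to_group, format_extractors, group_extractors
)
nef_extractor = resolve_extractor(
    "example/nef", format_to_group, format_extractors, group_extractors
)
```

A `None` result would correspond to error class 3.a (no MD extractor available for that format).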
3 votes
Rosetta 5.2 has brought a new Storage Migration feature that enables institutions to restructure their permanent storage to fit changing requirements. However, the documentation currently warns that "Disconnecting legacy storage is not recommended as it may prevent the possibility of reverting to a previous IE version." (https://knowledge.exlibrisgroup.com/Rosetta/Product_Documentation/Version_5.2, Preservation Guide, page 185). As far as I can see, this limitation is due to the use of absolute paths in the IE XML files. The paths in older versions of these files cannot be updated without forging provenance information, which, of course, isn't an option. However, updating these paths would be the prerequisite for reverting to a previous IE version.
The documentation isn't clear about Rosetta's behaviour concerning older IE versions. These are the scenarios that I see:
- Rosetta might COPY older IE versions to a new storage to have the complete history stored on the new storage. The original copy on the legacy storage is kept to preserve Rosetta functionality concerning version recovery.
- Rosetta might MOVE older IE versions to a new storage to have the complete history stored on the new storage. Reverting to older versions would not be possible anymore, but the legacy storage would be cleared of all files and could be removed. The complete provenance information would be kept in older versions of the IE XML, but with some invalid file paths.
- Rosetta might KEEP older IE versions on the legacy storage and write only new information to the new storage. Reverting to older IE versions would be possible, but the legacy storage could never ever be removed. You couldn't even clean up the mount points without losing access to older versions.
From my point of view, a possible way to go would be to change Rosetta's behaviour concerning file paths in the IE XML. Currently, absolute paths are used to point from the IE XML to the payload files/master images. However, using relative paths instead would make more sense here, because IE XML files and payload files are always kept closely together anyway (at least on our storage; comments welcome), so path complexity could be removed. Also, storage migrations would not make a path rewrite necessary, because the files can be addressed by the same relative path. For older IEs, that would mean that only versions created before the first storage migration cannot be reverted to. All subsequent versions would contain relative paths and could potentially be addressed even after a storage migration. For newer IEs (ingested after the change to relative paths), that would mean that all versions can be addressed, even after storage migrations.
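A small Python illustration of why relative paths would survive a migration. All paths and file names below are invented for the example and say nothing about how Rosetta actually lays out its storage; the sketch only assumes that the IE XML and its payload are moved together and keep their position relative to each other.

```python
import posixpath


def to_relative(ie_xml_path: str, payload_path: str) -> str:
    """Express the payload location relative to the directory of the IE XML."""
    return posixpath.relpath(payload_path, start=posixpath.dirname(ie_xml_path))


def resolve(ie_xml_path: str, relative_payload: str) -> str:
    """Turn a stored relative reference back into a full path."""
    base = posixpath.dirname(ie_xml_path)
    return posixpath.normpath(posixpath.join(base, relative_payload))


# Before the migration (hypothetical legacy layout).
old_xml = "/legacy/store/2016/ie123/V1-IE123.xml"
old_payload = "/legacy/store/2016/ie123/streams/FL456.tif"
stored_reference = to_relative(old_xml, old_payload)

# After the migration the whole IE directory lives somewhere else,
# but the reference stored in the (unchanged) IE XML still resolves.
new_xml = "/new/storage/ab/cd/ie123/V1-IE123.xml"
new_payload = resolve(new_xml, stored_reference)
```

With absolute paths, the old IE XML would still point to `/legacy/store/...` after the move, which is exactly the situation that prevents disconnecting the legacy storage.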
As this idea is somewhat storage specific (depending on storage layout, storage plugins etc.), I'd appreciate comments from other customers who see possible caveats. Context about SLUB's storage layout and our plans to redesign the storage can be found in the public SupportCase 00345262 "migrating permanent storage to new path [Rosetta 126.96.36.199]".
7 votes
Besides the options to delete or to delete permanently, a retention policy should also allow not deleting at all. Many institutions want to assign a retention period to their objects, but do not want the IEs to be deleted when the retention period is over. Instead, they expect to receive a report from which they can decide for themselves which objects should be deleted.
12 votes