Rework/rewrite AIP update functionality
Since version 4, Rosetta has included a feature called AIP update. Libraries mainly use this feature to alter data in the permanent repository, which can be necessary for a number of reasons: metadata might need updating due to a typo or new research results, files that weren't included in the original ingest might be added, or existing files might be replaced, either with higher-resolution versions or to repair defects. Libraries tend to use this feature because they regularly ingest SIPs that include only a minimal set of metadata. Later, when full metadata have been assigned or OCR processing has been done, that data is added and Rosetta creates a new AIP version while preserving the original version.
In its current state, AIP update has various design flaws that endanger the stability, robustness and safety of our data. We'd like to outline the issues we see and suggest a number of changes to address them and make AIP update more secure and usable for the Rosetta user community. Please note that we use AIP update exclusively via the API, which is necessary for running automated ingest/update/access workflows. SLUB has encountered all of these issues in real-world production.
One key requirement for successfully processing AIP updates in an automated workflow is the ability to distinguish between three alternatives. Any package that is ingested into Rosetta can be:
1. a new ingest,
2. an AIP update of an existing, unlocked and finished IE in the permanent repository,
3. undecidable (internal processing in Rosetta hasn't finished yet, so the ingest has to wait for the moment).
The basis for this decision is a combination of the data returned from querying the system's APIs, namely the number of SIPs that contain the external identifier (as given in the SIP MD), the number of SIPs that contain the external identifier (as returned by the SRU search through the indexed IE-DC MD) and the processing status of each of these SIPs. To make this possible, preservation repository software needs to provide gapless tracking of the processing status for each IE right from the ingest. IE metadata contain these identifiers, but they are currently only indexed at the very end of the processing, after the IE has been moved to the permanent repository. Until that happens, we cannot use the SRU search (which relies on the Solr index) to find out whether an IE with a given external identifier is already in the system, regardless of its current processing status.
Due to this limitation, we ingest exactly one IE with each SIP, so there's always a one-to-one mapping between SIP ID and IE PID. That way, we can use the SIP metadata that is stored in the database to find out whether there's an IE with a given external identifier in the system even before it reaches the permanent repository. Until recently, we used a custom-built workaround (no longer supported in Rosetta 5.2), including a BIRT report and adventurous authentication methods, to cover most of the gap between the LOADING stage and indexing.
However, there was a remaining gap between the moment when the SIP status was set to "module PERMANENT, stage FINISHED, status FINISHED" and the moment when index processing was completed. This gap was supposed to be closed in Rosetta 5.1, when a new API was created to replace the BIRT report and provide a stable and supported solution. However, the new API does not cover the functionality of the BIRT report, making it unusable for controlling automated AIP updates. The legacy BIRT report solution returned all SIPs in the system for a given external ID, including legitimate SIPs as well as rejected and declined SIPs. The new API, however, only returns a SIP's status if exactly one SIP ID was found for a given external ID. If there's more than one SIP for a given external ID, Rosetta just returns an error instead of returning a status list for all SIPs that were found.
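The three-way decision described above can be sketched roughly as follows. This is only an illustration of the logic on the client side, not Rosetta code: the `SipStatus` record, the `classify()` function, the `REJECTED`/`DECLINED` status strings and the rule that at least one index hit is required are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    NEW_INGEST = "new ingest"
    AIP_UPDATE = "AIP update"
    UNDECIDABLE = "undecidable"


@dataclass(frozen=True)
class SipStatus:
    """Hypothetical status record for one SIP that carries the external ID."""
    sip_id: str
    module: str
    stage: str
    status: str

    @property
    def is_rejected(self) -> bool:
        # Assumed status names for rejected and declined SIPs.
        return self.status in {"REJECTED", "DECLINED"}

    @property
    def is_finished(self) -> bool:
        return (self.module, self.stage, self.status) == (
            "PERMANENT", "FINISHED", "FINISHED")


def classify(sips: list[SipStatus], indexed_hits: int,
             ie_locked: bool = False) -> Decision:
    """Decide how to treat a package with a given external identifier.

    sips         -- all SIPs known for the external ID (from the SIP MD)
    indexed_hits -- number of hits for the external ID in the SRU search
    ie_locked    -- whether the existing IE is currently locked
    """
    active = [s for s in sips if not s.is_rejected]
    if not active and indexed_hits == 0:
        return Decision.NEW_INGEST
    if (active and all(s.is_finished for s in active)
            and indexed_hits >= 1 and not ie_locked):
        return Decision.AIP_UPDATE
    # Anything else: processing or indexing has not caught up yet.
    return Decision.UNDECIDABLE
```

Note that the sketch only works if the status of every SIP for an external ID can be retrieved, including rejected and declined ones, which is exactly what the current API does not offer.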
The solution for this situation would be as follows:
- Move the indexing of all metadata as close to the beginning of SIP processing as possible. Performance considerations were used to justify keeping the indexing of the IE metadata at the very end of the processing, and even running it after the SIP has reached "module PERMANENT, stage FINISHED, status FINISHED" (which, in practice, means lying to the user and to external software). These considerations are not valid on recent hardware and justify neither the need for workarounds nor the danger imposed on the preservation data.
- Index SIP MD as well. As far as I saw in the Rosetta 5.3 announcements, this is planned for the next version and will solve some of the problems that we're having.
- Enhance the getSIPStatusInfoByExternalId() API (documented here https://developers.exlibrisgroup.com/rosetta/apis/Sipweb services) to provide the functionality needed and described above.
All information about exactly which files are supposed to be updated with which replacements is currently kept in a CSV file stored below /operational_shared/. In our environment, this is an NFSv3 share that is mounted with the nolock option set due to ExLibris requirements (which, in itself, would be dangerous enough). However, as writing to files on network shares is a non-atomic operation, the CSV files can get corrupted if there's a network glitch. This results in an AIP update going wrong with no indication whatsoever, as we had to experience first hand. From our point of view, the correct way to do this would be to keep all AIP update information in the database, where operations are transactional and can be recovered much more easily even if there's an issue with the database (e.g. via Oracle FlashBack Recovery or something similar). This approach would offer multiple advantages:
1. The AIP update information is kept safe in the database.
2. The information is kept in just one place instead of splitting it between the database and the file system.
3. Network load on the NFS shares is reduced.
4. The database handles concurrent access a lot better than NFS shares do.
Considering the importance of the data in Rosetta installations around the world and the prospective preservation durations, it is bordering on irresponsible to prefer file operations over database operations for control mechanisms.
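To illustrate what transactional storage of the update instructions buys, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the Oracle database, and the table name `aip_update_ops`, its columns and both functions are invented for this example; none of this reflects Rosetta's actual schema.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical table holding the file operations of AIP updates."""
    conn.execute(
        """
        CREATE TABLE aip_update_ops (
            ie_pid    TEXT NOT NULL,
            operation TEXT NOT NULL
                      CHECK (operation IN ('ADD', 'REPLACE', 'DELETE')),
            file_name TEXT NOT NULL,
            checksum  TEXT
        )
        """
    )


def store_update_plan(conn: sqlite3.Connection, ie_pid: str,
                      operations: list[tuple[str, str, str]]) -> None:
    """Store all operations of one AIP update, or none of them.

    operations is a list of (operation, file_name, checksum) tuples.
    """
    # The connection context manager commits if the block succeeds and
    # rolls back if it raises, so a half-written plan never becomes visible.
    with conn:
        conn.executemany(
            "INSERT INTO aip_update_ops (ie_pid, operation, file_name, checksum) "
            "VALUES (?, ?, ?, ?)",
            [(ie_pid, op, name, checksum) for op, name, checksum in operations],
        )
```

If any row of a plan is invalid, the whole plan is rejected with an exception and nothing is stored, in contrast to a CSV file that can be left half-written without anyone noticing.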
After calling the updateRepresentation() web service, Rosetta returns a RIP ID that can be used to check the status of the process. While this is a good idea in general, there are some issues with the actual implementation. As of Rosetta 5.1.0.2, the RIP ID is deleted from Rosetta immediately after the AIP update process completes. Small AIP updates tend to finish really fast, so by the time the submission application queries Rosetta about that RIP ID for the first time, it might already be gone, and Rosetta returns an error indicating that the RIP ID is unknown. We have contemplated using this behaviour as an indicator that the AIP update has finished successfully, but according to ExLibris, this wouldn't be acceptable even for a workaround (their words). In any case, the correct behaviour for a well-designed API would be to:
- issue a RIP ID on updateRepresentation() call
- return a "running" status for that RIP ID as long as the process hasn't finished
- return an "error" status for that RIP ID if anything unexpected occurs
- return a "finished" status for that RIP ID as soon as the AIP update process has finished successfully
- return any other status messages that might be needed
- EITHER delete RIP IDs no earlier than 24h after the AIP update has finished successfully OR (preferred) never delete RIP IDs and use a suitably large data type (e.g. long unsigned integer) for these IDs to make sure the system never runs out of IDs
- return an "unknown RIP ID" message only if the RIP ID actually has never existed or doesn't exist any more
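The status contract listed above could look roughly like the following in-memory model. It is a sketch of the behaviour we would expect from the API, not of how Rosetta works internally; the class `RipRegistry`, its methods and the injectable clock are assumptions made for this example, and it implements the "delete after 24h" variant.

```python
import time
from enum import Enum


class RipStatus(Enum):
    RUNNING = "running"
    ERROR = "error"
    FINISHED = "finished"
    UNKNOWN = "unknown RIP ID"


class RipRegistry:
    """Hypothetical registry that keeps RIP IDs queryable after completion."""

    def __init__(self, retention_seconds: float = 24 * 3600, clock=time.time):
        self._retention = retention_seconds
        self._clock = clock
        self._next_id = 1
        self._entries = {}  # rip_id -> (status, completion timestamp or None)

    def issue(self) -> int:
        """Issue a new RIP ID, as on an updateRepresentation() call."""
        rip_id = self._next_id
        self._next_id += 1
        self._entries[rip_id] = (RipStatus.RUNNING, None)
        return rip_id

    def complete(self, rip_id: int, success: bool = True) -> None:
        """Record the outcome of the AIP update process."""
        status = RipStatus.FINISHED if success else RipStatus.ERROR
        self._entries[rip_id] = (status, self._clock())

    def status(self, rip_id: int) -> RipStatus:
        entry = self._entries.get(rip_id)
        if entry is None:
            return RipStatus.UNKNOWN
        status, completed_at = entry
        # Only successfully finished processes expire, and only after the
        # retention period; errors stay visible until someone handles them.
        if (status is RipStatus.FINISHED
                and self._clock() - completed_at >= self._retention):
            del self._entries[rip_id]
            return RipStatus.UNKNOWN
        return status
```

With a contract like this, a client that polls for the first time a few seconds after a very short update still gets "finished" instead of an "unknown RIP ID" error.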
SupportCase 00366236 "Error in RIP Loading, IE (FileNotFoundException: Unable to read file: Import Descriptor.csv)" describes some more details about how we found out about this issue.
In any case, software using the API must be 100% confident about the process status of AIP updates that manipulate the actual repository master data.
From what we see in the Rosetta 5.2 changelog, ExLibris has implemented some changes in this area. We have yet to test the extent and reliability of the implementation.
From what we understand about the inner workings of Rosetta, some of Rosetta's internal tasks are handled via web services as well. However, these web services are undocumented. We learned about one of these web services when we encountered a failed AIP update (SupportCase 00376523 "failed AIP update on IE3761638 [Rosetta 5.0.1.1]", error message: 'SocketException from an internal web service "PermanentManagerWS": java.net.SocketException: SocketException invoking http://langzeitarchiv.slub-dresden.de/dpsws/permanent/PermanentManagerWS: Connection reset'). The search for the root cause is ongoing, but our assumption is that an internal timeout for this web service hit us before the process had finished. We only noticed this error because we regularly grep the server.log for error messages; Rosetta itself didn't give any indication whatsoever that something was wrong with our AIP update. This means that we nearly missed a failed process that is supposed to alter master production data, with the AIP still being locked in mid-process. We suggest documenting all web services and improving API error notification to make sure cases like this one are caught even in fully automated production scenarios. For us, this is especially critical because of our specific production workflow, where multiple AIP updates are needed to represent one change that came from image production. If one of those updates fails, the AIP ends up in a semantically inconsistent state where metadata might point to files that don't exist yet and were supposed to be added in the subsequent AIP update.
[continued in comments]

A major aspect of this request has been delivered as part of v6.2 – support for control over version increments through APIs – and as agreed, the area of AIP updates will continue being discussed through the working groups. Due to this, I will close the issue as completed.
Thank you,
Daniel
-
Hi Jörg. Thank you for your detailed and useful feedback, we are examining the various requests you raised regarding the AIP update process. As your request relates to many aspects of the process, requiring non-trivial effort to tackle the majority of them, I would like to suggest the following:
1. We understand one of the central pain-points is the version increment handling, whereby each file / MD update creates a new IE version. We could design and implement a solution where certain APIs could be run without commit and add a new commit/rollback API. This will grant the institution control over the version increments.
2. We would very much like to address as many issues as possible, but as our resources are limited we cannot cover such a vast set of enhancements in addition to other major work-group features in the pipeline. Therefore for the remaining issues, I would like to suggest creating a list of defined requirements, perhaps via the working groups, and raising a vote for each. I believe this will allow us to focus better on a development plan.
Thank you,
Daniel -
Jörg Sachse commented
I want to add that we already opened a SupportCase on SalesForce back in 2013 (00033297 "AIP-update on a directory basis", opened 02.10.2013 12:27) that has been in Product Manager Review ever since. It's just a very rough outline of something that has grown towards becoming a requirement, but this is the first time we articulated the idea of treating ingest and AIP update alike.
-
Jens Steidl commented
[cont 3/3]
One issue that arose as a consequence of ingest and AIP update not being processed by the same routines is the issue in SupportCase 00217330 "Feature Request: Allowing relative file paths for “updateRepresentation” web service call". We hit that problem while moving the pre-ingest processing away from the Rosetta application servers and spreading it over several individual VMs (one for each producer) to improve scalability and reduce load on the Rosetta servers. Quote: "A new Ingest triggered by a call to <submitDepositActivity> already requires a relative path to the subdirectory (SIP) and poses no problem. Rosetta looks for the provided folder name relative to the path specified in the submission format. To update files of an existing IE we use <updateRepresentation>. This web service call does not support relative file paths. We would like the AIP to support relative file paths in this use case too." The current behaviour has damaged IEs in the past, which requires additional effort and support. The suggestion is obvious: treat ingest and AIP update alike, so relative paths are supported in any case.
Overall, we got the impression that, contrary to the ingest, AIP update has been developed with little effort spent on the design of the feature. There are several inconsistencies and bugs that either require workarounds or endanger the repository's consistency. We'd love to see the whole feature redesigned and reworked with the highest priority on software quality, a consistent API and well-thought-out, predictable behaviour. Also, we'd really appreciate support for this idea from other customers.
The issue has already been discussed on the Listserv by Andreas Romeyke (who introduced the idea there), Opher Kutner, Franziska Geisser and Margaret Barram (starting Thursday, December 08, 2016 6:50 PM, Subject: "How an AIPUpdate should work, some thoughts, was: Enhance Update MD web service to support DNX update").
-
Jens Steidl commented
[cont 2/3]
Looking at the AIP update on a higher level, there are also conceptual issues. Currently, new ingests and AIP updates are handled completely differently. For new ingests via web service API, producers create a SIP XML file that describes the content of the SIP and one or more DublinCore XML files that contain basic descriptive MD for the IEs contained in this SIP. On ingest, Rosetta will read those two files, copy the metadata for the SIP and IEs and move the files through ingest processing and further into the permanent repository. For AIP updates, however, the producers have to create separate data structures to call the updateMD() web service with. These data structures contain information about the representation that is due to be updated, the files that are relevant for the update, the update operations for each file (add, replace, remove) and accompanying checksums. For these operations, separate API calls are used and ExLibris has to maintain separate code routines to handle the different workflows. Instead, the whole process could be hugely simplified by using the same routines for ingest and updates alike. Producers would deliver packages that resemble SIPs for their AIP updates and let Rosetta figure out which files need updating. As all API users face this challenge, it would be beneficial if this functionality only had to be implemented once, in Rosetta, thus reducing development effort for the customers. All that is needed for updates via SIP packages is the fileOriginalName and the file checksum, both of which can be recovered from the Rosetta METS that is passed to Rosetta during ingest:
- An ADD operation is run if both of the following conditions are met. 1. The fileSection in the metadata contains a filename that is not currently part of the existing IE. 2. The file's checksum does not match the checksum of any file currently contained in the existing IE.
- A REPLACE operation is run if any one of the following conditions is met: 1. The fileSection in the metadata contains a filename that is already in the existing IE, but the checksums differ. (Here, the file and its checksums would be replaced.) 2. The fileSection in the metadata contains a filename that is not currently part of the existing IE, but the checksums of the old file and the new file match. (Here, the filename would be altered, but the file itself would not.)
- A DELETE operation is run if both of the following conditions are met. 1. The fileSection in the metadata lacks a filename that previously existed in the existing IE. 2. The file is not physically present in the new SIP. Alternatively, a DELETE operation could be triggered by listing the file in the fileSection, but leaving the checksum field empty and leaving the physical file out of the SIP.
In this scenario, new ingests would just be a series of ADD operations into a new and empty representation, combined with an update on an empty set of metadata. Any ingest/update API call would need to contain the affected representation, just like it's already implemented for the AIP update.
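The ADD/REPLACE/DELETE rules above can be sketched as a small planning function. This is an illustration under simplifying assumptions, not a proposal for the actual implementation: the `Operation` record and `plan_operations()` are invented names, both inputs are plain mappings from fileOriginalName to checksum, an empty checksum stands for the alternative DELETE marker, and the physical presence of files in the SIP is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Operation:
    kind: str                            # "ADD", "REPLACE" or "DELETE"
    name: str                            # fileOriginalName the operation targets
    previous_name: Optional[str] = None  # set when REPLACE is a pure rename


def plan_operations(existing: dict[str, str],
                    incoming: dict[str, str]) -> list[Operation]:
    """Derive update operations from filename -> checksum mappings.

    existing -- files of the IE currently in the permanent repository
    incoming -- files listed in the fileSection of the new package
    """
    ops = []
    # Existing files whose names no longer appear: rename or delete candidates.
    missing_by_checksum = {c: n for n, c in existing.items() if n not in incoming}
    renamed = set()

    for name, checksum in incoming.items():
        if not checksum:
            continue  # empty checksum: handled as DELETE below
        if name in existing:
            if existing[name] != checksum:
                ops.append(Operation("REPLACE", name))  # same name, new content
        elif checksum in missing_by_checksum:
            old_name = missing_by_checksum[checksum]
            ops.append(Operation("REPLACE", name, previous_name=old_name))
            renamed.add(old_name)
        elif checksum not in existing.values():
            ops.append(Operation("ADD", name))          # new name, new content
        else:
            # New name whose content still exists under a retained name:
            # the rules above do not cover this case unambiguously.
            raise ValueError(f"ambiguous entry: {name}")

    for name in existing:
        if name in renamed:
            continue
        if name not in incoming or not incoming[name]:
            ops.append(Operation("DELETE", name))
    return ops
```

Calling `plan_operations({}, incoming)` yields nothing but ADD operations, which matches the view of a new ingest as a series of ADDs into an empty representation.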
The advantages are pretty clear: maintenance effort could be reduced for both ExLibris and its customers. ExLibris wouldn't have to maintain the separate and still buggy routines that are currently used for AIP update, and customers could use identical routines for ingests and updates, thus eliminating the need to implement the error-prone AIP update process that is currently required. Processes would be easier to learn, follow and debug, as complexity drops. Software testing would become easier and faster. AIP update could be done both on the full IE and in a differential manner using the same codebase. SLUB actually has a (rather imprecise) SupportCase for this issue (00033297 "AIP-update on a directory basis", last updated 3.5y ago, Product Manager Review), but we haven't used this case to go into any more detail, mainly because back then we didn't have a concept of our own of how this should work.
-
Jens Steidl commented
[cont 1/3]
Another issue that is connected to that is the difference in handling representation updates and metadata updates. While metadata updates usually are small and are handled synchronously, the larger file updates are handled asynchronously, meaning that the web service API will return even if the actual processing hasn't finished yet. The problem with this approach is that, from a producer's point of view, updates are split up into separate operations and create multiple new IE versions. I want to use an example to explain in more detail. Let's assume that you have a set of TIFF images and a metadata file containing descriptive metadata as well as fixity metadata (and some more) in your permanent repository.
For our first scenario, let's assume that you want to update the IE's dc.title. That means that the metadata file in the preservation master representation has to be updated as well. Because of this, the AIP needs to be updated in two steps: 1. metadata update to change the IE's dc.title, 2. updateRepresentation() on the metadata file to replace the preservation master. Now, here's our problem: these two steps are separate transactions in the Rosetta-AIP-update sense, and will create two versions. In reality, however, there are no two separate versions. There's only one new version of the metadata file with a new dc.title. If, for whatever reason, the second step fails, that leaves us with an IE that is semantically inconsistent, even though one new Rosetta IE version has been created. The metadata will reflect a new status (with the dc.title already updated), but the metadata file is still in its original version. This state can hardly be detected at all, making troubleshooting a cumbersome endeavour.
The problem gets even bigger when we consider a more real-life example, as shown in the second scenario. Let's assume that we want to update master images in two representations (e.g. master scans, OCR information, derivative copies, reference color wedge image) as well as descriptive metadata. That means that we have to run four updates on our AIP, thus creating four versions, three of which are useless for our needs because they describe intermediate states. The update would run as follows: 1. update descriptive DC metadata, 2. update source metadata, 3. update 1st REP, 4. update 2nd REP. Again, if any of these updates doesn't run, we might end up with a semantically inconsistent IE that looks absolutely fine to Rosetta because it doesn't have any pending transactions and is not locked. Also, even if all of these updates run smoothly, it will become hard to revert to semantically consistent versions of the IE, because there are way more IE versions than actual versions. That whole complex is currently one of our biggest issues with AIP updates.
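What we have in mind is that all steps of one logical change produce exactly one new IE version, or none at all if a step fails. The following toy model sketches that behaviour; `IntellectualEntity` and `single_version_update()` are invented for this example and do not correspond to any existing Rosetta API.

```python
import copy
from contextlib import contextmanager


class IntellectualEntity:
    """Toy model of an IE: a list of versions, each a plain dictionary."""

    def __init__(self, state: dict):
        self.versions = [copy.deepcopy(state)]

    @property
    def current(self) -> dict:
        return self.versions[-1]


@contextmanager
def single_version_update(ie: IntellectualEntity):
    """Group several update steps into one new IE version.

    All steps work on a draft copy. The draft only becomes a new version
    if the whole block completes; an exception discards it.
    """
    draft = copy.deepcopy(ie.current)
    yield draft                # an exception raised by the caller skips the commit
    ie.versions.append(draft)  # commit: exactly one new version


# Usage: both steps of the first scenario end up in one version.
ie = IntellectualEntity({"dc.title": "Old title", "files": {"mets.xml": "111"}})
with single_version_update(ie) as draft:
    draft["dc.title"] = "New title"      # step 1: metadata update
    draft["files"]["mets.xml"] = "222"   # step 2: replace the metadata file
```

If the second step raised an exception, `ie.versions` would still contain only the original version, so the IE could never be left in the half-updated state described above.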
After revising our ingest policy, we now have to add new sourceMD sections to the majority of our existing AIPs. We thought of using the updateMd() API for updating the AIP, so we created some code to call updateMd() with an empty MID, which indeed creates a new sourceMD section. However, even though this approach works, the behaviour is neither documented nor officially supported (quote S. Sterenberg, ExL: "Rosetta is currently not designed to provide the functionality you requested."). APIs to add new sub-nodes to existing source MD sections and to delete entire source MD sections or sub-nodes of source MD sections do not exist at all. Deletion can only be done manually via the WebUI.
We currently have a SupportCase with ExLibris on this issue (00349470 "Adding new metadata entries with updateMD web service [Rosetta 5.0.1.1]"), but as long as this feature isn't available, "real" AIP updates for ca. 57,000 AIPs will fail due to the missing sourceMD fields. There is already an idea that is related to this part of our idea and concerns DNX MD updates. We'd love to see other customers support that idea as well. (see idea "Enhance UpdateMD web service to support DNX update", http://ideas.exlibrisgroup.com/forums/308179-rosetta/suggestions/17356324-enhance-updatemd-web-service-to-support-dnx-update)
-
Jens Steidl commented
[cont 1/3]
Another issue that is connected to that is the difference in handling representation updates and metadata updates. While metadata updates usually are small and are handled in a synchronous manner, the larger file updates are handled in an asynchronous manner, meaning that the web service API will return even if the actual processing hasn't finished yet. The problem with this approach is that, from a producer's point of view, updates are split up into separate operations and create multiple new IE versions. I want to use an example to explain in more detail. Let's assume that you have a set of TIFF images and a metadata file containing descriptive metadata as well as fixity metadata (and some more) in your permanent repository. For our first scenario, lets assume that you want to update the IE's dc.title. That means that the metadata file in the preservation master representation will be updated as well. Due to this behaviour, the AIP needs to be updated in two steps: 1. metadata update to change the IE's dc.title, 2. updateRepresentation() on the metadata file to replace the preservation master. Now, here's our problem: these two steps are separate transactions in the Rosetta-AIP-update-sense, and will create two versions. In reality, however, there are no two separate versions. There's only one new version of the metadata file with a new dc.title. If, for whatever reason, the second step fails, that leaves us with an IE that is semantically inconsistent, even though one new Rosetta IE version has been created. The metadata will reflect a new status (with the dc.title already updated), but the metadata file is still in its original version. This state can hardly be detected at all, making troubleshooting a cumbersome endeavour. The problem gets even bigger when we consider a more real life example as shown in the second scenario. Let's assume that we want to update master images in two representations (i.e. 
master scans, OCR information, derivative copies, reference color wedge image) as well as descriptive metadata. That means that we have to run four updates on our AIP, thus creating four versions, three of which are useless for our needs because they describe intermediate states. The update would run as follows: 1. update descriptive DC metadata, 2. update source metadata, 3. update 1st REP, 4. update 2nd REP. Again, if any of these updates doesn't run, we might end up with a semantically inconsistent IE that looks absolutely fine to Rosetta because it doesn't have any pending transactions and is not locked. Also, even if all of these updates run smoothly, it will become hard to revert back to semantically consistent versions of the IE, because there are way more IE versions that actual versions. That whole complex is currently one of our biggest issues with AIP updates.
After revising our ingest policy, we now have to add new sourceMD sections to the majority of our existing AIPs. We thought of using the updateMd() API for updating the AIP, so we created some code to call updateMd() with an empty MID, which indeed creates a new sourceMD section. So, even though this approach works, the behaviour is neither documented nor officially supported (quote S. Sterenberg, ExL: "Rosetta is currently not designed to provide the functionality you requested."). APIs to add new sub-nodes to existing source MD sections and to delete entire source MD sections or sub-nodes of source MD sections do not exist at all. Deletion can only be done manually via the WebUI.
We currently have a SupportCase with ExLibris on this issue (00349470 "Adding new metadata entries with updateMD web service [Rosetta 5.0.1.1]"), but as long as this feature isn't available, "real" AIP updates for ca. 57,000 AIPs will fail due to the missing sourceMD fields.There is already an idea that is related to this part of the idea and concerns DNX MD updates. We'd love to see other customers support that idea as well. (see idea "Enhance UpdateMD web service to support DNX update", http://ideas.exlibrisgroup.com/forums/308179-rosetta/suggestions/17356324-enhance-updatemd-web-service-to-support-dnx-update)
-
Jörg Sachse commented
[cont 1/3]
Another issue that is connected to that is the difference in handling representation updates and metadata updates. While metadata updates usually are small and are handled in a synchronous manner, the larger file updates are handled in an asynchronous manner, meaning that the web service API will return even if the actual processing hasn't finished yet. The problem with this approach is that, from a producer's point of view, updates are split up into separate operations and create multiple new IE versions. I want to use an example to explain in more detail. Let's assume that you have a set of TIFF images and a metadata file containing descriptive metadata as well as fixity metadata (and some more) in your permanent repository. For our first scenario, lets assume that you want to update the IE's dc.title. That means that the metadata file in the preservation master representation will be updated as well. Due to this behaviour, the AIP needs to be updated in two steps: 1. metadata update to change the IE's dc.title, 2. updateRepresentation() on the metadata file to replace the preservation master. Now, here's our problem: these two steps are separate transactions in the Rosetta-AIP-update-sense, and will create two versions. In reality, however, there are no two separate versions. There's only one new version of the metadata file with a new dc.title. If, for whatever reason, the second step fails, that leaves us with an IE that is semantically inconsistent, even though one new Rosetta IE version has been created. The metadata will reflect a new status (with the dc.title already updated), but the metadata file is still in its original version. This state can hardly be detected at all, making troubleshooting a cumbersome endeavour. The problem gets even bigger when we consider a more real life example as shown in the second scenario. Let's assume that we want to update master images in two representations (i.e. 
master scans, OCR information, derivative copies, reference color wedge image) as well as descriptive metadata. That means that we have to run four updates on our AIP, thus creating four versions, three of which are useless for our needs because they describe intermediate states. The update would run as follows: 1. update descriptive DC metadata, 2. update source metadata, 3. update 1st REP, 4. update 2nd REP. Again, if any of these updates doesn't run, we might end up with a semantically inconsistent IE that looks absolutely fine to Rosetta because it doesn't have any pending transactions and is not locked. Also, even if all of these updates run smoothly, it will become hard to revert back to semantically consistent versions of the IE, because there are way more IE versions that actual versions. That whole complex is currently one of our biggest issues with AIP updates.
After revising our ingest policy, we now have to add new sourceMD sections to the majority of our existing AIPs. We thought of using the updateMd() API for updating the AIP, so we created some code to call updateMd() with an empty MID, which indeed creates a new sourceMD section. So, even though this approach works, the behaviour is neither documented nor officially supported (quote S. Sterenberg, ExL: "Rosetta is currently not designed to provide the functionality you requested."). APIs to add new sub-nodes to existing source MD sections and to delete entire source MD sections or sub-nodes of source MD sections do not exist at all. Deletion can only be done manually via the WebUI.
We currently have a SupportCase with ExLibris on this issue (00349470 "Adding new metadata entries with updateMD web service [Rosetta 5.0.1.1]"), but as long as this feature isn't available, "real" AIP updates for ca. 57,000 AIPs will fail due to the missing sourceMD fields.There is already an idea that is related to this part of the idea and concerns DNX MD updates. We'd love to see other customers support that idea as well. (see idea "Enhance UpdateMD web service to support DNX update", http://ideas.exlibrisgroup.com/forums/308179-rosetta/suggestions/17356324-enhance-updatemd-web-service-to-support-dnx-update)
-
Jörg Sachse commented
[cont 2/3]
Seeing the AIP update on a higher level, there are also conceptual issues with Rosetta's AIP update. Currently, new ingests and AIP updates are handled completely different. For new ingests via web service API, producers create a SIP XML file that describes the content of the SIP and one or more DublinCore XML files that contain basic descriptive MD for the IEs contained in this SIP. On ingest, Rosetta will read those two files, copy the metadata for the SIP and IEs and move the files through ingest processing and further into the permanent repository. For AIP updates, however, the producers will have to create separate data structures to call the updateMD() web service with. These data structures contain information about the representation that is due to be updated, the files that are relevant for the update, the update operations for each file (add, replace, remove) and accompanying checksums. For these operations, separate API calls are used and ExLibris has to maintain separate code routines to handle the different workflows. Instead, the whole process could be hugely simplified by using the same routines for ingest and updates alike. Producers would deliver packages that resemble SIPs for their AIP updates and let Rosetta figure out which files need updating. As all API users face this challenge, it would be beneficial if this functionality would only have to be implemented in Rosetta, thus reducing development effort for the customers. All that is needed for updates via SIP packages is the fileOriginalName and the file checksum, both of which can be recovered from the Rosetta METS that is passed to Rosetta during ingest:
- An ADD operation is run if both of the following conditions are met. 1. The fileSection in the metadata contains a filename that is not currently part of the existing IE. 2. The file's checksum does not match the checksum of any file currently contained in the existing IE.
- A REPLACE operation is run if either of the following conditions is met. 1. The fileSection in the metadata contains a filename that is already in the existing IE, but the checksums differ. (Here, the file and its checksums would be replaced.) 2. The fileSection in the metadata contains a filename that is not currently part of the existing IE, but the checksums of the old file and the new file match. (Here, the filename would be altered, but the file itself would not.)
- A DELETE operation is run if both of the following conditions are met. 1. The fileSection in the metadata lacks a filename that previously existed in the existing IE. 2. The file is not physically present in the new SIP. Alternatively, a DELETE operation could be triggered by listing the file in the fileSection, but leaving the checksum field empty and leaving the physical file out of the SIP.
In this scenario, new ingests would just be a series of ADD operations into a new and empty representation, combined with an update on an empty set of metadata. Any ingest/update API call would need to contain the affected representation, just like it's already implemented for the AIP update.
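The ADD/REPLACE/DELETE rules above can be sketched as a small comparison of two filename-to-checksum listings. This is only an illustration of the proposed behaviour, not existing Rosetta code: the function name classify_operations and the dictionary layout are hypothetical, the listing of the new package stands in for both the fileSection and the physical files, and a new filename whose checksum matches an old file that is still listed is treated as an ADD (an assumption, since the rules above do not cover that case).

```python
def classify_operations(existing, incoming):
    """Derive file operations from two {filename: checksum} listings.

    existing: files of the IE in the permanent repository.
    incoming: files listed in the new package; an empty checksum
              requests deletion (the alternative DELETE trigger).
    Returns a list of (operation, filename, renamed_from) tuples.
    """
    operations = []
    renamed_from = set()

    # Index the existing files by checksum to detect pure renames.
    names_by_checksum = {}
    for name, checksum in existing.items():
        names_by_checksum.setdefault(checksum, []).append(name)

    for name, checksum in sorted(incoming.items()):
        if not checksum:
            # Listed without checksum: explicit deletion request.
            if name in existing:
                operations.append(("DELETE", name, None))
            continue
        if name in existing:
            if existing[name] != checksum:
                # Known filename, different content.
                operations.append(("REPLACE", name, None))
            continue  # same name and checksum: nothing to do
        # New filename: is it old content under a new name?
        candidates = sorted(
            old for old in names_by_checksum.get(checksum, [])
            if old not in incoming and old not in renamed_from
        )
        if candidates:
            renamed_from.add(candidates[0])
            operations.append(("REPLACE", name, candidates[0]))
        else:
            operations.append(("ADD", name, None))

    # Files that vanished from the listing and were not renamed.
    for name in sorted(existing):
        if name not in incoming and name not in renamed_from:
            operations.append(("DELETE", name, None))
    return operations
```

Called with an empty existing listing, the function yields only ADD operations, which matches the view of a new ingest as a series of ADDs into an empty representation.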
The advantages are pretty clear: maintenance effort could be reduced for both ExLibris and its customers. ExLibris wouldn't have to maintain the separate and still buggy routines that are currently used for AIP update, and customers could use identical routines for ingests and updates, thus eliminating the need to implement the error-prone AIP update process that is currently required. Processes would be easier to learn, follow and debug, as complexity drops. Software testing would become easier and faster. AIP update could be done both on the full IE and in a differential manner using the same codebase. SLUB actually has a (rather imprecise) SupportCase for this issue (00033297 "AIP-update on a directory basis", last updated 3.5y ago, Product Manager Review), but we haven't used this case to go into any more detail, mainly because back then we didn't have a concept of our own of how this should work.
-
Jörg Sachse commented
[cont 3/3]
One issue that arose as a consequence of ingest and AIP update not being processed by the same routines is the issue in SupportCase 00217330 "Feature Request: Allowing relative file paths for “updateRepresentation” web service call". We hit that problem while moving the pre-ingest processing away from the Rosetta application servers and spreading it over several individual VMs (one for each producer) to improve scalability and reduce load on the Rosetta servers. Quote: "A new Ingest triggered by a call to <submitDepositActivity> already requires a relative path to the subdirectory (SIP) and poses no problem. Rosetta looks for the provided folder name relative to the path specified in the submission format. To update files of an existing IE we use <updateRepresentation>. This web service call does not support relative file paths. We would like the AIP to support relative file paths in this use case too." The current behaviour has damaged IEs in the past, which requires additional effort and support. The suggestion is obvious: treat ingest and AIP update alike, so relative paths are supported in any case.
Overall, we got the impression that, contrary to the ingest, AIP update has been developed with low effort on the design part of the feature. There are several inconsistencies and bugs that require workarounds or bugs that endanger the repository's consistency. We'd love to see the whole feature redesigned and reworked with the highest priority on software quality, a consistent API and with thought-out, predictable behaviour. Also, we'd really appreciate support for this idea from other customers.
The issue has already been discussed on the Listserv by Andreas Romeyke (who introduced the idea there), Opher Kutner, Franziska Geisser and Margaret Barram (starting Thursday, December 08, 2016 6:50 PM, Subject: "How an AIPUpdate should work, some thoughts, was: Enhance Update MD web service to support DNX update").
-
Jörg Sachse commented
[cont 1/3]
Another issue that is connected to that is the difference in handling representation updates and metadata updates. While metadata updates are usually small and are handled synchronously, the larger file updates are handled asynchronously, meaning that the web service API will return even if the actual processing hasn't finished yet. The problem with this approach is that, from a producer's point of view, updates are split up into separate operations and create multiple new IE versions. I want to use an example to explain in more detail.
Let's assume that you have a set of TIFF images and a metadata file containing descriptive metadata as well as fixity metadata (and some more) in your permanent repository. For our first scenario, let's assume that you want to update the IE's dc.title. That means that the metadata file in the preservation master representation will be updated as well. Due to this behaviour, the AIP needs to be updated in two steps: 1. metadata update to change the IE's dc.title, 2. updateRepresentation() on the metadata file to replace the preservation master. Now, here's our problem: these two steps are separate transactions in the Rosetta-AIP-update sense and will create two versions. In reality, however, there are no two separate versions. There's only one new version of the metadata file with a new dc.title. If, for whatever reason, the second step fails, that leaves us with an IE that is semantically inconsistent, even though one new Rosetta IE version has been created. The metadata will reflect the new status (with the dc.title already updated), but the metadata file is still in its original version. This state can hardly be detected at all, making troubleshooting a cumbersome endeavour.
The problem gets even bigger when we consider a more real-life example as shown in the second scenario. Let's assume that we want to update master images in two representations (e.g. master scans, OCR information, derivative copies, reference color wedge image) as well as descriptive metadata. That means that we have to run four updates on our AIP, thus creating four versions, three of which are useless for our needs because they describe intermediate states. The update would run as follows: 1. update descriptive DC metadata, 2. update source metadata, 3. update 1st REP, 4. update 2nd REP. Again, if any of these updates doesn't run, we might end up with a semantically inconsistent IE that looks absolutely fine to Rosetta because it doesn't have any pending transactions and is not locked. Also, even if all of these updates run smoothly, it becomes hard to revert to semantically consistent versions of the IE, because there are far more IE versions than actual versions. That whole complex is currently one of our biggest issues with AIP updates.
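The versioning problem in the second scenario can be illustrated with a toy model. The sketch below does not call any Rosetta API: ToyIE, apply_stepwise and apply_atomically are hypothetical names, and the all-or-nothing variant only shows the behaviour we would like to see, under the assumption that all partial updates could be committed as a single transaction.

```python
class ToyIE:
    """In-memory stand-in for an IE; every committed change adds a version."""

    def __init__(self, state):
        self.versions = [dict(state)]

    @property
    def current(self):
        return self.versions[-1]

    def update(self, changes):
        """Commit one change set as exactly one new version."""
        new_state = dict(self.current)
        new_state.update(changes)
        self.versions.append(new_state)


def apply_stepwise(ie, steps, fail_at=None):
    """Commit each step as its own transaction (the behaviour described above)."""
    for index, changes in enumerate(steps):
        if index == fail_at:
            return False  # earlier steps stay committed
        ie.update(changes)
    return True


def apply_atomically(ie, steps, fail_at=None):
    """Hypothetical alternative: commit all steps as one version, or none."""
    merged = {}
    for index, changes in enumerate(steps):
        if index == fail_at:
            return False  # nothing has been committed yet
        merged.update(changes)
    ie.update(merged)
    return True


initial = {"dc": "old", "sourceMD": "old", "rep1": "old", "rep2": "old"}
steps = [{"dc": "new"}, {"sourceMD": "new"}, {"rep1": "new"}, {"rep2": "new"}]

stepwise = ToyIE(initial)
apply_stepwise(stepwise, steps, fail_at=2)  # third update fails
print(len(stepwise.versions), stepwise.current)

atomic = ToyIE(initial)
apply_atomically(atomic, steps, fail_at=2)  # same failure, nothing committed
print(len(atomic.versions), atomic.current)
```

In the stepwise run, a failure in the third update leaves two committed intermediate versions and a mixed old/new state; in the atomic run, the IE stays at its original, consistent version.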
After revising our ingest policy, we now have to add new sourceMD sections to the majority of our existing AIPs. We thought of using the updateMd() API for updating the AIP, so we created some code to call updateMd() with an empty MID, which indeed creates a new sourceMD section. So, even though this approach works, the behaviour is neither documented nor officially supported (quote S. Sterenberg, ExL: "Rosetta is currently not designed to provide the functionality you requested."). APIs to add new sub-nodes to existing source MD sections and to delete entire source MD sections or sub-nodes of source MD sections do not exist at all. Deletion can only be done manually via the WebUI.
We currently have a SupportCase with ExLibris on this issue (00349470 "Adding new metadata entries with updateMD web service [Rosetta 5.0.1.1]"), but as long as this feature isn't available, "real" AIP updates for ca. 57,000 AIPs will fail due to the missing sourceMD fields. There is already a related idea that concerns DNX MD updates, and we'd love to see other customers support it as well. (see idea "Enhance UpdateMD web service to support DNX update", http://ideas.exlibrisgroup.com/forums/308179-rosetta/suggestions/17356324-enhance-updatemd-web-service-to-support-dnx-update)