Issues with indexing in Rosetta
Here at SLUB, we're seeing a few issues with the way indexing is implemented in Rosetta.
Currently, all Rosetta servers that are in the "Indexer" role hold a single part of the whole index with no redundancy between the indexing nodes. If there's an outage in one of the servers, there are two effects.
N°1: One third of the index entries will not be available and the missing server will not take part in indexing any new IEs, obviously. The IE that was just being processed by this server will not be in the index and information about it will not be available, even though it has already reached the permanent repository and is in status "module PERMANENT, stage FINISHED, status FINISHED".
N°2: As the remaining Rosetta servers continue to work like nothing happened, they might mistake an AIP Update coming from the Submission Application for a new ingest, thus generating duplicate IEs. This is due to the fact that Submission Applications can only rely on the data returned by the web service API, which in turn queries the index. Incomplete index data will hence lead to wrong behaviour in the Submission Application. This can result in duplicates of IEs with the same descriptive MD and external identifiers that represent different versions of the same package coming from the production workflow, some of which may even be incomplete due to differential AIP updates.
Reliable index data is extremely important for controlling fully automated workflows. SRU is the only way to find the IE PID for a given external identifier. The IE PID, in turn, is needed to call the updateRepresentation() and exportIE() APIs. As long as there's no API call that returns the IE PID for a given external ID by querying the database and not relying on the index, the Solr servers need to be highly available in order for any automated workflow to function flawlessly. Apart from the problems with ingest/AIP update mentioned earlier, incomplete index contents will also lead to wrong reporting results, which in turn leads to wrong billing and status information being delivered to tenants/3rd parties. This is even more important when you take the announced changes in Rosetta 5.3 into consideration, where indexing will be expanded to include SIP MD.
To date, there's no easy way of knowing that one of the indexing services has crashed or is unavailable. There's neither a warning or UI notification given to the user, as there is for other errors, nor (and this is even more relevant for highly automated workflows like ours) an SNMP notification that would let us use established monitoring tools to monitor service health. An inoperative or incomplete index is not usable for automatic ingest workflows and should make the Submission Application pause all ingests immediately.
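To illustrate the behaviour we have in mind, here is a minimal Python sketch of such a guard on the Submission Application side. It is not Rosetta code: `decide_action`, `is_node_healthy` and `lookup_ie_pid` are hypothetical names, and the two callables stand in for whatever health check and SRU lookup a site actually uses. The only point is that the ingest-versus-update decision is refused while any indexing node is not confirmed healthy.

```python
from typing import Callable, Iterable, Optional, Tuple


class IndexIncomplete(Exception):
    """Raised when at least one indexing node is not confirmed healthy."""


def decide_action(
    external_id: str,
    index_nodes: Iterable[str],
    is_node_healthy: Callable[[str], bool],          # hypothetical health check
    lookup_ie_pid: Callable[[str], Optional[str]],   # hypothetical SRU lookup
) -> Tuple[str, Optional[str]]:
    """Decide between a new ingest and an AIP update for one package."""
    # Refuse to decide anything while part of the index may be missing:
    # a "not found" answer from an incomplete index proves nothing.
    down = [node for node in index_nodes if not is_node_healthy(node)]
    if down:
        raise IndexIncomplete("indexing nodes unavailable: " + ", ".join(down))

    ie_pid = lookup_ie_pid(external_id)
    if ie_pid is None:
        return ("ingest", None)   # no IE known for this external identifier
    return ("update", ie_pid)     # existing IE, so treat it as an AIP update
```

A caller would catch `IndexIncomplete` and pause the ingest queue instead of submitting the package, which avoids the duplicate IEs described above.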
Also, there's no redundancy between the local indexes held on the indexing servers. This is highly undesirable from an operations point of view, as redundancy is key when working to increase availability. Luckily, Solr already has all the prerequisites for failover built into it, so ExLibris should be able to implement an appropriate solution without any negative impact and with minimum effort. More information can be found at https://wiki.apache.org/solr/SolrReplication.
We are aware that the Index Status can be seen at "Rosetta Administration → Repository → Index Status", but it's rather buried there and critical failures are not visible at first sight on the Dashboard. During the last outage, only our own homebrew server.log error monitoring script gave a warning the day after by sending us the relevant error message from the log.
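Until something like this is built in, an external probe is one possible stopgap. The sketch below assumes that the Solr instances on the indexing servers are reachable over HTTP and have Solr's standard ping request handler (`/admin/ping`) enabled. We don't know whether that holds for the Solr bundled with Rosetta, so the base URL and core name are placeholders and the function names are our own.

```python
import json
import urllib.request


def ping_reports_ok(payload: str) -> bool:
    """Return True only if a Solr ping response body reports status OK."""
    try:
        data = json.loads(payload)
    except ValueError:
        return False
    return isinstance(data, dict) and data.get("status") == "OK"


def solr_node_is_up(base_url: str, core: str, timeout: float = 5.0) -> bool:
    """Query the ping handler of one Solr core; any failure counts as down.

    base_url and core are site-specific assumptions, for example
    "http://indexer1:8983/solr" and the name of the core in use.
    """
    url = f"{base_url.rstrip('/')}/{core}/admin/ping?wt=json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return ping_reports_ok(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error or malformed URL.
        return False
```

Run from cron or wrapped as a check for an existing monitoring tool, this would at least raise an alarm on the same day rather than the day after.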
This idea is based on Support case #00374004, Title: "Solr server down [Rosetta 220.127.116.11]" and #00173827 "indexing jobs not balanced [Rosetta 18.104.22.168]".
Michelle Lindlar commented
Fully agree about the functioning index - my comment was intended to underline that.
It just wasn't clear to me which actions you were proposing. Now it is, thanks!
Jörg Sachse commented
With a product as complex and, frankly, expensive as Rosetta, we expect to get all of the information needed from the system without having to maintain any external databases that contain data Rosetta already has. Apart from the additional effort of maintaining external systems, there's also the problem of keeping Rosetta in sync with them, so this solution is highly undesirable.
As for your questions:
1. Yes, we propose using redundancy between the indexing nodes. Also, there should be an easy, documented and supported way for customers to add dedicated standalone vanilla Solr servers for more flexibility/scalability/bandwidth/redundancy.
2. We'd like Rosetta to be as verbose as possible about any error conditions, including indexing errors. This includes partial index failures as well as IE exceptions, though the former are certainly a more serious problem than the latter. However, you have to take into consideration that we use Rosetta mainly through its APIs, and we think that the different error conditions (partial index failures vs. IE exceptions vs. any other conditions) should probably not be reported through the same channel/webservice/interface. For more general failures, we have already suggested in a support case that ExLibris add an SNMP interface to monitor Rosetta functionalities (see SC 00157283 "SNMP-Interface for Rosetta [Rosetta 22.214.171.124]" https://exlibrisgroup.my.salesforce.com/5003200000qHgxf for details).
Apart from all that was said before, the Solr component in Rosetta is (afaik) mainly undocumented and not administerable by the customer, which is a shortfall in itself that should be corrected.
Michelle Lindlar commented
We've also seen quite a few indexing problems in the past. A fully working index is crucial, especially when we have to rely on combined criteria searches across Permanent for statistics (e.g. for specific workflows, producers or 3rd parties). As IEs frequently get stuck in the exception queue, we currently keep external statistics to be able to account for all objects. While having an external statistic is most likely advisable in any case, the system should be able to generate correct statistics itself, or at least make IE exceptions easier to understand and fix.
I certainly support a more robust Index - did I understand you correctly that you are proposing to achieve that via:
1. redundancy between indexing nodes
2. better / more visible error handling (and is this only in case of partial index failures or also for things such as IE exceptions? - we would favor the latter)