Remove CDI constant expansion of results
PCI results respect that a user knows what they want, but help with expansion when necessary, such as by stemming when the exact query returns very few results. A user also has the autonomy to target their results with techniques such as quotation marks, Boolean operators, and Advanced Search.
In CDI, expansion is constant, by term inflection applied to all searches, as well as higher recall in general by design. This cannot be prevented by features such as Boolean operators, quotation marks or Advanced Search, and is illogical in conjunction with other features like 'Did you mean'.
Note: It is recognized that these may represent different underlying mechanisms, but it is the user outcome which is key.
Ex Libris states that CDI behaviour "does not affect precision" because of Verbatim Match Boost, which will rank more highly the exact search query, but they have designed this as also meaning "near verbatim", completely undermining this concept.
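To make the distinction concrete, here is a minimal sketch of boosting versus filtering. It is a toy model only, not CDI's actual ranking: the records, base scores, and the VERBATIM_BOOST value are all invented for illustration, and the result set is assumed to have already been expanded by term inflection.

```python
# Toy records: (title, base relevance score). All values are invented.
RESULTS = [
    ("World atlas of birds", 2.0),
    ("ATLA Religion Database guide", 1.5),
    ("Atlas of anatomy", 1.8),
]

VERBATIM_BOOST = 1.0  # assumed additive boost, not a real CDI setting


def is_verbatim(query: str, title: str) -> bool:
    """Case-insensitive whole-word match on the original, unstemmed title."""
    return query.lower() in title.lower().split()


def boosted(query: str) -> list[str]:
    """Boosting reorders results but keeps every non-verbatim match."""
    scored = [
        (title, score + (VERBATIM_BOOST if is_verbatim(query, title) else 0.0))
        for title, score in RESULTS
    ]
    return [title for title, _ in sorted(scored, key=lambda p: p[1], reverse=True)]


def filtered(query: str) -> list[str]:
    """Filtering returns only the verbatim matches, ordered by base score."""
    ranked = sorted(RESULTS, key=lambda p: p[1], reverse=True)
    return [title for title, _ in ranked if is_verbatim(query, title)]
```

In this sketch, boosted("ATLA") still returns all three records, with the atlas titles trailing behind, while filtered("ATLA") returns only the ATLA record. Sorting by anything other than relevance, or a saved search alert, would expose the non-verbatim records in the boosted case.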
Use case: If a user searches for ATLA, they expect results only for ATLA and not atlas. In CDI, 7 of the 10 first-page results, including the No.1 result, are clearly returned on the basis of atlas, as shown by term highlighting. CDI also offers a "Did you mean: atlas", which barely changes the results when clicked.
Use case: If a user searches for a DOI, they expect only that specific resource. CDI returns dozens of results, often with no term highlighting or snippets to explain why. Only a time-consuming check of the full text reveals that the DOI appears in the Reference List. There is no clear pathway to the actually correct known item, and this is not consistently fixed just by ranking changes. For example, if we do not hold that article in full text, the user sees dozens of results, none of which are correct, instead of the expected pathway of the Zero results message, using the expansion checkbox, and then leveraging the metadata in the 'No full-text' citation to prefill an ILL form.
This is the opposite of precision.
The design should centre the user and the needs they directly express when entering their search query, allowing them both to target their search with the techniques above and to expand their search, an option which is even more important given the larger CDI index.
One option could be returning the expected targeted results matching the user query, and then offering a suggestion similar to a clickable Did you mean or Controlled Vocabulary features, such as: "Results also referencing [query]"
This improvement is planned for the 2024 Roadmap.
-
Manu_Schwendener commented
Can this be closed as fixed with the November 2024 release?
-
Stacey van Groll commented
This is the scoping statement by Primo Product Management for the 2023 enhancements round, which is important to consider. It is different in some ways to my submission here, but I am likely to consider it fulfilled with the delivery of #8210 (as generated by my #7683 submission which was No.1 but rejected for scoping in 2022). This is pending practical experience of testing once it's available.
"The request is to use verbatim search when quotation marks are used, today in CDI we are using a payload mechanism where we index one variation of a word and using the original variations for ranking - for example we can index only student even if the title includes students in plural or composing vs composition ...
this mechanism was identified as the main cause to the variations of search results that appear even when using "".
Today in CDI, we apply language-specific text analyzers to both indexed terms and search terms. For example, the term “students” appearing a record will be indexed as “student”. The current payload mechanism as discussed is used to support the current verbatim match boosting feature meaning boosts the relevance scores of verbatim matches between search terms and indexed terms, without excluding non-verbatim matches.
The proposed enhancement is to modify CDI’s search algorithm to use the same payload mechanism to limit the matches to verbatim matches only when double quotes are used. meaning in this cases we will ignore other variations.
Solution will be focused in improving CDI searches in the following use cases:
* using quotation mark ""
* using "is (exact)" in the Advanced Search
In these cases CDI will do matching based on the original terms and will minimize cases caused by the payload mechanism.
The estimation for this solution is :
* Covering 90% of cases by improving matching using "" - 35 points
Additional information:
* As mentioned this solution will be limited to searches on metadata fields -> it will not include full text searches ( as we don't save the original terms to exclude in case of "" )
* Also to set expectations : differences in diacritics and other character variations will be considered to be non-verbatim matches. They will not be cross-searchable with double quotes. Examples:
** fiancé vs. fiance
** 大學 vs. 大学
Note: casing differences will not be considered to be non-verbatim matches. They will remain cross-searchable with double quotes. For example: University vs. university or AIDS vs aids
This is so far the solution we intend to implement and estimated as 35 points
* Handling additional edge cases like using hyphen and more... - Additional of 20 points
Additional information:
During the analysis phase we also identify special cases that require additional effort estimated as additional 20 points for example:
* Compound words (e.g., workplace)
* Hyphen, and other punctuation marks.
* Handling languages with special functionality like Chinese, Japanese, Hebrew, and German
* Special cases of CJK text included in non-CJK records
In any case we would like to mention the limitation of full text searches - if using the include full text searches option- users may still get results with the variations coming from full text. changing this also to support full text matching will be highly cost and may impact indexing size in CDI."
-
Manu Schwendener commented
Made it through round 2 of NERS and should be possible by autumn 2024.
-
Manu Schwendener commented
NERS 8210, round 2 open for voting now.
-
Ethan55 commented
Good information. Thank you!
-
Manu Schwendener commented
NERS 8210, open for voting now.
-
Manu Schwendener commented
> In CDI, expansion is constant, by term inflection applied to all searches
As a side effect, this also leads to incomprehensible/confusing numbers of hits with truncated search:
faultier 412 hits
https://basel.swisscovery.org/discovery/search?query=any,contains,faultier&tab=41SLSP_CDI&search_scope=CentralIndex&vid=41SLSP_UBS:live&lang=en&offset=0
faultier* 271 hits
https://basel.swisscovery.org/discovery/search?query=any,contains,faultier*&tab=41SLSP_CDI&search_scope=CentralIndex&vid=41SLSP_UBS:live&lang=en&offset=0
---
The problem is documented, but patrons can't know that:
"A wildcard search does not necessarily return more results than the same search without the wildcard. This is because CDI’s multilingual search features (such as stemming/lemmatization, synonym mapping and spelling normalization) are not applied to wildcard searches."
-
Manu Schwendener commented
> In CDI, expansion is constant, by term inflection applied to all searches
We just had a patron stumble over this.
Not only does he get the impression that search in our catalog is broken, but he also wasted a lot of time chasing an article that in the end turned out to NOT FIT his search criteria.
-
Stacey van Groll commented
The set of 3 ideas which would drastically improve irrelevant and meaningless CDI results, by restoring and adding search tools which empower our users to target their search and their results, and fixing the design decisions which make these tools very necessary:
-
Stacey van Groll commented
A user story showing one of the issues with this design:
I am a user interested in new resources with the exact terms “student consult”, and I’m pleased to find that my Library offers a feature of a weekly Saved Search Alert email, as I’m time poor.
The next week, I get a saved search alert email for a single new item returned by my query, and I’m excited to explore this resource that my Library has sent to me.
I click on the link to navigate to the record, but I’m surprised to see that the record doesn’t appear to have my exact terms, with nothing in the record matching this.
I’m confused, and so I navigate to the full text, and Ctrl-F to search for my terms, but I don’t get any hits on “student consult”.
I change my search to just “student”, and then I finally see that the full text includes text of: "Ask students to consult the literature…"
I am extremely annoyed, because I explicitly set up a search query exactly for “student consult” as I know this is what quotation marks should do to target queries, and I feel like my institution’s library has wasted my time.
Per Ex Libris documentation, this outcome is expected, because stop words are not indexed in the full text and quotation marks do not prevent expansion.
So, “student” is expanded to “students” and the presence of “to” in the full text is ignored, meaning that “student consult” matches to “students to consult”.
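A minimal sketch of that matching behaviour, for illustration only. The stop-word list and the crude plural stripping are my own stand-ins, not CDI's actual analyzer:

```python
# Toy illustration only: not CDI's actual analyzer, stop-word list, or stemmer.
STOP_WORDS = {"to", "the", "a", "an", "of"}  # assumed, tiny stop-word list


def naive_stem(token: str) -> str:
    """Crude plural stripping standing in for a real language analyzer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop stop words, then stem each token."""
    tokens = [t.strip('.,;:!?"“”…') for t in text.lower().split()]
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]


def phrase_matches(query: str, document: str) -> bool:
    """True if the analyzed query tokens appear consecutively in the analyzed document."""
    q, d = analyze(query), analyze(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


def verbatim_phrase_matches(query: str, document: str) -> bool:
    """Exact phrase match on the original text, ignoring only letter case."""
    return query.lower() in document.lower()
```

With the text "Ask students to consult the literature…", phrase_matches("student consult", ...) is True, because both sides are reduced to the tokens student and consult sitting next to each other. verbatim_phrase_matches is False for the same input, which is the outcome the user in this story expected from quotation marks.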
Ex Libris recognises this is a problem in the OLH ie “On the downside, they contribute to a longer tail of results that may be less or not relevant to the users’ intentions.”
But they also think that this is acceptable: “As full text matches are ranked far lower than metadata matches, material with the exact phrase in the metadata will almost always outrank them in the result list. However, full text matches can become important if there are no or very few results with the exact phrase in the metadata, and it can lead to other relevant findings.”
The assumptions that Ex Libris is making here, all of which are false:
• getting no results or few results is always a bad thing, which must be avoided at all costs
• users will not want to sort their results
• users will not want to use any facets
• users are only searching in UI manually every time, and not setting up saved search alerts
In sum, it is assumed that the only way Primo is being used is by a search, with relevance ranking, and that users only care about the top results in Primo, and therefore CDI design is 'working as expected'.
“Some” users are served by this, and perhaps you could argue even the majority, but the needs of experienced researchers are ignored and apparently considered unimportant.
Primo should be sophisticated enough to support the needs of all users.
It is a regression and downgrade in the service offered by our Library.
-
Denise Green, CARLI Illinois commented
I agree, the CDI needs more options for focusing and precision.
-
Stacey van Groll commented
Some user-focused reasons to vote:
* Do you get complaints about the deluge of irrelevant results?
* Would you like your experienced researchers to be able to find exactly what they need by their targeted query, with the use of Boolean operators, quotation marks, and Advanced Search?
* Would you like these users to be able to sort their results for review other than by relevance (not possible with the long tail), and take full advantage of features like Saved Search Alerts?
-
A Rowe commented
Researchers often want everything on a topic. Expanding results gives them unnecessary additions to filter through. Having a way to search without search expansion would greatly improve the researcher experience.
-
Katharina Wolkwitz commented
It would be nice to be able to answer the "Did you mean: [xxx]" question with a simple "no", which results in a search for just what the user entered in the search field.
Stemming and synonyms are all very nice and possibly helpful, but this would be ridiculous if it were not so demeaning and invasive. It takes the whole decision of what to search out of the user's hands!
The user should always have the choice to state "I am sure that I meant exactly what I typed in that field!"
-
Knut A Bøckman commented
Excellent idea, and convincingly argued. Thanks for posting; there went my last votes (only 2, unfortunately)
-
François Renaville commented
Thanks for submitting this idea, Stacey. We have received complaints from staff and patrons about the constant expansion.