@jeromeroucou
Contributor

@jeromeroucou jeromeroucou commented Sep 12, 2024

What this PR does / why we need it:

This PR makes it possible to harvest repositories that expose metadata using additional namespaces.

Some repositories extend "oai_dc" with additional namespaces. For example, SEANOE exposes some metadata in the dct namespace. Below is the result of https://www.seanoe.org/oai/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:seanoe.org:41307

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dct="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
    <responseDate>2024-09-12T12:09:21Z</responseDate>
    <request verb="GetRecord" metadataPrefix="oai_dc" identifier="oai:seanoe.org:41307">
        https://www.seanoe.org/oai/OAIHandler</request>
    <GetRecord>
        <record>
            <header>
                <identifier>oai:seanoe.org:41307</identifier>
                <datestamp>2021-05-12</datestamp>
                <setSpec>GROUP:EMSO</setSpec>
                <setSpec>ec_fundedresources</setSpec>
            </header>
            <metadata>
                <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                    xmlns:dc="http://purl.org/dc/elements/1.1/"
                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
                    <dc:title>Iridium GPS 1 data from the EMSO-Azores observatory, 2014-2015</dc:title>
                    <dc:creator>Legrand, Julien</dc:creator>
                    <dc:creator>Sarradin, Pierre-marie</dc:creator>
                    <dc:creator>Cannat, Mathilde</dc:creator>
                    <dc:subject>Mid-Atlantic Ridge</dc:subject>
                    <dc:subject>EMSO</dc:subject>
                    <dc:subject>Lucky Strike</dc:subject>
                    <dc:subject>Time-series</dc:subject>
                    <dc:subject>Environmental monitoring node</dc:subject>
                    <dc:subject>MoMAR</dc:subject>
                    <dc:subject>BOREL</dc:subject>
                    <dc:subject>GPS</dc:subject>
                    <dc:subject>Position</dc:subject>
                    <dc:description>This dataset contains the GPS positions of the EMSO-Azores
                        transmission buoy BOREL acquired between July 2014 and April 2015 using the
                        Iridium/GPS modem 1 (data acquired every 6 hours).</dc:description>
                    <dc:publisher>SEANOE</dc:publisher>
                    <dc:date>2015-10</dc:date>
                    <dc:type>dataset</dc:type>
                    <dc:identifier>DOI:10.17882/41307</dc:identifier>
                    <dc:identifier>https://doi.org/10.17882/41307</dc:identifier>
                    <dc:identifier>https://www.seanoe.org/data/00302/41307/</dc:identifier>
                    <dc:relation>info:eu-repo/grantAgreement/EC/FP7/312463/EU//FIXO3</dc:relation>
                    <dc:coverage>North 37.30134, South 37.2888, East -32.275618, West -32.27982</dc:coverage>
                    <dct:references>https://www.seanoe.org/data/00302/41307/</dct:references>
                    <dcterms:spatial xsi:type="DCTERMS:Box">37.2888 -32.27982 37.30134 -32.275618</dcterms:spatial>
                    <dc:rights>CC-BY</dc:rights>
                </oai_dc:dc>
            </metadata>
        </record>
    </GetRecord>
</OAI-PMH>

Currently, this record cannot be harvested because the following exception occurs:

Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared namespace prefix "dct"
  at [row,col {unknown-source}]: [5,555]
       at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:634)
       at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:504)
       at com.ctc.wstx.sr.InputElementStack.resolveAndValidateElement(InputElementStack.java:503)
       at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:3066)
       at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2928)
       at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1122)
       at edu.harvard.iq.dataverse.api.imports.ImportGenericServiceBean.processXMLElement(ImportGenericServiceBean.java:209)
       at edu.harvard.iq.dataverse.api.imports.ImportGenericServiceBean.processOAIDCxml(ImportGenericServiceBean.java:180)
       ... 100 more
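
The failure can be reproduced outside Dataverse. The following is a minimal sketch, assuming the parser is handed only the `<oai_dc:dc>` fragment, so that the `dct` declaration on the `<OAI-PMH>` root is no longer visible. The class name `UndeclaredPrefixDemo`, the method `parses`, and the shortened fragments are illustrative, and the sketch uses the JDK's built-in StAX parser rather than Woodstox, so the exception type and message will differ from the trace above.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class UndeclaredPrefixDemo {

    /** Returns true if the document can be read to the end by a namespace-aware StAX reader. */
    static boolean parses(String xml) {
        // Namespace processing is enabled by default.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            // An element prefix with no visible declaration ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        String declarations = "xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" "
                + "xmlns:dc=\"http://purl.org/dc/elements/1.1/\"";
        String body = "<dc:title>Iridium GPS 1 data</dc:title>"
                + "<dct:references>https://www.seanoe.org/data/00302/41307/</dct:references>";

        // Fragment as extracted from the record: dct is not declared anywhere.
        String fragment = "<oai_dc:dc " + declarations + ">" + body + "</oai_dc:dc>";
        // Same fragment with the dct declaration restored (illustrative only).
        String withDct = "<oai_dc:dc " + declarations
                + " xmlns:dct=\"http://purl.org/dc/terms/\">" + body + "</oai_dc:dc>";

        System.out.println("fragment parses: " + parses(fragment));
        System.out.println("with dct declared: " + parses(withDct));
    }
}
```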

We propose ignoring everything that is not in the dc namespace, which avoids the WstxParsingException.
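
One way to get this behaviour with StAX is to turn off namespace processing and keep only the elements whose raw name starts with `dc:`. The following is a sketch under that assumption, not necessarily the change made in this PR; `DcOnlyReader`, `rawName`, and `readDcFields` are hypothetical names, and the JDK's built-in StAX parser stands in for Woodstox.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class DcOnlyReader {

    /** Rebuilds "prefix:local" whether or not the parser split the prefix off. */
    static String rawName(XMLStreamReader reader) {
        String prefix = reader.getPrefix();
        String local = reader.getLocalName();
        return (prefix == null || prefix.isEmpty()) ? local : prefix + ":" + local;
    }

    /** Collects "name=value" for every dc: element and silently skips everything else. */
    static List<String> readDcFields(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Without namespace processing, an undeclared prefix such as dct is just part of the name.
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> fields = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = rawName(reader);
                    if (name.startsWith("dc:")) {
                        // Assumes dc: elements contain text only, as in oai_dc.
                        fields.add(name + "=" + reader.getElementText().trim());
                    }
                }
            }
        } finally {
            reader.close();
        }
        return fields;
    }

    public static void main(String[] args) throws XMLStreamException {
        String fragment = "<oai_dc:dc>"
                + "<dc:title>Iridium GPS 1 data</dc:title>"
                + "<dct:references>https://www.seanoe.org/data/00302/41307/</dct:references>"
                + "<dc:rights>CC-BY</dc:rights>"
                + "</oai_dc:dc>";
        readDcFields(fragment).forEach(System.out::println);
    }
}
```

The trade-off of this design is that prefixes are matched as plain strings, so a repository that binds the Dublin Core namespace to a prefix other than `dc` would be skipped as well.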

Which issue(s) this PR closes:

No related issue found

Special notes for your reviewer:

Not really, but I have a suggestion for extending the scope of this pull request in a follow-up PR (or issue): the ForeignMetadataFormatMapping could be made more flexible and used for namespaces other than dcterms. With that, we could add a mapping for the dct namespace.

Suggestions on how to test this:

Add a new harvesting client for the https://www.seanoe.org/oai/OAIHandler server with the GROUP:EMSO set.
Before this PR, all datasets fail with an error; with this PR, all datasets are imported.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No

Is there a release notes update needed for this change?:

A release note snippet has been added.

@coveralls

coveralls commented Sep 12, 2024

Coverage Status

coverage: 21.854% (-0.002%) from 21.856%
when pulling baeffdc on Recherche-Data-Gouv:harvest_exclude_invalid_tag
into b28812b on IQSS:develop.

@qqmyers
Member

qqmyers commented Sep 12, 2024

FYI: You might want to look at/review #10836 which I think is doing something similar but more extensive.

@luddaniel
Contributor

> FYI: You might want to look at/review #10836 which I think is doing something similar but more extensive.

@qqmyers I'm not sure there is a link.

#10837 applies earlier, at the dsDTO = importGenericService.processOAIDCxml(xmlToParse); call, where XML namespaces can cause problems for several reasons: FastGetRecord XML truncation, the dc: prefix requirement, and a possible XMLStreamException. Also, a generic OAI archive can send customised oai_dc content, as in the example above.

If I missed something, could you shed some light on it for me?

@qqmyers
Copy link
Member

qqmyers commented Sep 25, 2024

Sorry - I agree it's not related. I just saw the note about skipping entries that would fail and wanted to make sure you saw the other PR, but looking at your code I see you're addressing problems in even reading the XML input.

@jeromeroucou jeromeroucou marked this pull request as ready for review September 27, 2024 13:10
@pdurbin pdurbin added the Type: Feature a feature request label Oct 9, 2024
@jeromeroucou
Contributor Author

Hi @pdurbin! Is there a chance this small PR could be included in version 6.5? 🙏

@landreev landreev self-assigned this Nov 8, 2024
@landreev landreev added Feature: Harvesting GREI 2 Consistent Metadata labels Nov 8, 2024
@cmbz cmbz added GREI 3 Search and Browse and removed GREI 2 Consistent Metadata labels Nov 8, 2024
@landreev landreev added GREI 2 Consistent Metadata FY25 Sprint 10 FY25 Sprint 10 (2024-11-06 - 2024-11-20) GREI 3 Search and Browse and removed GREI 3 Search and Browse GREI 2 Consistent Metadata labels Nov 8, 2024
@pdurbin
Member

pdurbin commented Nov 8, 2024

@jeromeroucou we moved it to "ready for review". Thanks for the PR!

Contributor

@landreev landreev left a comment

@jeromeroucou Thank you for the PR! I am moving it into "ready for QA".
When you have a chance, please sync the branch with develop.
Our plan is to include this change in the 6.5 release next month.

@ofahimIQSS ofahimIQSS self-assigned this Nov 18, 2024
@ofahimIQSS
Contributor

Testing passed, merging PR.
[Testing of 10837.docx](url)

@ofahimIQSS ofahimIQSS merged commit 42d00d1 into IQSS:develop Nov 19, 2024
@pdurbin pdurbin added this to the 6.5 milestone Nov 19, 2024
@jeromeroucou jeromeroucou deleted the harvest_exclude_invalid_tag branch November 19, 2024 15:59
@ofahimIQSS ofahimIQSS removed their assignment Nov 22, 2024

Labels

Feature: Harvesting FY25 Sprint 10 FY25 Sprint 10 (2024-11-06 - 2024-11-20) GREI 3 Search and Browse Type: Feature a feature request

Projects

Status: 🚀 Done (Recherche Data Gouv)

8 participants