Harvesting Zenodo Metadata with OAI-PMH (DataCite v3)¶
Zenodio’s zenodio.harvest
module provides a Pythonic interface to record metadata in a Zenodo community collection.
Zenodio uses the standard OAI-PMH harvesting protocol and specifically retrieves DataCite v3-flavored XML, the most complete and most recent metadata standard used by Zenodo.
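To make the protocol concrete, here is a minimal sketch of how an OAI-PMH ListRecords request for a community might be assembled. The endpoint URL, the oai_datacite3 metadata prefix, and the user-<identifier> set naming are assumptions about Zenodo's OAI-PMH service, and build_list_records_url is a hypothetical helper, not part of Zenodio.

```python
from urllib.parse import urlencode

# Assumed Zenodo OAI-PMH endpoint; check Zenodo's documentation before relying on it.
ZENODO_OAI_URL = "https://zenodo.org/oai2d"


def build_list_records_url(community_name, metadata_prefix="oai_datacite3"):
    """Build a ListRecords request URL for a community (hypothetical helper)."""
    params = {
        "verb": "ListRecords",                # standard OAI-PMH verb
        "metadataPrefix": metadata_prefix,    # assumed prefix for DataCite v3 XML
        "set": "user-" + community_name,      # assumed set naming for communities
    }
    return ZENODO_OAI_URL + "?" + urlencode(params)


url = build_list_records_url("lsst-dm")
print(url)
```

Zenodio's own harvest_collection() hides this step; the sketch only shows the kind of request that sits underneath.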
Quick Start¶
To show how harvesting metadata from Zenodo works, we’ll get records from LSST Data Management’s lsst-dm Community.
You begin by providing the community’s identifier to zenodio.harvest.harvest_collection():
from zenodio.harvest import harvest_collection
collection = harvest_collection('lsst-dm')
collection
is a zenodio.harvest.Datacite3Collection
instance for the Zenodo community’s record collection.
Use its records()
method to generate Datacite3Record
instances for each record stored in the Zenodo community:
for record in collection.records():
    print(record.title)
Or you can get a list of all records:
records = list(collection.records())
With these Datacite3Record
instances you can access information about individual artifacts on Zenodo through simple class attributes.
For example:
record = records[0]
print(record.title)
print(record.issue_date)
print(record.doi)
print(record.abstract_html)
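As a rough illustration of working with these attributes, the sketch below gathers a few of them into a plain dict, for example to export as JSON. The summarize function is hypothetical, and fake_record is a stand-in object with made-up values so that the example runs without network access; a real Datacite3Record from Zenodio would be passed in the same way, assuming issue_date is a datetime.datetime.

```python
import datetime
from types import SimpleNamespace


def summarize(record):
    """Collect a few record attributes into a dict (hypothetical helper)."""
    return {
        "title": record.title,
        "doi": record.doi,
        # Assumes issue_date is a datetime.datetime instance.
        "issued": record.issue_date.date().isoformat(),
    }


# Stand-in for a Datacite3Record; all values are placeholders.
fake_record = SimpleNamespace(
    title="Example Technical Note",
    doi="10.5281/zenodo.0000000",
    issue_date=datetime.datetime(2016, 1, 1),
    abstract_html="<p>Example abstract.</p>",
)

summary = summarize(fake_record)
print(summary)
```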
For information about authors, Zenodio provides an Author
class.
For example:
authors = record.authors
print(','.join([a.last_name for a in authors]))
API Reference¶
Convenience Functions¶
-
zenodio.harvest.
harvest_collection
(community_name)¶ Harvest a Zenodo community’s record metadata.
Examples
You can harvest record metadata for a Zenodo community via its identifier name. For example, the identifier for LSST Data Management’s Zenodo collection is 'lsst-dm':
>>> from zenodio.harvest import harvest_collection
>>> collection = harvest_collection('lsst-dm')
collection is a Datacite3Collection instance. Use its records() method to generate Datacite3Record objects for individual records in the Zenodo collection.
Parameters: community_name (str) – Zenodo community identifier.
Returns: collection – The Datacite3Collection instance with record metadata downloaded from Zenodo.
Return type: zenodio.harvest.Datacite3Collection
Metadata Classes¶
-
class
zenodio.harvest.
Datacite3Collection
(xml_records)¶ Zenodo metadata for a Community collection derived from Datacite v3 metadata.
Use the from_collection_xml() classmethod to build a Datacite3Collection from XML obtained from the Zenodo OAI-PMH API. Most likely, users should use harvest_collection() to build a Datacite3Collection for a Community.
-
classmethod
from_collection_xml
(xml_content)¶ Build a Datacite3Collection from Datacite3-formatted XML.
Users should use zenodio.harvest.harvest_collection() to build a Datacite3Collection for a Community.
Parameters: xml_content (str) – Datacite3-formatted XML content.
Returns: collection – The collection parsed from Zenodo OAI-PMH XML content.
Return type: Datacite3Collection
-
records
()¶ Yield records from the collection.
Yields: record (Datacite3Record) – The Datacite3Record for an individual resource in the Zenodo collection.
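The parse-then-yield pattern described above can be sketched with the standard library alone. This is not Zenodio's implementation: SketchCollection is a hypothetical class, the sample XML is a heavily simplified OAI-PMH response with placeholder identifiers, and it yields plain identifier strings rather than Datacite3Record objects.

```python
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH 2.0 response envelope.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Simplified, made-up response; real responses also carry metadata payloads.
SAMPLE_XML = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""


class SketchCollection:
    """Hypothetical stand-in illustrating the classmethod and generator pattern."""

    def __init__(self, record_elements):
        self._record_elements = record_elements

    @classmethod
    def from_collection_xml(cls, xml_content):
        # Parse the XML once and keep every <record> element.
        root = ET.fromstring(xml_content)
        return cls(list(root.iter(OAI_NS + "record")))

    def records(self):
        # Yield one item per record, lazily.
        for element in self._record_elements:
            yield element.findtext(OAI_NS + "header/" + OAI_NS + "identifier")


identifiers = list(SketchCollection.from_collection_xml(SAMPLE_XML).records())
print(identifiers)
```

Because records() is a generator, callers can iterate over it directly or wrap it in list() when they need every record at once.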
-
class
zenodio.harvest.
Datacite3Record
(xml_dict)¶ Zenodo metadata for a single record.
Use Datacite3Record instances to access metadata about a record through convenient object properties.
Parameters: xml_dict (collections.OrderedDict) – A dict-like object mapping XML content for a single record (i.e., the contents of the record tag in OAI-PMH XML). This dict is typically generated from xmltodict.
-
abstract_html
¶ Abstract text, marked up with HTML (str).
-
authors
¶ List of Author instances (zenodio.harvest.Author).
Authors correspond to creators in the Datacite schema.
-
doi
¶ Digital object identifier (str).
-
issue_date
¶ Date when the DOI was issued (datetime.datetime).
-
title
¶ Title of resource (str).
If there are multiple titles, the first title is returned.
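The first-title rule can be illustrated with a small sketch over an xmltodict-style mapping. The function first_title and the exact dictionary layout are assumptions for illustration, not Zenodio's code. The sketch relies on xmltodict's usual conventions: repeated elements become a list, and an element with attributes becomes a dict whose text sits under '#text'.

```python
def first_title(resource):
    """Return the first title from an xmltodict-style mapping (hypothetical)."""
    titles = resource["titles"]["title"]
    # Repeated <title> elements are represented as a list; take the first.
    if isinstance(titles, list):
        titles = titles[0]
    # A <title> with attributes is a dict with its text under '#text'.
    if isinstance(titles, dict):
        return titles["#text"]
    return titles


single = {"titles": {"title": "Only Title"}}
multiple = {"titles": {"title": ["Main Title", "Alternative Title"]}}
with_attrs = {"titles": {"title": [{"@titleType": "Subtitle", "#text": "A Subtitle"}]}}

print(first_title(single))
print(first_title(multiple))
print(first_title(with_attrs))
```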
-
-
class
zenodio.harvest.
Author
(last_first, orcid=None, affiliation=None)¶ Metadata about an author.
Author instances are typically built by Datacite3Record.authors().
Parameters:
- last_first (str) – Author’s name, formatted as ‘Last, First’.
- orcid (str, optional) – Author’s ORCiD.
- affiliation (str, optional) – Author’s affiliation.
-
last_first
¶ str – Author’s name, formatted as ‘Last, First’.
-
orcid
¶ str – Author’s ORCiD.
-
affiliation
¶ str – Author’s affiliation.
-
first_name
¶ Author’s first name (str).
-
classmethod
from_xmldict
(xml_dict)¶ Create an Author from Datacite3 metadata converted by xmltodict.
Parameters: xml_dict (collections.OrderedDict) – A dict-like object mapping XML content for a single record (i.e., the contents of the record tag in OAI-PMH XML). This dict is typically generated from xmltodict.
-
last_name
¶ Author’s last name (str).
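One plausible way to derive first_name and last_name from a ‘Last, First’ string is sketched below. SketchAuthor is a hypothetical class written for illustration; Zenodio's actual Author may split names differently, and the example name is a placeholder.

```python
class SketchAuthor:
    """Hypothetical stand-in mirroring the documented Author attributes."""

    def __init__(self, last_first, orcid=None, affiliation=None):
        self.last_first = last_first
        self.orcid = orcid
        self.affiliation = affiliation

    @property
    def last_name(self):
        # Everything before the first comma.
        return self.last_first.split(",", 1)[0].strip()

    @property
    def first_name(self):
        # Everything after the first comma, or an empty string if none.
        parts = self.last_first.split(",", 1)
        return parts[1].strip() if len(parts) > 1 else ""


author = SketchAuthor("Lovelace, Ada", affiliation="Example Institute")
print(author.first_name, author.last_name)
```

Splitting only on the first comma keeps suffixes or additional commas in the given-name part rather than discarding them.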