Ticket #16 (closed task: fixed)

Opened 6 years ago

Last modified 5 years ago

Need to determine what format to use for SIP/AIP/DIP (Open Archival Information System reference model)

Reported by: amit Assigned to: ronald
Priority: critical Milestone:
Component: ingest Version: 0.5-SNAPSHOT
Keywords: AIP SIP DIP ingest Cc: amit
Blocking: Blocked By:

Description (Last modified by amit)

The Open Archival Information System (OAIS) reference model defines three information packages:

  1. Submission Information Package (SIP)
  2. Archival Information Package (AIP)
  3. Dissemination Information Package (DIP)

There are a few standards defined by multiple agencies in this area, the most common one being METS. However, even METS is quite generic, and we need to come up with a definite profile of it. This needs to be discussed with PLoS for clarification.

Change History

04/24/06 20:37:20 changed by amit

  • status changed from new to assigned.

04/25/06 16:11:10 changed by amit

Talked with Rich about this, and he said he needs to discuss it internally with Rebecca. Told him we are leaning towards PMC, and he agreed given the timeline, but he will talk to Rebecca and confirm.

04/25/06 18:53:12 changed by amit

To add clarification, the AIP format is already determined by Fedora and that is FOXML. What we need to determine is what the SIP and the DIP are and how the conversion to and from FOXML will happen.

04/26/06 11:14:20 changed by amit

Response from AllenPress:

We're using PMC DTD version 2.0. They (PMC) just did an update to bring the most current version of the DTD to 2.1. I'm not sure when we'll migrate to that as Kevan and the rest of the folks in Electronic Publishing need to have some say in the matter.

Response from John Wilbanks (Science Commons):

But i wanted to point you towards the National Library of Medicine DTD for publications if you don't know it yet. If you guys will support natively i can get you some good partnerships. And it's becoming a real standard - library of congress, british library among others.

I have sent a message to Rich asking for Rebecca's opinion on how the two compare.

04/26/06 18:17:18 changed by amit

Turns out the PMC DTD version 2.0 and NLM DTD are one and the same. Response from Rebecca:

PMC DTD 2.0 is the NLM DTD. One and the same.

I have asked Rich to get some clarification from Allen Press on the structure they use for an issue and new articles (zip file, directory structure, etc.).

04/28/06 13:17:41 changed by amit

Response from Rebecca:

To be honest, I'm not going to be much help on this question, as I'm not sure how AP packages things up for PMC. I don't think I've ever seen the directory structure they use. We'll need to ask Susan Dunavan at AP, once she's back from vacation.


We need to forge ahead at least on the article front by assuming PMC 2.0 and make certain assumptions on Issues/Journals (minimalistic API) to save on time till we get more information from Rebecca.

05/01/06 16:07:55 changed by amit

Asked for and received contact information for folks at PubMed and have sent them an email both to establish contact and to establish context for questions to follow. It would save us time if the format created by Allen Press works for Topaz and Allen Press for the October release.

05/10/06 15:43:52 changed by amit

  • status changed from assigned to closed.
  • resolution set to fixed.

The structure we will go with is a zip file for ingest containing:

  • zip-file
    • pmc.xml (PMC 2.0 DTD)
    • <doi>.<ext> (0..* of these)
  • no directories in zip (all files at top-level)
  • links in pmc must contain absolute doi URI's (e.g. "doi:10.1371/journal.pbio.0020411") if referring to any of the additional objects in the zip. (Note: the stuff on the CD's currently contains only the doi itself, not an absolute uri)
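As a rough illustration of the layout rules above, here is a minimal sketch of a structural check. It only covers what the list states (a pmc.xml entry, no directories, extra files carrying an extension); the function names and message strings are made up for this example and are not taken from the Topaz ingest code.

```python
import io
import zipfile
from typing import List


def check_topaz_layout(zip_bytes: bytes) -> List[str]:
    """Return layout problems; an empty list means the layout looks fine."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = zf.namelist()
    if "pmc.xml" not in names:
        problems.append("missing pmc.xml")
    for name in names:
        if "/" in name:
            # Covers both directory entries and files inside directories.
            problems.append(f"entry not at top level: {name}")
        elif name != "pmc.xml" and "." not in name:
            problems.append(f"entry has no extension: {name}")
    return problems


def make_zip(names: List[str]) -> bytes:
    """Hypothetical helper: build an in-memory zip with empty entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()
```

The check does not look inside pmc.xml, so the link rule in the last bullet is not covered here.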

05/10/06 15:46:23 changed by amit

  • description changed.

Edited description.

05/15/06 13:23:48 changed by amit

  • status changed from closed to reopened.
  • resolution deleted.

Ed has a couple of concerns:

  1. He believes that PLoS intends to do away with <ext> long term
  2. It could potentially break the AP to PubMed Central process

I checked with Rebecca with regards to <ext> and she said:

To my knowledge, we will always have extensions. All Ed and I
discussed was not doing that for annotations and comments. All
other associated objects would have extensions, as they do now.

With regards to the second point, Rebecca was not quite sure. AP is coming to town on May 25th and we will chat with them in detail. In the meantime we will go ahead with the scheme we have in mind and hopefully the changes needed after talking to AP will be minor.

05/30/06 13:08:13 changed by amit

  • milestone changed from dodo to snowcrash.

Our meeting with AP shed a little more light in this area:

  1. We will not be required to use Fedora versions. Rebecca would like users to use annotations to mark up articles and after a while they could release a new paper with a new DOI (as far as we are concerned a new article). This simplifies the various formats quite dramatically
  2. Talked to Kevan Meinershagen (developer, AP) about our ideas on absolute URIs in the PMC XML in the SIP. While he said it probably would not be a big problem, he wants to see the details to be sure. I have received his email from Julie Rinke (Director of Client Services and Support, AP) and will be following up with him.

Because these modifications will be ongoing (we now have the added features of working with PMC and CrossRef) and will probably be for interop and documentation purposes, this will be a moving ticket. Moving to snowcrash.

06/14/06 15:27:10 changed by amit

  • milestone changed from snowcrash to topaz_newton.

06/23/06 20:51:34 changed by amit

  • status changed from reopened to new.

06/23/06 20:51:43 changed by amit

  • status changed from new to assigned.

08/02/06 20:01:57 changed by ebrown

(In [364]) Script to generate article zip files suitable for ingestion

Need to get more than one in for alert testing.

re #16

08/09/06 13:58:17 changed by amit

Have created an FTP environment for Allen Press to load files from the JMS to Merlin. We will use these to test our ingestion process. Rich expects files to be in by the end of this week.

08/26/06 22:52:28 changed by amit

  • owner changed from amit to ronald.
  • status changed from assigned to new.
  • milestone changed from TBD to september10.

Transferring to Ronald to debug the ingestion data from Allen Press.

08/28/06 13:06:46 changed by ronald

  • status changed from new to assigned.

08/29/06 20:56:36 changed by amit

From Ronald sent to Allen Press:

Official Topaz Requirements:
----------------------------

  * zip-file
     o pmc.xml (PMC 2.0 DTD)
     o <doi>.<ext> (0..* of these) 
  * no directories in zip (all files at top-level)
  * links in pmc must contain absolute doi URI's (e.g.
    "doi:10.1371/journal.pbio.0020411") if referring to any of the additional
    objects in the zip. (Note: the stuff on the CD's currently contains only
    the doi itself, not an absolute uri) 


AP's files:
-----------

  * zip file                     (ok)
    o pmc is named <doi>.xml     (error)
    o <doi-part>.<ext>           (error)
  * has directory <doi>          (error)
  * links are totally broken     (error)


Details on the errors:
----------------------

  * there should be no directories. If this is really a problem for them,
    then I could probably handle this in the code without much difficulty.
    But everything would have to be in the same directory (they already do
    this).

  * the pmc needs to be named pmc.xml . 

  * the additional files should use the full doi for the file name
    with the '/' and '.' urlescaped (except for the '.' marking the file
    extension).

  * the links: where to begin? Many/most of them refer to stuff that is
    not even in the zip. E.g. in 000000040, they reference things like
    "pone-000000040-sg001.pdf" which doesn't exist.
    
    When they do reference things that are present, they use the partial
    doi (i.e. they're consistent with the above error in the filename).

    Also, they have xlink:href's of type uri that are not proper
    absolute URI's, e.g. "www.redgreengene.com" instead of
    "http://www.redgreengene.com/" ("www.redgreengene.com" is a relative
    URI and would have to be interpreted relative to how you got the
    document, e.g. as "https://www.redgreengene.com/" if you got it via an
    https URL, or even "file://www.redgreengene.com" if you got the doc as
    a file).

    And lastly, the mimetypes are off too in various places. Example from
    0000000026:

    <supplementary-material id="pone-000000026-s001" mimetype="application/pdf"
    xlink:href="pone-000000026-s001.tif">
    
    A) there is no ...-s001.tif in the zip, only a -s001.doc
    B) neither the actual file nor the broken link refer to a pdf file


Example layouts:
----------------

Theirs:

  pone.000000040/pone.000000040.g001.tif
  pone.000000040/pone.000000040.pdf
  pone.000000040/pone.000000040.s001.doc
  pone.000000040/pone.000000040.s002.doc
  pone.000000040/pone.000000040.s003.doc
  pone.000000040/pone.000000040.s004.doc
  pone.000000040/pone.000000040.s005.doc
  pone.000000040/pone.000000040.s006.doc
  pone.000000040/pone.000000040.s007.doc
  pone.000000040/pone.000000040.s008.doc
  pone.000000040/pone.000000040.s009.doc
  pone.000000040/pone.000000040.s010.doc
  pone.000000040/pone.000000040.s011.doc
  pone.000000040/pone.000000040.xml 

What we want:

  10%2E1371%2Fjournal%2Epone%2E000000040%2Eg001.tif
  10%2E1371%2Fjournal%2Epone%2E000000040.pdf
  10%2E1371%2Fjournal%2Epone%2E000000040%2Es001.doc
  10%2E1371%2Fjournal%2Epone%2E000000040%2Es002.doc
  10%2E1371%2Fjournal%2Epone%2E000000040%2Es003.doc
  10%2E1371%2Fjournal%2Epone%2E000000040%2Es004.doc
  10%2E1371%2Fjournal%2Epone%2E000000040%2Es005.doc
  10%2E1371%2Fjournal%2Epone%2E000000040%2Es006.doc
  10%2E1371%2Fjournal%2Epone%2E000000040%2Es007.doc
  10%2E1371%2Fjournal%2Epone%2E000000040%2Es008.doc
  10%2E1371%2Fjournal%2Epone%2E000000040%2Es009.doc
  10%2E1371%2Fjournal%2Epone%2E000000040%2Es010.doc
  10%2E1371%2Fjournal%2Epone%2E000000040%2Es011.doc
  pmc.xml 
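The file naming in the "What we want" listing can be reproduced with a short sketch like the one below. It assumes that "urlescaped" means standard percent-encoding plus an explicit escape for '.', since the usual URL quoting leaves dots alone; the function names are illustrative only.

```python
from typing import Tuple
from urllib.parse import quote, unquote


def doi_to_filename(doi: str, ext: str) -> str:
    """Encode a DOI as a zip entry name; only the extension dot stays literal."""
    # quote() with safe="" escapes '/', but never '.', so dots are done by hand.
    encoded = quote(doi, safe="").replace(".", "%2E")
    return f"{encoded}.{ext}"


def filename_to_doi(name: str) -> Tuple[str, str]:
    """Inverse mapping: split off the extension, then percent-decode the rest."""
    encoded, ext = name.rsplit(".", 1)
    return unquote(encoded), ext
```

For example, doi_to_filename("10.1371/journal.pone.000000040.g001", "tif") gives the first name in the listing above.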

09/03/06 14:07:36 changed by ronald

Correction: the doi URI's should be "info:doi/10.1371/...", not "doi:10.1371/...". (this affects the links in the pmc.xml)

09/04/06 02:21:18 changed by ronald

(In [572]) Addresses #16: added some input validation. Currently this checks that all links in the pmc.xml are either absolute non-doi URI's, or absolute or relative doi URI's in which case if they have the article's doi as a prefix then the item must exist in the zip.
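For readers who want to see the shape of that check, here is a sketch only, not the code from [572]. It assumes an absolute doi URI starts with "info:doi/" (per the correction above) and that a "relative doi" is a bare DOI such as "10.1371/..."; the regex, function name and messages are assumptions made for this example.

```python
import re
from typing import Optional, Set
from urllib.parse import urlsplit

DOI_URI_PREFIX = "info:doi/"
# Assumption: a "relative doi" is a bare DOI without the info:doi/ prefix.
BARE_DOI = re.compile(r"^10\.\d+/\S+$")


def check_link(href: str, article_doi: str, dois_in_zip: Set[str]) -> Optional[str]:
    """Return an error message for a bad link, or None if the link is acceptable."""
    if href.startswith(DOI_URI_PREFIX):
        doi = href[len(DOI_URI_PREFIX):]
    elif BARE_DOI.match(href):
        doi = href
    elif urlsplit(href).scheme:
        return None  # absolute non-doi URI, nothing more to check
    else:
        return f"not an absolute URI or a doi: {href}"
    # Only DOIs under the article's own DOI must be backed by a zip entry.
    if doi != article_doi and doi.startswith(article_doi) and doi not in dois_in_zip:
        return f"no entry in zip for {doi}"
    return None
```

With this rule, "www.redgreengene.com" is rejected because it has no scheme, while "http://www.redgreengene.com/" passes.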

09/05/06 00:34:19 changed by ronald

  • keywords changed from AIP SIP DIP format to AIP SIP DIP ingest.

09/08/06 18:27:03 changed by ronald

Latest drop from AP, pone.0000001.zip, seems to contain just one change: they removed the directories. Other than that it has the same problems as mentioned above.

09/08/06 18:27:35 changed by ronald

  • milestone changed from september10 to september24.

09/11/06 22:10:33 changed by amit

I got the following email from AP:

Hello Amit,

My name is Duncan Eshelman and I am the Project Manager for PLoS One on the
Allen Press end of things. Please feel free to direct PLoS One specific
questions to my attention.

In answer to your first question, I can provide the following examples of
the file naming conventions in place now:

Naming Conventions
  * Zip file named pone.000000001.zip

  * article xml: pone.000000001.xml

  * article pdf: pone.000000001.pdf

  * graphics: article doi + .g + .incremented 3 digit number + file
    extension
               pone.000000001.g001.tif
               pone.000000001.g002.tif

  * tables: article doi + .t + .incremented 3 digit number + file
    extension
               pone.000000001.t001.tif
               pone.000000001.t002.tif

  * equations: article doi + [.ex for inline - .e for display] +
    .incremented 3 digit number + file extension
               pone.000000001.e001.gif
               pone.000000001.e002.gif

  * multimedia (non-supporting): article doi + .v + .incremented 3 digit
    number + file extension
               pone.000000001.v001.avi
               pone.000000001.v002.mov

  * supporting info: article doi + .s + .incremented 3 digit number +
    file extension
               pone.000000001.s001.xls
               pone.000000001.s002.zip


So in the case of multiple xml files within the package, only the main xml
file will be named as pone.000000001.xml (or pone.000000002.xml, etc.)
Supporting information xml files will be named pone.000000001.s001.xls.

In regards to your second question, would you mind providing a brief example
of what you mean by changing the links within the documents to doi's?

Thanks,

Duncan Eshelman

Ronald, this means that we have to go in and modify the links during the ingest process to the doi:.... This should get us most of the way there as I will let them know to insert http: for the links.
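To make the rewrite concrete, here is a sketch of how an AP file name could be recognized and turned into a doi URI. The suffix letters come from the email above; the "10.1371/journal." prefix is inferred from the "What we want" listing earlier in this ticket, and the "info:doi/" scheme from the 09/03 correction. The regex and function names are assumptions for illustration, not part of the ingest code.

```python
import re
from typing import Optional

# Suffix letters as described in the naming conventions email.
KINDS = {
    "g": "graphic",
    "t": "table",
    "e": "display equation",
    "ex": "inline equation",
    "v": "multimedia",
    "s": "supporting info",
}

AP_NAME = re.compile(
    r"^(?P<article>[a-z]+\.\d+)"                  # e.g. pone.000000001
    r"(?:\.(?P<kind>ex|[gtevs])(?P<num>\d{3}))?"  # e.g. .g001 (absent for the article)
    r"\.(?P<ext>[A-Za-z0-9]+)$"                   # file extension
)

# Assumption: prefix inferred from the example layouts in this ticket.
DOI_PREFIX = "10.1371/journal."


def ap_kind(name: str) -> Optional[str]:
    """Classify an AP file name, or return None if it does not fit the scheme."""
    m = AP_NAME.match(name)
    if m is None:
        return None
    kind = m.group("kind")
    return KINDS[kind] if kind else "article"


def ap_name_to_doi_uri(name: str) -> Optional[str]:
    """Map an AP file name to an absolute doi URI by dropping the extension."""
    m = AP_NAME.match(name)
    if m is None:
        return None
    stem = name[: -(len(m.group("ext")) + 1)]
    return "info:doi/" + DOI_PREFIX + stem
```

Names that use dashes, such as "pone-000000040-sg001.pdf" from the broken links, do not match and come back as None.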

09/20/06 18:11:50 changed by ronald

(In [660]) Addresses #16: more hacks to deal with AllenPress:

  • search for the article xml, trying 'pmc.xml' first, then '[a-z]+\.\d+\.xml'
  • all entries must have the article entry's name (minus extension) as a prefix
  • links may directly point to an entry in the zip; so allowed links are absolute URI's, relative URI's that match a zip entry's name directly, or relative DOI's

09/21/06 00:50:47 changed by amit

I will be talking to Allen Press tomorrow about the packages they have been dropping off for us to test. Talked to Rich today; in a nutshell, their internal process with respect to changes is not clear. If they have documents in the pipeline going through the JMS and Composition system, it is not clear whether fixes can be applied to the entire pipeline or just at a checkpoint in time. Bottom line is that the latest packages are not up to their own spec and need to be fixed. That also means they will have to go back to previous packages and fix them.

Not sure how much more we can do on this cycle. Will have to push this to the next depending on the call tomorrow (just realized it is today now).

10/02/06 16:34:35 changed by ronald

  • milestone changed from september24 to october16.

Current ingest status is: we handle AP's format, except that we don't change the article yet.

What is needed: the links to embedded objects (images, tables, etc) in the articles need to be rewritten to full doi-uri's.

10/08/06 22:08:57 changed by ronald

(In [769]) Addresses #16: added support for using the name of the zip archive to help find the article entry in the zip.

10/15/06 19:26:45 changed by ronald

(In [787]) Addresses #16 and #183: many fixes and cleanups:

  • Cleaned up rdf generation: all rdf is generated in one place now, and a filter is used to split dc/non-dc for fedora ingest
  • Ensure no unsupported xsd datatypes are ingested into fedora.
  • dc:description now contains a copy of the abstract contents (i.e. including all the elements), instead of just the string-value.
  • Fixed href-to-doi for relative links
  • Added more types to variables
  • Use xsl:sequence instead of xsl:copy-of in most places; this allows for better node tests using 'is' instead of '='

10/18/06 00:42:51 changed by ronald

(In [805]) Addresses #16: numerous fixes, enhancements, and cleanups to article ingest:

  • Added ability for ingest script to generate new content for datastreams; this can either be inline as the content of the <Datastream> element, or written separately using an <xsl:result-document>, or even an arbitrary url.
  • Ingest now fixes up the links in the article so all links to secondary objects are proper absolute info:doi/... uri's.
  • URI's in RDF are now all info:doi/... uri's; previously they were info:fedora/... to appease Fedora's ResourceIndexer.
  • Fixed escaping of characters in DOI-to-PID and DOI-to-URI mappings.
  • Fixed search algorithm for article entry; it now searches for 'pmc.xml', <zip-file-name>.xml, and shortest filename ending in .xml, in that order.
  • Fixed ignoring of directories in zip.
  • Fixed error handling/reporting.
  • Added test cases for AP formatted zips and for invalid zips.
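The search order for the article entry described in the list above can be sketched as follows. This is an illustration of the stated order only ('pmc.xml', then <zip-file-name>.xml, then the shortest name ending in .xml); the function name and signature are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def find_article_entry(zip_path: str, entry_names: List[str]) -> Optional[str]:
    """Pick the article entry using the three-step search order."""
    names = [n for n in entry_names if not n.endswith("/")]  # skip directories
    if "pmc.xml" in names:
        return "pmc.xml"
    # e.g. pone.000000001.zip -> pone.000000001.xml
    by_zip_name = PurePosixPath(zip_path).stem + ".xml"
    if by_zip_name in names:
        return by_zip_name
    xml_names = [n for n in names if n.endswith(".xml")]
    # Shortest .xml name wins; supporting-info XML has a longer name.
    return min(xml_names, key=len) if xml_names else None
```

The last step relies on supporting files carrying the article name as a prefix, so the article itself always has the shortest .xml name.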

10/18/06 02:00:05 changed by ronald

  • milestone changed from october16 to TBD.

To summarize the current state of affairs: ingest supports two formats, the Topaz format and AllenPress' format. Both are delivered as zip archives.

The topaz format is as described above:

  • The article must reside in a file called pmc.xml
  • All other files must be named <encoded-doi>.<ext>, where <encoded-doi> is the url-encoded doi of that object.
  • All links in the article must be absolute; links to any of the objects in the zip must be of the form info:doi/<doi>.

The AllenPress format is:

  • The shortest filename ending in .xml is presumed to be the article.
  • All other files must have the article filename (minus the extension) as a prefix.
  • All links in the article must either be absolute or must refer to a filename that exists in the archive.

Additional common notes:

  • Any directories in the archive are ignored.
  • All files in the archive (except for the article itself, including any alternate formats such as PDF) must be referenced by some link in the article; i.e. no orphans.
  • Objects must be present in the archive for all links which refer to such objects. In the case of the topaz format this means all links that start with info:doi/<article-doi>, and in the case of the AllenPress format this means all relative links.
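The two consistency rules (no orphans, no dangling links) boil down to set differences. The sketch below covers the AllenPress format only and assumes that the article's alternate representations are the entries sharing the article's name minus extension (e.g. the PDF); that reading, and all names in the code, are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def check_references(
    entries: Iterable[str], article_entry: str, linked: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (orphans, missing) for an AllenPress-style archive.

    `linked` holds the relative links found in the article, each of which
    is expected to name an entry in the archive.
    """
    entries = set(entries)
    linked = set(linked)
    stem = article_entry.rsplit(".", 1)[0]
    # The article and its alternate formats do not need to be linked.
    exempt = {e for e in entries if e.rsplit(".", 1)[0] == stem}
    orphans = sorted(entries - exempt - linked)
    missing = sorted(linked - entries)
    return orphans, missing
```

An archive passes when both returned lists are empty.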

10/24/06 02:57:52 changed by ronald

(In [841]) Addresses #16: handle more AP brokenness: the paths in zip archives are supposed to be relative paths, but AP somehow manages to put absolute paths in there.
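One way to tolerate such entries is to reduce every entry name to its final path component. The sketch below is not the fix from [841]; it assumes that flattening is acceptable because all files are expected to sit in one directory anyway, and the function name is hypothetical.

```python
import posixpath
from typing import Optional


def normalize_entry_name(name: str) -> Optional[str]:
    """Reduce a zip entry name to a bare file name; None for directory entries."""
    name = name.replace("\\", "/")  # tolerate Windows-style separators
    if name.endswith("/"):
        return None  # directory entries are ignored
    # Drops any leading '/' and directory components alike.
    return posixpath.basename(name)
```

Flattening would hide name clashes between directories, so a real implementation might want to detect duplicates after normalizing.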

10/24/06 03:01:51 changed by ronald

(In [842]) Addresses #16: fix ordering of nextObject links and of associated titles and descriptions. The secondary-object-refs list was supposed to be in document order, but was in zip-archive order.

Found and reported by Steve and Viru.
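To illustrate the difference between the two orderings: document order is the order in which links first appear when walking the article XML, which has nothing to do with how entries happen to be stored in the zip. A minimal sketch, assuming links are carried in xlink:href attributes as in the examples earlier in this ticket (the function name is made up):

```python
import xml.etree.ElementTree as ET
from typing import List

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def refs_in_document_order(article_xml: str) -> List[str]:
    """Collect xlink:href values in document order, keeping first occurrences."""
    refs: List[str] = []
    # Element.iter() walks the tree depth-first, i.e. in document order.
    for elem in ET.fromstring(article_xml).iter():
        href = elem.get(XLINK_HREF)
        if href is not None and href not in refs:
            refs.append(href)
    return refs
```

Building the secondary-object list from this walk, rather than from the zip's entry list, keeps the nextObject chain aligned with the article.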

06/28/07 11:14:47 changed by amit

  • status changed from assigned to closed.
  • resolution set to fixed.

This has become too broad a category to follow within a single ticket. Closing this with the hope that more specific tickets will be created from now on.

08/07/07 16:25:51 changed by

  • milestone deleted.

Milestone Bugs deleted