[ToDo: COS] All but three items in ASAPbio’s requirements are complete; the outstanding items are ii, iii, and vi from Section 2 above. We will integrate with Crossref metadata for easier discovery of other versions. Once full text is included, search results will be displayed with surrounding context. OSF’s rendering system supports conversion pipelines, and a UI for downloading in alternative formats will be added. For item ix, we will conduct an accessibility review.
Provide an alerting tool that delivers emails or other notifications to authors when content of interest (matching a keyword or author, citing their article, etc.) appears.
[Done: COS] Users can choose to receive notifications never, as events occur, or as a daily digest. Notifications are available for behaviors related to commenting and adding files. Users control notification settings at both the user and project level, so authors can elect to receive notifications for some preprints and not others.
[ToDo: COS] Add subscriptions to notifications by author, keyword, discipline, and service. Surface the notification controls for one’s own preprints more prominently.
Respondents are also invited to highlight other functionality they would suggest the site should support, although all features and functionality of the CS will require Governance Body approval.
See section 9 (page 20)
Preprints Commons Machine Interface
API for manuscripts
Make available the full text of preprints (converted version and author’s original files) by RESTful API and bulk download
OAI-PMH endpoint, or equivalent if standards change over time
Capture metrics on API use
[Done: COS] A RESTful API is available for hosted preprints; converted output will be added.
[ToDo: COS] For full text harvested from other services, a similar API will be available. Bulk downloads will be accessible as time-based archive files, torrents, and, potentially, rsync. An OAI-PMH endpoint is due for release in May 2017. A new metrics API aligned with community efforts (e.g., COUNTER) is scheduled for Q3 to support more robust queries.
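To illustrate how downstream services might consume the planned OAI-PMH endpoint, the following is a minimal sketch of parsing a ListRecords response. The namespaces follow the OAI-PMH 2.0 and Dublin Core specifications; the sample record, identifier, and title are illustrative, not actual Commons output.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 specification and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A trimmed-down ListRecords response; a real one would come from the endpoint.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:preprint/abc12</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Preprint</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def parse_list_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        title = rec.findtext(".//dc:title", namespaces=NS)
        records.append((ident, title))
    return records

print(parse_list_records(SAMPLE_RESPONSE))
```

A harvester would page through real responses via `resumptionToken`, which is omitted here for brevity.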
Screening training set
Make available all manuscripts (both those that passed screening and those that did not) and their screening histories for bulk download, with any sensitive information redacted. This corpus will be used to train automated screening algorithms.
[ToDo: COS] We need clarification of the requirements for this item. Creating a full dataset will be straightforward if preprint services are willing and able to share full-text of approved and rejected manuscripts.
We also need clarification on how automated screening algorithms can be trained if sensitive information is redacted but that redacted information is the basis for rejection. Our operating assumption is that the training dataset will remain secure, and that researchers developing screening algorithms would be granted access upon approval.
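As a minimal sketch of the redaction step described above, the following scrubs two obvious classes of sensitive substrings. The patterns and placeholder tokens are illustrative assumptions; a production pipeline would need a far broader set (names, addresses, institutional identifiers, etc.) and human review.

```python
import re

# Illustrative redaction patterns; these are assumptions, not the
# actual Commons redaction rules.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]

def redact(text):
    """Replace sensitive substrings with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

print(redact("Contact jane.doe@example.org or +1 (555) 123-4567."))
```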
Manuscript Conversion Tool
Convert the full text of manuscripts to XML or equivalent, tagged according to JATS4R standards. As a minimum it should be possible to convert a .docx or .tex file to XML. The feasibility of converting a .pdf file to XML should be discussed.
[Done: Partners] Manuscripts performs document import and export with a cross-platform component, executable as a web service, that orchestrates pre-existing open-source and custom document transformation steps. Heuristics infer and correct format-specific issues in Word, LaTeX, and Markdown documents. A dozen input formats are supported (DOC, DOCX, DocBook XML, Evernote XML, EPUB, LaTeX, Markdown and its variants, ODT, OPML, reStructuredText, Microsoft-style RTF, Textile).
The document importers support scholarly needs: footnotes, endnotes, equations, and cross-references are preserved; for .docx, embedded citation metadata is processed from EndNote, Papers, Zotero, and Mendeley formatted field codes, and tracked changes are automatically applied. Citations, equations, figures, and tables with their captions are captured from LaTeX and converted to XML. The implementation has been hardened against a large collection of user-provided documents.
[ToDo: Partners] Prepare for open sourcing. We expect to achieve basic PDF-to-XML extraction: indexing-grade text, extracted figures, and the document outline. Experience to date suggests that efforts to infer structure in PDF documents beyond this carry high implementation risk and diminishing returns: PDF is a format for drawing on a printed page, so translating it into structured rich text improves indexing quality only incrementally and will not yield XML from which a readable document can be produced.
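To make the conversion target concrete, the following sketch assembles the kind of minimal JATS front matter a converter would emit for a title and one author. The element names follow the JATS tag set, but this is a toy skeleton, not the Manuscripts implementation.

```python
import xml.etree.ElementTree as ET

def build_jats_skeleton(title, surname, given_names):
    """Assemble minimal JATS <article> front matter for one author."""
    article = ET.Element("article", {"article-type": "research-article"})
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")
    title_group = ET.SubElement(meta, "title-group")
    ET.SubElement(title_group, "article-title").text = title
    contrib_group = ET.SubElement(meta, "contrib-group")
    contrib = ET.SubElement(contrib_group, "contrib",
                            {"contrib-type": "author"})
    name = ET.SubElement(contrib, "name")
    ET.SubElement(name, "surname").text = surname
    ET.SubElement(name, "given-names").text = given_names
    ET.SubElement(article, "body")  # converted content would go here
    return ET.tostring(article, encoding="unicode")

xml = build_jats_skeleton("A Sample Preprint", "Doe", "Jane")
print(xml)
```

A real converter would additionally populate the `<body>` with sectioned content and validate against the JATS DTD and JATS4R rules.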
Allow authors to preview, proof, and modify the converted XML through an author-friendly interface
[Done: Partners] The editing experience is fully WYSIWYG. The format produced by the editor follows the JATS4R recommendations, and document validity is enforced continuously as user edits are persisted to a version history. In addition to JATS4R, Manuscripts can provide HTML5 export, normalised to the Scholarly HTML tagging conventions drafted by the W3C Scholarly HTML Community Group.
[ToDo: Partners] These services will be adapted to allow preprint authors to preview, proof, and modify the converted XML. The editing experience will be integrated with the preprints infrastructure for accounts, storage, and notifications.
The manuscript conversion tool must be able to interoperate with a wide range of preprint servers and journals that operate on different technological platforms.
[ToDo: Partners] The service will be modularized so that preprint services can integrate it into their own workflows. We will expose the PressRoom document conversion service via a REST API.
The conversion process could occupy several different positions in the pipeline of manuscripts, and the preferred position will depend on input from the GB and the ingestion sources. Describe how you would implement the following options (and any others that you foresee will be compatible with many ingestion sources) and indicate which you see as the preferred option.
All material coming into the Commons from ingest sources will be ingested as structured XML. In this case, the conversion tool is offered as a software or hosted as a Commons service accessible by API, and individual ingest sources are responsible for implementing it or generating XML by alternative means.
Material coming from ingest sources can be provided to the Commons as the author’s original manuscript file (such as .docx or .tex). In this case, the Commons would convert the manuscript and contact authors (e.g., by email) to invite them to proof a rendition of the manuscript after conversion to XML. The ingest source could later retrieve the converted manuscript from the Commons.
The conversion tool could function upstream of ingest sources. A submission tool hosted by the CS could convert manuscripts for author proofing and then provide options for authors to send the converted manuscripts (or original files, or PDFs) to other ingest sources.
[ToDo: Partners] A RESTful document conversion web service will accept any of the 14 supported input file formats and provide HTML5 or JATS XML as output. The offering includes the document conversion service and a native desktop application for proofing.
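As a sketch of how an ingest source might call such a conversion API, the following prepares the pieces of a multipart POST without sending it. The endpoint URL, field names, and output-format values are hypothetical; the published PressRoom API may differ.

```python
import mimetypes
from pathlib import Path

# Hypothetical endpoint; the production URL and parameters may differ
# once the API is published.
CONVERSION_URL = "https://pressroom.example.org/api/v1/convert"

def build_conversion_request(manuscript_path, output_format="jats"):
    """Prepare the pieces of a multipart POST for the conversion service."""
    if output_format not in ("jats", "html5"):
        raise ValueError("unsupported output format: " + output_format)
    path = Path(manuscript_path)
    content_type = (mimetypes.guess_type(path.name)[0]
                    or "application/octet-stream")
    return {
        "url": CONVERSION_URL,
        "data": {"output": output_format},
        "file_field": ("manuscript", path.name, content_type),
    }

req = build_conversion_request("paper.docx", "jats")
print(req["url"], req["data"])
```

The returned pieces would be handed to an HTTP client (e.g., `requests.post(url, data=..., files=...)`) by whichever ingest workflow hosts the call.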
Automated Screening Tool
Flag manuscripts whose characteristics are similar to those of manuscripts that have not passed screening. Initial factors may include single authorship, presence of certain keywords, lack of scientific writing style, image style, lack of biological subject matter, presence of human faces, etc.
[ToDo: COS] As the corpus of preprints builds in a format conducive to programmatic access (e.g., JATS), properties of the preprints will be extracted and persisted. As submissions are determined to be spam, non-scientific, etc., these properties will be associated with the reason for disqualification. This data will be used to compare incoming submissions as part of the submission pipeline, flagging for human intervention both distribution outliers and manuscripts similar to those previously disqualified (compared using multiple metrics suitable for, e.g., random forest models and support vector machines). We will ask peer preprint services for access to their screened-out manuscripts for the corpus, where they are licensed to share them.
Early comparison data will come from metadata currently harvested by SHARE and from full text collected for use in the Commons aggregated search and machine-learning corpus. Heuristics and targeted identification will flag suspect submissions. Face detection (e.g., using OpenCV) will be run against extracted figures to notify moderators of potential ethical violations.
Lee Giles (PSU; CiteSeerx) will consult for these tasks, especially for advanced use-cases.
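The heuristic stage described above can be sketched as a simple flagging function over a submission record. The signal names, keyword list, and thresholds here are illustrative assumptions; the real pipeline would derive them from the training corpus and learned models.

```python
# Illustrative spam keywords; the real list would be derived from
# the screening corpus, not hand-picked.
SPAM_KEYWORDS = {"casino", "viagra", "free money"}

def screening_flags(manuscript):
    """Return a list of human-review flags for one submission record."""
    flags = []
    if len(manuscript.get("authors", [])) <= 1:
        flags.append("single-author")
    text = manuscript.get("full_text", "").lower()
    if any(kw in text for kw in SPAM_KEYWORDS):
        flags.append("spam-keyword")
    if manuscript.get("faces_detected", 0) > 0:
        flags.append("human-face-in-figure")
    return flags

example = {
    "authors": ["A. Smith"],
    "full_text": "Win free money now!",
    "faces_detected": 0,
}
print(screening_flags(example))  # → ['single-author', 'spam-keyword']
```

Flags would route a submission to a human moderator rather than reject it automatically, matching the human-intervention model described above.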
Use external services (like plagiarism detection) but ideally develop a viable open source alternative in the long term.
[ToDo: COS] Membership in Crossref supports using its Similarity Check service (powered by iThenticate) to flag suspected plagiarism. Other screening tools will be made open source, but for plagiarism in particular an open alternative would require a full-text corpus that is most likely impractical to license from publishers, compared with using Similarity Check, which is pre-negotiated for Crossref members.
Provide service accessible by API
[ToDo: COS] Pipelines created for screening will be accessible by API.