Endpoints
Parsing
Response Format
The response format of every parsing endpoint is the same and represents the article with a json schema.
{
"title": "This is the title of the article.",
"authors": [
"Author 1",
"Author 2",
"Author 3"
],
"abstract": [
"This is the paragraph 1 of the abstract.",
"This is the paragraph 2 of the abstract."
],
"section_paragraphs": [
["Section 1", "Paragraph 1"],
["Section 2", "Paragraph 2"]
],
"pubmed_id": "PUBMED_ID",
"pmc_id": "PMC_ID",
"arxiv_id": "ARXIV_ID",
"doi": "DOI",
"uid": "UNIQUE_ID",
"date": "2017-03-01",
"journal": "JOURNAL",
"figures": [
"Figure 1",
"Figure 2"
],
"tables": [
"Table 1",
"Table 2"
],
"article_type": "Article Type"
}
Fields properties
Fields parameter | Description |
---|---|
title |
Title of this article. |
authors |
A list of authors. |
abstract |
The list of paragraphs of the abstract for this article. |
section_paragraphs |
The list of paragraphs with their corresponding section name. |
pubmed_id |
A unique identifier used in the PubMed database. |
pmc_id |
A unique identifier used in the PubMed Central (PMC) database. |
arxiv_id |
A unique identifier used in the ArXiv ID database. |
doi |
Digital Object Identifier of this article. |
uid |
Unique identifier of this article given based on the hash of the file. |
date |
Publication date of this article. |
journal |
The journal International Standard Serial Number of the article. |
figures |
Figures list. |
tables |
Tables list. |
article_type |
Type of the article. |
Parsing XMLs
Calling the endpoint
All XMLs parsing endpoints are following the same schema. One just needs to specify/change the right schema at the end of the endpoint path.
Schemas
JATS XML
If one needs to parse an article following the JATS-XML schema, this is the endpoint that has to be used. This is the case of scientific papers coming from Pubmed Central (PMC).
PubMed XML
If one needs to parse an article following the PubMed schema, this is the endpoint that has to be used.
This is the case of scientific papers coming from PubMed. Note that
the pubmed xml files contain only the abstract. The field section_paragraphs
is going to be empty.
TEI XML
If one needs to parse an article following the TEI-XML schema, this is the endpoint that has to be used. This is the case of scientific papers coming from ArXiv. This is also the schema returned by Grobid server.
XOCS XML
If one needs to parse an article following the XOCS-XML schema, this is the endpoint that has to be used. This is the case of scientific papers coming from scopus.
Parsing PDFs
The package contains two endpoints to parse PDFs (PyPDF and Grobid).
PyPDF
Calling the endpoint
import requests
import json
url = "https://localhost/parse/pypdf"
files = [
(
"inp",
(
"test.pdf",
open(test_data_path / "test.pdf", "rb"),
"application/pdf",
),
)
]
# This line is not mandatory, but user can change the chunk_size if wanted.
data = {"chunk_size": 500}
response = requests.request("POST", url, files=files, data=data)
print(response.json())
Grobid PDF
Calling the endpoint
import requests
import json
url = "https://localhost/parse/grobidpdf"
files = [
(
"inp",
(
"test.pdf",
open(test_data_path / "test.pdf", "rb"),
"application/pdf",
),
)
]
# This line is not mandatory, but user can change the parameters if wanted.
data = {"data": '{"param1": "value1", "param2": "value2"}'}
response = requests.request("POST", url, files=files, data=data)
print(response.json())