Application Deployment
The `scholarag` package contains an application that one can deploy. To benefit from all of its functionality, some additional infrastructure needs to be deployed as well:
- One needs first to deploy the `scholarag` application itself.
- A database containing the text content from scientific articles. The package currently supports `OpenSearch` and `ElasticSearch` databases.
- (Optional) If the database needs to be populated, a tool to parse scientific articles is needed. This is the case of `scholaretl`, which is fully compatible with `scholarag`. If `scholaretl` is used and some scientific papers are saved in `pdf` format, one also needs to deploy a `grobid` server.
- (Optional) A database for caching. `Redis` is the only solution supported by `scholarag`.
To deploy everything, one first needs to fill in the two following environment variables in the `compose.yaml` file:
```
SCHOLARAG__DB__INDEX_PARAGRAPHS=TO_BE_SPECIFIED
SCHOLARAG__GENERATIVE__OPENAI__TOKEN=TO_BE_SPECIFIED
```
Once this is done, one can simply run the `docker compose up` command from the root folder of the package:
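```bash
docker compose up
```

This spawns the following containers: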
- `scholarag-app`, containing the application, which can be reached at `localhost:8080`.
- The OpenSearch database, deployed under `opensearchproject/opensearch:2.5.0`, called `scholarag-opensearch-1` and reachable on `localhost:9200`.
- The grobid server `grobid/grobid:0.8.0`, called `scholarag-grobid-1` and reachable on `localhost:8070`.
- The Redis instance, called `scholarag-redis-1`, reachable on `localhost:6379`.
- The ETL application `ETL IMAGE`, called `scholarag-etl-1`, reachable on port `9090`.
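One can check that all containers are running with, for instance:

```bash
docker ps --filter name=scholarag
```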
To destroy everything, one can simply use the following command:
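```bash
docker compose down
```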
If one keeps the volumes setup inside the `compose.yaml` file, the data inside the `opensearch` database will persist between sessions.
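Conversely, to delete this data as well, the volumes can be removed together with the containers:

```bash
docker compose down --volumes
```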
Database population
To populate the database with data and to use all the functionalities of the `scholarag` application, two indices need to be created:
- One containing the text content to use for question answering.
- If the texts come from scientific papers, one containing the impact factors of the different scientific journals.
Both indices can be created and populated through two scripts (also deployed as endpoints) available in the `scholarag` package.
For the first index, the script is `parse_and_upload.py` (`pmc_parse_and_upload` can also be used if one wants to upload PMC papers to the database). For the second, one can launch `create_impact_factor_index.py`. Please refer to the scripts' documentation for further information. Both can be launched locally, or after spawning a new docker container with the package installed using the following command line:
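(A minimal sketch; the image tag `scholarag` is an assumption and should match however the image was built locally.)

```bash
docker run -it --rm --network=host scholarag bash
```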
The flag `--network=host` is not mandatory, but it allows a user to easily connect to the deployed database by referring to it as `http://localhost:9200`.
As explained in the documentation, the script needs a `parser_url` as input. We recommend using `scholaretl`, a package fully compatible with `scholarag`, to populate the first index. The purpose of the package is indeed to parse scientific articles in different formats and schemas (XML and PDF).
Launching `docker compose` spawns this `scholaretl` application, which is then directly usable.
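As an illustration, a JATS XML article could be sent directly to the parsing endpoint used below. The multipart field name `file` and the sample file `article.xml` are assumptions; please check the `scholaretl` documentation for the exact request format:

```bash
# Hypothetical request: the form field name "file" is an assumption.
curl -F "file=@article.xml" http://localhost:9090/parse/jats_xml
```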
The population of the database can then be launched using the following commands (inside or outside a docker container):
```bash
pmc-parse-and-upload http://localhost:9200 http://localhost:9090/parse/jats_xml
create-impact-factors-index file_name impact_factors http://localhost:9200
```
By default, the paragraphs are uploaded to an index called `pmc_paragraphs`, but it can be changed by adding the flag `--index`.
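For example, to upload into a hypothetical custom index named `my_paragraphs`:

```bash
pmc-parse-and-upload http://localhost:9200 http://localhost:9090/parse/jats_xml --index my_paragraphs
```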
For the impact factors, if the script is launched inside the docker container, one first needs to copy the file containing the information into the container. To copy it, one can launch the command:
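(Both the file name `file_name`, taken from the command above, and the container name `scholarag-app` are assumptions; adjust them to your setup.)

```bash
# File and container names are assumptions; adjust to your setup.
docker cp file_name scholarag-app:/file_name
```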