Discussion:
pmcXML and fulltext
Chris Stubben
2014-09-03 18:16:53 UTC
I just noticed this discussion and wanted to suggest a package on github
(https://github.com/cstubben/pmcXML) that might be helpful. I really need
to clean up the code, but the main objectives are to read PMC Open access
into XMLInternalDocuments, then parse 1) metadata 2) text 3) tables and 4)
supplements (technically not in XML, but included on ftp or via links in
XML). Specifically, I parse the text into a list of subsections (vectors
of paragraphs) and use the full path to each subsection title for the list
names. This list is easy to convert into a tm Corpus, write to a Solr XML
file for importing, or split into sentences for searching with a function
like sentDetect from openNLP. I also read tables into a list of
data.frames, and I have been working on code to improve readHTMLTable by
repeating subheaders, filling row and column spans, correctly parsing
multi-row headers, and so on.
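To make that data structure concrete, here is a minimal base-R sketch of the subsection list described above, with hypothetical content; sentDetect from openNLP would handle abbreviations far better than the naive regex split shown here:

```r
# Hypothetical parsed article: list names are the full paths to
# subsection titles, values are character vectors of paragraphs.
subsections <- list(
  "Results; Growth curves" = c(
    "Cultures grew rapidly. Growth slowed after 12 hours.",
    "Stationary phase began at 18 hours."
  ),
  "Discussion" = "These results match earlier reports."
)

# Naive sentence splitting for searching: break after ., !, or ?
# followed by whitespace.
sentences <- unlist(strsplit(unlist(subsections),
                             "(?<=[.!?])\\s+", perl = TRUE))
length(sentences)  # 4 sentences across the three paragraphs
```

From there, tm::Corpus(tm::VectorSource(unlist(subsections))) gives a corpus with one document per paragraph.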

Our main use case is that we may know of ~500 papers on some non-model
microbial organism and we'd like to make a local Solr collection for
enhanced searching, so we need to index all the text, tables and
supplements. The main problem is getting the papers that are not in PMC
into Solr, so developing new packages to work with our local copies of XML
docs from Elsevier, or even HTML from other publishers or PDFs, would be
great. Also, for PDFs, I have been working on code to convert these
to Markdown, which I can then read into R and convert to Solr import files.
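For reference, the Solr XML import file mentioned above is just Solr's standard update format: an <add> element containing one <doc> per indexed item, with one <field> per schema field (the field names below are hypothetical and depend on the collection's schema):

```xml
<add>
  <doc>
    <field name="id">PMC9999999-Results-p1</field>
    <field name="section">Results; Growth curves</field>
    <field name="text">Cultures grew rapidly.</field>
  </doc>
</add>
```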

Please let me know how I can help, thanks.

Chris
--
You received this message because you are subscribed to the Google Groups "ropensci-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ropensci-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
For more options, visit https://groups.google.com/d/optout.
Scott Chamberlain
2014-09-05 16:41:49 UTC
Hi Chris,

Thanks for reaching out. I assume by "this discussion", you meant the
fulltext package discussion?

I'll have a look at your repo. The Solr part sounds interesting. I
maintain an R client for Solr (https://github.com/ropensci/solr), perhaps
that could be useful for interacting with Solr. I actually don't think I
have any methods for writing to a Solr instance right now, but that would
be easy to add.
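As a sketch of what such a write method might look like, here is a base-R way to build a flat JSON update payload by hand and post it with curl; this assumes a local Solr instance with a hypothetical core named "papers", and does no string escaping:

```r
# Hypothetical: construct a flat Solr JSON update document.
solr_doc <- function(id, text) {
  sprintf('{"id": "%s", "text": "%s"}', id, text)
}

payload <- paste0("[",
                  paste(solr_doc("PMC9999999", "full text here"),
                        solr_doc("PMC9999998", "more full text"),
                        sep = ", "),
                  "]")
writeLines(payload, "docs.json")

# Then, assuming Solr is running locally with a core named "papers":
# curl 'http://localhost:8983/solr/papers/update?commit=true' \
#   -H 'Content-Type: application/json' --data-binary @docs.json
```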

It would also be good to include functionality in the fulltext package for
working with PDFs, so that could be something to add for sure.

The best way to jump in with fulltext is to look over the issues at
https://github.com/ropensci/fulltext/issues. We haven't written
much code yet, so much is still up in the air.

David Winter is maintaining the rentrez package
(https://github.com/ropensci/rentrez). I'll ping him to see if he has
anything to add to this discussion.

Cheers, Scott
David Winter
2014-09-05 17:12:17 UTC
Hi guys,

I don't have much to add at this point -- if pmcXML already handles parsing
out information from PMC records then it makes sense to use that
functionality.

I don't know if rOpenSci has run into a package with dependencies on
Bioconductor packages yet. I don't know if there is a nice way to handle
dependencies from both repositories if the fulltext package is on CRAN
(I'm sure it's solvable, I just don't know how :)

David
Chris Stubben
2014-09-05 21:30:01 UTC
Scott, David, and others,

Thanks for pointing out the issues page - I'm definitely interested in
tracking down other sources and considering the best data structures to
use, although right now I'd prefer getting full text into
XMLInternalDocument or HTMLInternalDocument and then add code to simplify
parsing into lists. HTML is always messy, and I have made some attempts at
PMC's HTML (not open access), Elsevier, SGM, ASM, Wiley, and a few others.
All of these have various restrictions, but that's constantly changing - I
think Elsevier will now allow text mining and even returning snippets of
text up to 200 characters with the DOI, which seems workable for Solr
queries.

This GitHub wiki page probably best describes what I've been trying to do:

https://github.com/cstubben/pmcXML/wiki/Parse-xml

Finally, the Bioconductor dependency could be removed. Basically, it was
added to expand locus tag ranges mentioned in full text, which requires
GFF files and other genome-related data.

Chris
Scott Chamberlain
2014-09-05 21:39:00 UTC
Chris,

I agree that there should be an option to get as raw a response as possible
instead of only giving back, e.g., a list. We could probably just allow
users to toggle what format they get data in.
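That toggle could be as simple as a format argument on the fetch function; the names below are a hypothetical sketch, not the actual fulltext API:

```r
# Hypothetical fetcher with a user-facing format toggle.
get_paper <- function(id, format = c("parsed", "raw")) {
  format <- match.arg(format)
  # Stand-in for an actual download:
  raw <- sprintf("<article><id>%s</id></article>", id)
  if (format == "raw") return(raw)
  list(id = id)  # stand-in for a parsed result
}

get_paper("PMC9999999", format = "raw")
```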

In terms of data sources, there's an issue for that:
https://github.com/ropensci/fulltext/issues/4#issuecomment-52376743;
that comment specifically holds the master list so far. We can add
things to that list as needed.

Scott