Chris Stubben
2014-09-03 18:16:53 UTC
I just noticed this discussion and wanted to suggest a package on GitHub
(https://github.com/cstubben/pmcXML) that might be helpful. I really need
to clean up the code, but the main objectives are to read PMC Open Access
articles into XMLInternalDocuments and then parse 1) metadata, 2) text,
3) tables, and 4) supplements (technically not in the XML, but available
on the FTP site or via links in the XML). Specifically, I parse the text
into a list of subsections (a vector of paragraphs per subsection) and use
the full path to each subsection title for the list names. This is easy to
convert into a tm Corpus, write to a Solr XML file for importing, or split
into sentences for searching with a function like sentDetect from openNLP.
I read tables into a list of data.frames, and I have been working on code
to improve readHTMLTable by repeating subheaders, filling row and column
spans, correctly parsing multi-row headers, and so on.
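A minimal sketch of the subsection-parsing idea (not pmcXML itself, just an illustration using the XML and tm packages on a toy document): collect paragraphs grouped under each section title, then wrap the text in a tm Corpus.

```r
library(XML)
library(tm)

# Toy stand-in for a PMC article body
doc <- xmlParse('
<article>
  <sec><title>Results</title>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </sec>
</article>')

# One list element per section: a character vector of its paragraphs,
# named by the section title
secs <- getNodeSet(doc, "//sec")
subsections <- lapply(secs, function(s) xpathSApply(s, ".//p", xmlValue))
names(subsections) <- sapply(secs, function(s) xpathSApply(s, "./title", xmlValue))

# Flatten into a tm corpus, one document per paragraph
corpus <- VCorpus(VectorSource(unlist(subsections)))
```

For a real PMC article the XPath would need to handle nested `<sec>` elements (hence the full-path titles described above); this sketch only covers the flat case.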
Our main use case is that we may know of ~500 papers on some non-model
microbial organism and we'd like to build a local Solr collection for
enhanced searching, so we need to index all the text, tables, and
supplements. The main problem is getting the papers that are not in PMC
into Solr, so developing new packages to work with our local copies of
XML docs from Elsevier, or even HTML from other publishers, or PDFs,
would be great. Also, for PDFs, I have been working on code to convert
them to Markdown, which I can then read into R and convert to Solr
import files.
Please let me know how I can help, thanks.
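The last step above (turning text chunks into a Solr import file) can be sketched in a few lines; the field names "id" and "text" here are assumptions about the local Solr schema, and a real version would need to escape XML entities in the text.

```r
# Hypothetical helper: write a named character vector of text chunks
# to a Solr XML update file (<add><doc>...</doc></add> format)
solr_xml <- function(chunks, file) {
  docs <- paste0(
    '  <doc>\n',
    '    <field name="id">', names(chunks), '</field>\n',
    '    <field name="text">', chunks, '</field>\n',
    '  </doc>'
  )
  writeLines(c("<add>", docs, "</add>"), file)
}

# Example: one chunk keyed by its subsection path
chunks <- c("Results" = "Gene expression increased.")
f <- tempfile(fileext = ".xml")
solr_xml(chunks, f)
```

The resulting file can then be posted to Solr's update handler in the usual way.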
Chris
--
You received this message because you are subscribed to the Google Groups "ropensci-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ropensci-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
For more options, visit https://groups.google.com/d/optout.