iSchool alum collaborates on "Jailbreaking the PDF"

Casey McLaughlin

Technology has expedited advancements in the Information Age as knowledge and data continue to be shared throughout the world with the click of a button.

However, information produced from research studies is not always easily accessible for analysis in the world of academia.

While working on projects at the Institute for Digital Information and Scientific Communication (iDigInfo) at Florida State University’s College of Communication & Information, alum Casey McLaughlin and his fellow researchers knew what information they needed to analyze, but the data was “locked up” in Portable Document Format (PDF) files.

iDigInfo had received grant funding from the U.S. Department of Defense to establish the Department of Defense Military Suicide Research Consortium (MSRC) along with the Denver Veterans Affairs Medical Center. iDigInfo is responsible for the Information Management/Scientific Communications Core of the project, whose goals include a rapid response function so that queries from decision makers and others in the MSRC are answered in an efficient and timely manner.

“The institute’s portion of that funding was to take what was already in existence and make it more available to researchers,” said McLaughlin, who worked at iDigInfo from 2011-13.  “There’s a lot of literature out there already, but how do you find it?”

The Institute collected over 6,000 research papers about suicide and was tasked with figuring out how to expose the information in innovative ways.

“We started dreaming up this type of interface that would allow users to search for specific research parameters and it would spit out the information from the collection,” McLaughlin said.  “In order to enable doing fine-grain queries for research papers you have to expose the information in the paper in a way it can be queried.  Research publications can be queried in certain general terms (metadata), but we want to be more precise than that.”

Because research papers are typically presented as PDFs, computers cannot interpret the meaning of the data in the documents.

“We wanted to convert the PDFs into a format that could be queried,” McLaughlin said.  “We looked at a bunch of existing technologies that did a good job at looking at information you could query (like machine-readable XML) and wanted to convert PDFs to these formats.  It seemed like a pretty straightforward thing to do … but it’s not.”

After McLaughlin did a little detective work, it turned out iDigInfo was not the only group of researchers with this problem.

“It turned out there are a lot of people out there who are interested in this,” McLaughlin said.  “In just about any domain of research they would like to take advantage of computers to get information out of research papers a lot more easily.”

McLaughlin joined Peter Murray-Rust, a chemist at Cambridge, and Alexander Garcia, a bioinformatics researcher, to find a solution.  The group even held a PDF Hackathon with other researchers last May while attending the Extended Semantic Web Conference in France.

Through this process, McLaughlin wrote an application called Xtract PDF that partially exposes the information in a paper, although its output still needs some human editing before it can be queried.

“It’s mainly of interest to a small community of researchers, but it is an international community that is very passionate about transforming the way research papers are written and research papers are exposed,” McLaughlin said.

McLaughlin received both his Bachelor of Science and Master of Science in Information Technology from Florida State, and worked for the School of Information’s Help Desk and then the Director of Application Development. In the latter role he developed the School’s iSpace application, which lets students showcase their portfolios and experiment with code.  He now works at Florida State’s Research Computing Center, where he manages user support, documentation and communication.