
Pdf-reader would parse this into “This is a story my life got flipped all about how turned upside-down,” which led to issues when searching for multi-word phrases.

| This is a story | my life got flipped |
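
To make the failure mode concrete, here is a small plain-Ruby sketch (no PDF library involved) that models a two-column layout as rows of cells. The second row is inferred from the parsed output quoted above, and the variable names are purely illustrative. Reading row by row reproduces the jumbled sentence, while reading column by column keeps each phrase intact.

```ruby
# Each inner array is one visual line of the page: [left column, right column].
# The second row is an assumption reconstructed from the quoted output.
rows = [
  ["This is a story", "my life got flipped"],
  ["all about how",   "turned upside-down"]
]

# Line-by-line reading: left cell, right cell, next line, and so on.
row_wise = rows.flatten.join(" ")

# Column-by-column reading: all of the left column, then all of the right.
column_wise = rows.transpose.flatten.join(" ")

phrase = "story all about how"
puts row_wise.include?(phrase)    # false: the phrase is split across columns
puts column_wise.include?(phrase) # true
```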
Poppler installs as a standalone library. On (Debian-based) Linux: `apt-get install libgirepository1.0-dev libpoppler-glib-dev`. In a (Debian-based) Dockerfile: `RUN apt-get update && apt-get install -y libgirepository1.0-dev libpoppler-glib-dev`.
This worked great and here’s how to do it.
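
As a rough sketch of what the extraction can look like, assuming the `poppler` gem (the Ruby binding for Poppler) and the system packages it depends on are installed: the helper names `join_page_text` and `extract_pdf_text` are hypothetical, and the `Poppler::Document.new` / `get_text` calls reflect the gem's API as I understand it, so check them against the version you install.

```ruby
# Join the text of each page-like object. Anything that responds to
# #get_text works, which keeps this helper testable without a real PDF.
def join_page_text(pages)
  pages.map { |page| page.get_text.to_s.strip }.join("\n\n")
end

# Open a PDF with Poppler and return its embedded text as one string.
# Assumes Poppler::Document is enumerable over its pages.
def extract_pdf_text(path)
  require "poppler" # loaded lazily so join_page_text has no dependencies
  join_page_text(Poppler::Document.new(path))
end
```

Calling `extract_pdf_text("issue.pdf")` (a placeholder path) would then return a string suitable for handing to a search index.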

Our first attempt involved the pdf-reader gem, which worked admirably with the caveat that it had a little bit of trouble with multi-column / art-directed layouts 2, which was a lot of the content we were dealing with.

A bit of research uncovered Poppler, “a free software utility library for rendering Portable Document Format (PDF) documents,” which includes text extraction functionality and has a corresponding Ruby library.

A recent client request had us adding an archive of magazine issues dating back to the 1980s. Pretty straightforward stuff, with the hiccup that they wanted the magazine content to be searchable. Fortunately, the example PDFs they provided us had embedded text content 1, i.e. the text was selectable. The trick was to figure out how to programmatically extract that content.
