Note Taking with PDF Tools

NOTE: This post has been modified as of 2015-11-22 Sun – the new code is a little cleaner, and I think the discussion a little fuller.

Almost all of my job-related reading is now done on a screen. There are still disadvantages – I find it much harder to concentrate when reading online – but in other ways it is markedly more convenient.

In particular, it is now much easier to assemble quotations from sources; and now that I’ve found PDF Tools, it has become even easier. I’ve just started to use it to extract annotations from my PDF’s, and it works much better than the lousy command-line hack I was using previously.

As we’re mid-semester, most of my reading is for classes I teach. My current workflow is as follows:

  • Assemble the relevant readings in a Dropbox-synced directory (ClassName/Readings)
  • Using Repligo Reader (apparently no longer available in the app store?), highlight the passages I’m interested in.
  • execute code block (see below) to insert org headings with all highlights from one or more pdfs
  • Assemble reveal.js lecture presentation around those highlights, using org-reveal or Pandoc.

Activating PDF Tools

Begin by installing pdf-tools and org-pdfview from ELPA with package-list~packages or package-install.

Then make sure they are activated by adding these lines in your init file:

(eval-after-load 'org '(require 'org-pdfview))
(add-to-list 'org-file-apps '("\\.pdf\\'" . org-pdfview-open))
(add-to-list 'org-file-apps '("\\.pdf::\\([[:digit:]]+\\)\\'" . org-pdfview-open))

Switching to PDF Tools for annotating and extracting PDF’s

Last month Penguim proposed some changes in a pull request, that export annotations as a set of org headlines. It’s potentially very interesting but not quite what I want to do, so I modified this code. pdf-annot-markups-as-org-text extracts the text of an annotation (stored as the subject attribute in an alist), and also generates a link back to the page in the pdf. mwp/pdf-multi-extract is just a helper function that makes it easier to construct elisp source blocks the way I’m used to doing:

;; modified from 

(defun mwp/pdf-multi-extract (sources)
  "Helper function to print highlighted text from a list of pdf's, with one org header per pdf, 
and links back to page of highlight."
  (let (
        (output ""))
    (dolist (thispdf sources)
      (setq output (concat output (pdf-annot-markups-as-org-text thispdf nil level ))))
    (princ output))

;; this is stolen from
(defun pdf-annot-edges-to-region (edges)
  "Attempt to get 4-entry region \(LEFT TOP RIGHT BOTTOM\) from several edges.
We need this to import annotations and to get marked-up text, because annotations
are referenced by its edges, but functions for these tasks need region."

  (let ((left0 (nth 0 (car edges)))
        (top0 (nth 1 (car edges)))
        (bottom0 (nth 3 (car edges)))
        (top1 (nth 1 (car (last edges))))
        (right1 (nth 2 (car (last edges))))
        (bottom1 (nth 3 (car (last edges))))
        (n (safe-length edges)))
    ;; we try to guess the line height to move
    ;; the region away from the boundary and
    ;; avoid double lines
    (list left0
          (+ top0 (/ (- bottom0 top0) 2))
          (- bottom1 (/ (- bottom1 top1) 2 )))))

(defun pdf-annot-markups-as-org-text (pdfpath &optional title level)
  "Acquire highligh annotations as text, and return as org-heading"

  (interactive "fPath to PDF: ")  
  (let* ((outputstring "") ;; the text to be returned
         (title (or title (replace-regexp-in-string "-" " " (file-name-base pdfpath ))))
         (level (or level (1+ (org-current-level)))) ;; I guess if we're not in an org-buffer this will fail
         (levelstring (make-string level ?*)) ;; set headline to proper level
         (annots (sort (pdf-info-getannots nil pdfpath)  ;; get and sort all annots
    ;; create the header
    (setq outputstring (concat levelstring " Quotes From " title "\n\n")) ;; create heading

    ;; extract text
     (lambda (annot) ;; traverse all annotations
       (if (eq 'highlight (assoc-default 'type annot))
           (let* ((page (assoc-default 'page annot))
                  ;; use pdf-annot-edges-to-region to get correct boundaries of highlight
                  (real-edges (pdf-annot-edges-to-region
                               (pdf-annot-get annot 'markup-edges)))
                  (text (or (assoc-default 'subject annot) (assoc-default 'content annot)
                            (replace-regexp-in-string "\n" " " (pdf-info-gettext page real-edges nil pdfpath)
                                                      ) ))

                  (height (nth 1 real-edges)) ;; distance down the page
                  ;; use pdfview link directly to page number
                  (linktext (concat "[[pdfview:" pdfpath "::" (number-to-string page) 
                                    "++" (number-to-string height) "][" title  "]]" ))
             (setq outputstring (concat outputstring text " ("
                                        linktext ", " (number-to-string page) ")\n\n"))
    outputstring ;; return the header

Using in Org with a Source Block

Now it’s more or less trivial to quickly generate the org headers using a source block:

#+BEGIN_SRC elisp :results output raw :var level=(1+ (org-current-level))
(mwp/pdf-multi-extract '(
                   "/home/matt/HackingHistory/readings/Troper-becoming-immigrant-city.pdf"  "/home/matt/HackingHistory/readings/historical-authority-hampton.pdf"))


And the output gives something like


Quotes From Troper becoming immigrant city

Included in the Greater Toronto Area multiethnic mix are an estimated 450,000 Chinese, 400,000 Italians, and 250,000 African Canadians, the largest component of which are ofCar- ibbean background, although a separate and distinct infusion of Soma- lis, Ethiopians, and other Africans is currently taking place. (Troper becoming immigrant city, 3)

Although Toronto is Canada’s leading immigrant-receiving centre, city officials have neither a hands-on role in immigrant selection nor an official voice in deciding immigration policy. In Canada, immigration policy and administration is a constitutional responsibility of the fed- eral government, worked out in consultation with the provinces. (Troper becoming immigrant city, 4)


Alternative: Temporary buffer with custom link type

An alternative workflow would be to pop to a second, temporary buffer and insert the annotations there; one could do this with a custom link type. PDF-Tools already has a mechanism for listing annotations in a separate buffer, but it’s not designed for quick access to all annotations at once. Anyway, here’s one way to do this; I’m not really using it at the moment.

(org-add-link-type "pdfquote" 'org-pdfquote-open 'org-pdfquote-export)

(defun org-pdfquote-open (link)
  "Open a new buffer with all markup annotations in an org headline."
   (format "*Quotes from %s*"
           (file-name-base link)))
  (insert (pdf-annot-markups-as-org-text link nil 1))
  (goto-char 0)

(defun org-pdfquote-export (link description format)
  "Export the pdfview LINK with DESCRIPTION for FORMAT from Org files."
  (let* ((path (when (string-match "\\(.+\\)::.+" link)
                 (match-string 1 link)))
         (desc (or description link)))
    (when (stringp path)
      (setq path (org-link-escape (expand-file-name path)))
       ((eq format 'html) (format "<a href=\"%s\">%s</a>" path desc))
       ((eq format 'latex) (format "\href{%s}{%s}" path desc))
       ((eq format 'ascii) (format "%s (%s)" desc path))
       (t path)))))

(defun org-pdfquote-complete-link ()
  "Use the existing file name completion for file.
Links to get the file name, then ask the user for the page number
and append it."                                  

  (replace-regexp-in-string "^file:" "pdfquote:" (org-file-complete-link)))

I’ve also added two bindings to make highlighting easier from the PDF buffer:

(eval-after-load 'pdf-view 
                    '(define-key pdf-view-mode-map (kbd "M-h") 'pdf-annot-add-highlight-markup-annotation))
(eval-after-load 'pdf-view 
                    '(define-key pdf-view-mode-map (kbd "<tab>") 'pdf-annot-add-highlight-markup-annotation))

All of this is getting me very close to using Emacs for all my PDF work. I am doing maybe 50% of my PDF work in Emacs instead of on my tablet. It’s incredibly convenient, although I still find it a little harder to concentrate on my laptop than on the tablet (for reasons ergonomic, optical, and psychological). Here are the remaining papercuts from my perspective:

  • Highlighting text with the mouse is more awkward and less intimate than using my fingertip on a laptop. I often find mouse movement a little awkward in Emacs, but pdf-view purposely relies on the mouse for movement (for good reasons).
  • Scrolling in pdf-view is also a bit awkward, and there’s no “continuous” mode as one might find in Evince or acroread. Again, I often find scrolling an issue in Emacs, so this might not be so easy to fix.
  • Finally, the laptop screen is just harder on my eyes than my high-res tablet. pdf-view hasa “midnight mode” which makes it a little easier to read, but it’s not quite enough.

So, for the time being I will probably do much of my reading on the tablet. But for short pieces and for review (e.g., papers that I’m reading for the third year in a row in a graduate seminar…) PDF Tools is now my main interface. Which is pretty sweet.


UPDATE: I would like to extend the pdfview link type (in org-pdfview) to permit me to specify the precise location of an annotation, so I can jump precisely to that part of the page. This has now been done and the code above has been updated to reflect the new syntax.

UPDATE: Also, now that I think about it, it might be interesting to just have a link type that pops up a temporary buffer with all of the annotations; I could then cut and paste the annotations into the master document. This might be even more convenient. OK, I’ve implemented this, see above!

Update 2015-11-22 Sun: I’ve cleaned up some of the code, and added a bit more commentary at the end.