Copy highlighted text into comments from a PDF file

One of the nice features of Acrobat is that you can highlight text and then export only the highlighted part into a different document.  However, in order to do that, the user has to remember to tick the option "Copy selected text into Highlight, Cross-Out, and Underline comment pop ups" in Edit - Preferences - Commenting. Unfortunately, this setting is not on by default and only available  in Acrobat (8,9 and X) but not in Acrobat Reader, as said here.

It can happens that we highlight a really big document in order to export the highlighted parts, and then we remember that the "Copy selected text..." was actually off! Our highlighted parts won't be commented, meaning that we cannot export them (actually, we *can* export them, but they will be just empty boxes). There is no way to retroactively  copy all the highlighted part into comment from the graphic interface.

screensh2

However, it is possible to solve this problems by using some code. This solution will also work if you have Acrobat Reader!

There is already a software online that does that, on this website, but the price is really high (from $40 to $75!), so I developed my own script and I am going to put it out for free:

HOW TO RETROACTIVELY COPY HIGHLIGHTED TEXT INTO COMMENTS IN ACROBAT PDF

1. OPEN JAVASCRIPT EDITOR. First of all open the document with the highlighted text, and press ctrl+j. In Acrobat Pro, this should open the JavaScript editor window. If it works, just go to point 2, otherwise keep reading.

It ctrl+j doesn't do anything, than you don't have Acrobat Pro, but don't worry, you can still open the JavaScript editor. Go on this website and download the Reader JavaScript Console Window. Download the file anywhere, and follow the instruction in the ReadMe file. Once done that, you should see a new voice in the PDF menu, "Extension", and clicking on "Debugger" should open the JavaScript editor window. Done!

2. COPY THE SCRIPT. Delete any text on the bottom window, and copy and paste this code:

var annots = this.getAnnots({nSortBy: ANSB_Page});
console.println("nAnnot Report for document: " + this.documentFileName);
if ( annots != null ) {
console.println("Number of Annotations: " + annots.length);
var annotList=[];
for (var i = 0; i < annots.length;i++) {
        
        var annotTxt="";
        while (annots[i].type!="Highlight") {
            annotTxt="****";
            i=i+1;
        }
        if (i>=annots.length) {
             break;
        }
      
        pageNum=annots[i].page;
        var quadAN=annots[i].quads.toString();
        var qaAN=quadAN.split(",");
        for (var ii=0; ii<qaAN.length; ii++) {
            qaAN[ii]=parseFloat(qaAN[ii]);
        }
        for (var w=0; w<getPageNumWords(pageNum); w++) {
            var quadWD=getPageNthWordQuads(pageNum,w).toString();
                        //console.println("QWD TYPE: " + typeof quadWD + ": " + quadWD)
            var qaWD=quadWD.split(",");
            for (var ii=0; ii<qaWD.length; ii++) {
                qaWD[ii]=parseFloat(qaWD[ii]);
            }
            
            var nlines=qaAN.length/8; counter=0;
            for (var nn=0; nn<nlines; nn++) {
                if (qaWD[0]>=qaAN[counter+0]-4.5 &&
                    qaWD[1]<=qaAN[counter+1]+0.5 &&
                    qaWD[2]<=qaAN[counter+2]+4.5 &&
                    qaWD[3]<=qaAN[counter+3]+0.5 &&
                    qaWD[4]>=qaAN[counter+4]-4.5 &&
                    qaWD[5]>=qaAN[counter+5]-0.5 &&
                    qaWD[6]<=qaAN[counter+6]+4.5 &&
                    qaWD[7]>=qaAN[counter+7]-0.5) {
                    annotTxt=annotTxt+" "+getPageNthWord(pageNum,w);
                }
            counter=counter+8;
            }
        counter=0;
            
        }
        //UNcomment one line below if you want to show information about the annotations
        //annotTxt="ANNOT N." + i + " PAGE NUM: " + (pageNum+1) + " : " + annotTxt;
        annots[i].contents=annotTxt;
        annotList[annotList.length]=annotTxt;
        //UNcomment the line below if you want to print on the screen the annotations
                //console.println(annotTxt)

}

} else
console.println(" No annotations in this document.");

console.println("DONE!")

Select all the code (ctrl+a) and execute it (ctrl+enter). You have done! If everything has been done right, the highlighted part will now contain comments, as shown in figure:

screensh3

DONE.

Now you can be easily export the highlighted annotations on a PDF or word document. You can use the Acrobat option of exporting contents, as explained in this video. However, with the code we have now, you don't really need to do that: you can just uncomment (delete the double slash, //)  line 53 and execute the script again. This line will print on the screen the found highlighted words. If you want to also show some information about the highlighted part (just the annotation number and page) uncomment line 49. Execute the code and copy-paste the output anywhere. However, if you want, you can still use a more classic way, explained for example

The process of generating the comments could take some time, so be patient, even if the edito and the PDF will look freezed! If you really believe that the editor got stuck, try to press ctrl+shift+esc to interrupt the execution. If you have a lot of highlighted annotations (let's say, more than a hundres), it is a lot better to execute only chunk of text. To do that, change line 6 . For example, change it first to

for (var i = 0; i < 100;i++)

then to

for (var i = 100; i < 200;i++)

But always make sure that the second index will be not higher that the tot number of annotations.

PERSONAL NOTES:

I actually needed to export highlighted text because I was building a software to create the keywords index of a book, and the keywords were actually the highlighted part of the text itself. Since I never never used JavaScript before, this turned out to be quite a challenge. The way to get the text is extremely convoluted, and it' s really incredible that the Acrobat API doesn't have an easy way to do that. Briefly, I take the coordinate of the boxes describing the position of the annotations (the highlighted text). If the annotation goes through multiple lines, I'll get multiple coordinates and split them in chunk of 8. Fortunately, the annotation object also give me the page it is present. For each annotation, I will go through all the words within that page, take the coordinate of the words, and check if they are within the coordinates of the annotations. Strangely, the y coordinate starts from the bottom of the screen, instead the from above, as common in programming languages. This method will work also with words that are separated across two line.

I also noticed that, for some reason, if the word is close to some punctuation, the coordinates will mistakenly report an higher value (for example, for the word -hi- (within the dashes) the word coordinates will start and end when the dashes start and end (even if the actual word read if only hi). Therefore, if the only highlighted part is hi, the word will not be recognized, because the coordinates of the word will lie outside the coordinates of the highlighted box. To solve this problem, I just slightly increased the coordinates for the boxes (the +-4.5 or +-0.5 in the code), and this solved the problem completely.

Anyway, my first experience with JavaScript is totally positive. The language is intuitive and extremely similar to Java and C++, which I am more confident with. The real problem was the Acrobat API, which seems to behave in an unpredictable way.

-----------------------------

A big thanks to Francisco Morales which, with this post, pointed me to the right direction to develop this script.

If you appreciate my effort, please leave a comment!
If you have any problem with the script, don't be afraid to ask! 🙂

12 thoughts on “Copy highlighted text into comments from a PDF file

  1. Hi Valerio, thanks for sharing the code. It works and solves my problem. I really appreciate your effort and altruism. By the way, is that possible to make it a batch processing action script? Thanks again.

  2. Hello Valerio, your code is great, helped a lot with the task that I have in my hands now.
    Still would like to comment about 2 things I found out:
    1) some words seems to be missing, I just began to read the copied text so I still didn't try to identify a pattern to see if it could be improved on the code; and
    2) Is copying text only, no punctuations (" " - . , etc)
    Still a great code! Thank you for sharing!

    • Hi Vasco,

      that's strange. I tried my software with a file of 100+ pages, and it works (with some limitation, as said by the previous comments).
      I have no idea why it doesn't work for you. Maybe you can send me an example of your pdf file?

  3. Hi Valerio,
    Sorry for the late reply. I found that it does not work for cases where the words are highlighted using other non-adobe pdf software. In my case, I tried the script on a pdf file which was annotated using my GoodReader app in my iPad. It is only able to return highlighted words in the abstract and section headings, which are of different fonts compared to the main text. I suspect the cause is a combination of PDF software and font size. To verify if the problem comes from PDF software, I used a same pdf from http://www.mitpressjournals.org/doi/abs/10.1162/089976698300016963#.VI5yuHvjluQ. Then I highlighted a few sentences using PDF-XChange Viewer and Adobe Reader with, and without enabling the copy annotation feature. Next, I ran your script on the annotated file. I found that only the empty annotated texts annotated using Adobe Reader were copied, but not those annotated with PDF-XChange Viewer. Besides, those originally copied annotated text by PDF-XChange Viewer were overwritten. Hope this explains the problem clearly. I can send you a sample annotated file if you need it. Thank you again for this useful scripts.

  4. Is anyone interested in the Java version used to extract only the highlight annotations? My version seems to run very fast using the PDFBox library. One of the reasons I wanted to redo it in Java because the JS example was putting my CPU to 100% for like 15 minutes, for just 40 pages of PDF. My laptop got hot like crazy.

    • Hi B.Mike. If you have any faster solution, please let me know. I've tried the first code, but it's too slow to be practical at large number of highlighted words. Any help is appreciated.

      • Hello, sorry for the late response. Bellow there is the java source code to extract annotations using apache pdf box library. There is only a small problem. Because of the various fonts used in pdfs you need to adjust/center the highlight rectangle manually. See the lines of code with the comment: "adjust this offset". Once you managed to center it, it will work for the entire document. If you run the code as it is without adjusting the rectangle you will see that it works but it will output some other lines that you didn't highlight or trim the margins etc... Or maybe you can find a solution to take the font size into consideration when defining the rectangle. For me it was out of my scope 🙂
        Good luck.
        ----------------------------------------------------------------------------------------
        package com.pdf;

        import java.awt.geom.Rectangle2D;
        import java.io.File;
        import java.util.List;
        import org.apache.commons.lang3.StringUtils;

        import org.apache.pdfbox.pdmodel.PDDocument;
        import org.apache.pdfbox.pdmodel.PDPage;

        import org.apache.pdfbox.pdmodel.common.PDRectangle;
        import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
        import org.apache.pdfbox.util.PDFTextStripperByArea;

        public class ExtractAnnotations {

        private ExtractAnnotations() {
        }

        public static void main(String[] args) throws Exception {
        PDDocument doc = PDDocument.load(new File("/home/mike/Desktop/test.pdf"));
        try {
        List allPages = doc.getDocumentCatalog().getAllPages();
        for (int i = 0; i < allPages.size(); i++) {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        PDPage page = (PDPage) allPages.get(i);
        List annotations = page.getAnnotations();
        //first setup text extraction regions
        for (int j = 0; j < annotations.size(); j++) {
        PDAnnotation annot = (PDAnnotation) annotations.get(j);
        PDRectangle rect = annot.getRectangle();
        //need to reposition link rectangle to match text space
        float x = rect.getLowerLeftX()-100; //adjust this offset
        float y = rect.getUpperRightY()+70;//adjust this offset
        float width = rect.getWidth()+50;//adjust this offset
        float height = rect.getHeight();
        int rotation = page.findRotation();
        if (rotation == 0) {
        PDRectangle pageSize = page.findMediaBox();
        y = pageSize.getHeight() - y;
        } else if (rotation == 90) {
        //do nothing
        }
        Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
        stripper.addRegion("" + j, awtRect);
        }

        stripper.extractRegions(page);

        for (int j = 0; j < annotations.size(); j++) {
        String extract = stripper.getTextForRegion("" + j);
        if (StringUtils.isNoneBlank(extract)) {
        System.out.println("**************"); //a separator
        System.out.println(extract);
        }

        }
        }

        } finally {
        if (doc != null) {
        doc.close();
        }
        }

        }
        }

  5. i did copy and paste as above but still not work.while giving control +enter at that moment showing.following things
    annots is not defined
    1:Console:Exec
    ReferenceError: annots is not defined
    1:Console:Exec
    undefined

Leave a Reply

Your email address will not be published. Required fields are marked *