Copy highlighted text into comments from a PDF file

One of the nice features of Acrobat is that you can highlight text and then export only the highlighted parts into a different document. To do that, however, the user has to remember to tick the option "Copy selected text into Highlight, Cross-Out, and Underline comment pop ups" in Edit - Preferences - Commenting. Unfortunately, this setting is off by default and is only available in Acrobat (8, 9 and X), not in Acrobat Reader, as said here.

It can happen that we highlight a really big document in order to export the highlighted parts, and only then remember that the "Copy selected text..." option was actually off! Our highlights won't contain any comment text, meaning that we cannot export them (actually, we *can* export them, but they will be just empty boxes). There is no way to retroactively copy all the highlighted text into the comments from the graphical interface.

screensh2

However, it is possible to solve this problem with some code. This solution will also work if you have Acrobat Reader!

There is already a piece of software online that does this, on this website, but the price is really high (from $40 to $75!), so I developed my own script and I am putting it out for free:

HOW TO RETROACTIVELY COPY HIGHLIGHTED TEXT INTO COMMENTS IN ACROBAT PDF

1. OPEN JAVASCRIPT EDITOR. First of all, open the document with the highlighted text and press ctrl+j. In Acrobat Pro, this should open the JavaScript editor window. If it works, just go to step 2; otherwise keep reading.

If ctrl+j doesn't do anything, then you don't have Acrobat Pro, but don't worry: you can still open the JavaScript editor. Go to this website and download the Reader JavaScript Console Window. Save the file anywhere and follow the instructions in the ReadMe file. Once that is done, you should see a new entry in the PDF menu, "Extension", and clicking on "Debugger" should open the JavaScript editor window. Done!

2. COPY THE SCRIPT. Delete any text in the bottom window, then copy and paste this code:

var annots = this.getAnnots({nSortBy: ANSB_Page});
console.println("\nAnnot Report for document: " + this.documentFileName);
if ( annots != null ) {
console.println("Number of Annotations: " + annots.length);
var annotList=[];
for (var i = 0; i < annots.length;i++) {
        
        var annotTxt="";
        while (i<annots.length && annots[i].type!="Highlight") {
            annotTxt="****";
            i=i+1;
        }
        if (i>=annots.length) {
             break;
        }
      
        pageNum=annots[i].page;
        var quadAN=annots[i].quads.toString();
        var qaAN=quadAN.split(",");
        for (var ii=0; ii<qaAN.length; ii++) {
            qaAN[ii]=parseFloat(qaAN[ii]);
        }
        for (var w=0; w<getPageNumWords(pageNum); w++) {
            var quadWD=getPageNthWordQuads(pageNum,w).toString();
            //console.println("qWD type: " + typeof quadWD + ": " + quadWD)
            var qaWD=quadWD.split(",");
            for (var ii=0; ii<qaWD.length; ii++) {
                qaWD[ii]=parseFloat(qaWD[ii]);
            }
            nLines=qaAN.length/8;
            counter=0;
            for (var nn=0; nn<nLines; nn++) {
                if (qaWD[0]>=qaAN[counter+0]-4.5 &&
                    qaWD[1]<=qaAN[counter+1]+0.5 &&
                    qaWD[2]<=qaAN[counter+2]+4.5 &&
                    qaWD[3]<=qaAN[counter+3]+0.5 &&
                    qaWD[4]>=qaAN[counter+4]-4.5 &&
                    qaWD[5]>=qaAN[counter+5]-0.5 &&
                    qaWD[6]<=qaAN[counter+6]+4.5 &&
                    qaWD[7]>=qaAN[counter+7]-0.5) {
                    annotTxt=annotTxt+" "+getPageNthWord(pageNum,w);
                }
            counter=counter+8;
            }
        counter=0;
            
        }
        //UNcomment one line below if you want to show information about the annotations
        //annotTxt="ANNOT N." + i + " PAGE NUM: " + (pageNum+1) + " : " + annotTxt;
        annots[i].contents=annotTxt;
        annotList[annotList.length]=annotTxt;
        //UNcomment the line below if you want to print on the screen the annotations
                //console.println(annotTxt)

}

} else
console.println(" No annotations in this document.");

console.println("DONE!")

Select all the code (ctrl+a) and execute it (ctrl+enter). You are done! If everything went right, the highlights will now contain comments, as shown in the figure:

screensh3

DONE.
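
If you want to double-check the result, something like the sketch below counts the highlights whose comment is still empty. It is only an illustration: countEmptyHighlights is a name I made up here, and the sample array merely imitates the type and contents properties of real annotations so the snippet can run anywhere.

```javascript
// Hypothetical helper (not part of the script above): counts highlight
// annotations whose comment text is still empty.
function countEmptyHighlights(annots) {
    var n = 0;
    // getAnnots() returns null when there are no annotations, so guard for it.
    for (var i = 0; annots && i < annots.length; i++) {
        if (annots[i].type == "Highlight" && !annots[i].contents) {
            n++;
        }
    }
    return n;
}

// Made-up sample data standing in for real annotation objects.
var sample = [
    { type: "Highlight", contents: "some copied text" },
    { type: "Highlight", contents: "" },
    { type: "Text", contents: "" }
];
var stillEmpty = countEmptyHighlights(sample); // 1
```

In the Acrobat console you would pass the real annotations instead, for example console.println(countEmptyHighlights(this.getAnnots()));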

Now you can easily export the highlighted annotations to a PDF or Word document. You can use the Acrobat option for exporting the comments, as explained in this video. However, with the code we have now, you don't really need to do that: you can just uncomment (delete the double slash, //) line 53, the one reading //console.println(annotTxt), and execute the script again. This line prints the highlighted words it found to the console. If you also want to show some information about each highlight (just the annotation number and page), uncomment line 49, the one starting with //annotTxt="ANNOT N.". Execute the code and copy-paste the output anywhere. However, if you want, you can still use a more classic way, explained for example

The process of generating the comments can take some time, so be patient, even if the editor and the PDF look frozen! If you really believe that the editor got stuck, try pressing ctrl+shift+esc to interrupt the execution. If you have a lot of highlighted annotations (let's say, more than a hundred), it is a lot better to process only a chunk of them at a time. To do that, change line 6 (the for loop over the annotations). For example, change it first to

for (var i = 0; i < 100;i++)

then to

for (var i = 100; i < 200;i++)

But always make sure that the upper bound is not higher than the total number of annotations.
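
One way to avoid overshooting is to clamp the upper bound with Math.min. The sketch below is only an illustration: chunkRanges is a helper name I invented, not something the script or Acrobat provides. It lists the start and end indexes you would plug into the for loop, one pair per run.

```javascript
// Hypothetical helper: returns the [start, end) index pairs needed to
// cover `total` annotations in batches of at most `size`.
function chunkRanges(total, size) {
    var ranges = [];
    for (var start = 0; start < total; start += size) {
        // Math.min keeps the end index from exceeding the annotation count.
        ranges.push([start, Math.min(start + size, total)]);
    }
    return ranges;
}

// Example: 250 annotations processed 100 at a time.
var ranges = chunkRanges(250, 100);
// ranges is [[0, 100], [100, 200], [200, 250]]
```

Inside the script itself the same idea would read for (var i = 200; i < Math.min(300, annots.length); i++), so the last chunk simply stops at the final annotation.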

PERSONAL NOTES:

I actually needed to export highlighted text because I was building a piece of software to create the keyword index of a book, and the keywords were the highlighted parts of the text itself. Since I had never used JavaScript before, this turned out to be quite a challenge. The way to get the text is extremely convoluted, and it's really incredible that the Acrobat API doesn't have an easy way to do that. Briefly, I take the coordinates of the boxes describing the position of each annotation (the highlighted text). If the annotation runs over multiple lines, I get multiple sets of coordinates and split them into chunks of 8. Fortunately, the annotation object also gives me the page it is on. For each annotation, I go through all the words on that page, take the coordinates of each word, and check whether they lie within the coordinates of the annotation. Strangely, the y coordinate starts from the bottom of the page instead of from the top, as is common in programming languages. This method also works with words that are split across two lines.
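
To make the "chunks of 8" idea concrete, here is a small sketch with made-up coordinates. The helper names splitQuads and quadToBox are mine, and I am assuming each chunk holds four x,y corner pairs, which matches how the script indexes them. Note how the top of a box is the larger y value, because y grows upward.

```javascript
// Splits a flat list of quad coordinates into one 8-number group per
// highlighted line. Hypothetical helper, for illustration only.
function splitQuads(flat) {
    var groups = [];
    for (var i = 0; i + 8 <= flat.length; i += 8) {
        groups.push(flat.slice(i, i + 8));
    }
    return groups;
}

// Reduces one 8-number quad (four x,y corner pairs) to its bounding box.
// Because y grows upward on the page, "top" is the larger y.
function quadToBox(q) {
    var xs = [q[0], q[2], q[4], q[6]];
    var ys = [q[1], q[3], q[5], q[7]];
    return {
        left: Math.min.apply(null, xs),
        right: Math.max.apply(null, xs),
        bottom: Math.min.apply(null, ys),
        top: Math.max.apply(null, ys)
    };
}

// Made-up coordinates: one highlight spanning two lines of text,
// in the same comma-separated form the script gets from toString().
var flat = "100,700,300,700,100,688,300,688,72,686,180,686,72,674,180,674"
    .split(",").map(parseFloat);

var lines = splitQuads(flat);     // two groups of 8 numbers
var first = quadToBox(lines[0]);  // left 100, right 300, bottom 688, top 700
```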

I also noticed that, for some reason, if a word is next to punctuation, its reported coordinates are wider than the word itself. For example, for the word -hi- (within the dashes), the word coordinates start and end where the dashes start and end, even though the word actually returned is only hi. Therefore, if only hi is highlighted, the word will not be recognized, because its coordinates lie outside those of the highlighted box. To solve this, I slightly enlarged the boxes (the +-4.5 and +-0.5 in the code), and this solved the problem completely.
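
Here is a simplified sketch of that tolerance at work. It compares plain bounding boxes instead of the corner-by-corner checks in the script, boxContains is a name I made up, and the numbers are invented to mimic the -hi- case.

```javascript
// Returns true when the word box lies inside the highlight box once the
// highlight is grown by padX horizontally and padY vertically.
function boxContains(highlight, word, padX, padY) {
    return word.left >= highlight.left - padX &&
           word.right <= highlight.right + padX &&
           word.bottom >= highlight.bottom - padY &&
           word.top <= highlight.top + padY;
}

// Made-up numbers: the highlight covers only "hi", but the reported word
// box also spans the two dashes of "-hi-".
var highlight = { left: 104, right: 114, bottom: 688, top: 700 };
var word = { left: 100, right: 118, bottom: 688, top: 700 };

var strict = boxContains(highlight, word, 0, 0);     // false: word box is wider
var padded = boxContains(highlight, word, 4.5, 0.5); // true: padding absorbs the dashes
```

The padding is a trade-off: a larger horizontal value catches more punctuation, but it may also pull in a short neighbouring word.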

Anyway, my first experience with JavaScript is totally positive. The language is intuitive and extremely similar to Java and C++, which I am more confident with. The real problem was the Acrobat API, which seems to behave in an unpredictable way.

-----------------------------

A big thanks to Francisco Morales who, with this post, pointed me in the right direction to develop this script.

If you appreciate my effort, please leave a comment!
If you have any problem with the script, don't be afraid to ask! 🙂