====== Generate BOOK files from PDF (Word) ======

Each book starts as a directory (e.g. ~/Pubb2012/openbess_TO082-00001) that contains exactly one PDF file (e.g. 001.pdf) and DublinCore.txt.
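
The starting layout can be checked before any script is run. The following is only a sketch: //check_book_dir// is a hypothetical helper that is not part of the original scripts, and it assumes the layout described above (one PDF plus DublinCore.txt directly inside the book directory).
<WRAP prewrap center>
<code bash>
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original scripts):
# succeeds only if the book directory holds DublinCore.txt and exactly one PDF.
check_book_dir() {
   local dir="$1" pdfs
   if [ ! -f "$dir/DublinCore.txt" ]; then
      echo "missing DublinCore.txt in $dir" >&2
      return 1
   fi
   # Count the regular *.pdf files directly inside the directory
   pdfs=$(find "$dir" -maxdepth 1 -type f -name '*.pdf' | wc -l | tr -d ' ')
   if [ "$pdfs" -ne 1 ]; then
      echo "expected 1 PDF in $dir, found $pdfs" >&2
      return 1
   fi
}
</code>
</WRAP>
For example, //check_book_dir ~/Pubb2012/openbess_TO082-00001// prints nothing and returns 0 when the directory is ready for processing.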

  * **On Desktop** with LibreOffice and Docsplit ([[http://documentcloud.github.com/docsplit/]])

Call the script //pdfTOtxt.sh// with the directory that contains the book directories as its parameter:
<WRAP prewrap center>
<code>
giancarlo@ubuntud:~$ ./pdfTOtxt.sh Pubb2012/
</code>
</WRAP>

<WRAP prewrap center>
<code bash pdfTOtxt.sh>
#!/bin/bash

# Directory that contains the openbess* book directories
bdir=$1
cd "$bdir" || exit 1

SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
# -maxdepth 0 keeps only the book directories themselves, not their subdirectories
for bookdir in $(find openbess* -maxdepth 0 -type d);
do
   echo "$bookdir"
   cd "$bookdir"
   n=0
   SAVEIFS=$IFS
   IFS=$(echo -en "\n\b")
   for nfile in $(find *.pdf -type f);
   do
      let "n += 1"
      filedoc="$nfile"
   done
   if [ "$n" -ne 1 ]
   then
      echo "ERROR: expected exactly one PDF file in $bookdir, found $n" >&2
      exit 1
   fi
   IFS=$SAVEIFS
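   # Rename the PDF to doc.pdf (skipped when it already has that name,
   # because cp followed by rm would delete it)
   [ "$filedoc" = "doc.pdf" ] || mv "$filedoc" doc.pdf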
   
   docsplit text --pages all --no-ocr --no-clean --output OCR/ doc.pdf
   
   # Stop if Docsplit did not create the OCR directory
   cd OCR || exit 1
   n=0
   SAVEIFS=$IFS
   IFS=$(echo -en "\n\b")
   for nfile in $(find *.txt -type f);
   do
      numer=${nfile#doc_}
      numero=${numer%\.txt}
      sn=$(printf "%04d" $numero)
      
      tr -d '\f' < "$nfile" > "$sn".txt
      rm "$nfile"
      
      echo "$sn"" DONE"
   done
   IFS=$SAVEIFS
  
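   # Return from <book>/OCR to the base directory, so that the next
   # (relative) $bookdir still resolves
   cd ../..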
   echo "DONE **************************""$bookdir"
done
exit
</code>
</WRAP>
The script creates an OCR directory in every book directory, with one text file per PDF page (named 0001.txt, 0002.txt, ...).
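
The renaming step can be looked at in isolation. This is a sketch only: //page_name// is a hypothetical helper, and it assumes that Docsplit names the page files doc_1.txt, doc_2.txt, ... as the script above expects.
<WRAP prewrap center>
<code bash>
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original script):
# maps a Docsplit page file name to the zero-padded 4-digit name.
page_name() {
   local base="${1##*/}"     # drop any leading directory
   local num="${base#doc_}"  # strip the "doc_" prefix
   num="${num%.txt}"         # strip the ".txt" suffix
   # Remove leading zeros so that printf does not read e.g. "08" as octal
   while [ "${num#0}" != "$num" ] && [ "${#num}" -gt 1 ]; do
      num="${num#0}"
   done
   printf '%04d.txt\n' "$num"
}
</code>
</WRAP>
With this helper, //page_name doc_7.txt// prints 0007.txt.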
\\
\\
Copy the book directories (e.g. openbess_TO082-00001) to the back-end server, for example into the /srv/data/bookforingest directory.

  * **On Back-end server** with ImageMagick and pdftk

Call the script //pdfatiff.sh// with the directory that contains the book directories as its parameter:
<WRAP prewrap center>
<code>
# ./pdfatiff.sh /srv/data/bookforingest
</code>
</WRAP>

<WRAP prewrap center>
<code bash pdfatiff.sh>
#!/bin/bash

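# Absolute path of the directory that contains the openbess* book directories
# (absolute because the script changes directory before it uses the PDF path)
bdir=$(cd "$1" && pwd) || exit 1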

SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for bookdir in $(find "$bdir/"openbess* -maxdepth 0 -type d );
do
   
   echo "$bookdir"
   n=0
   SAVEIFS=$IFS
   IFS=$(echo -en "\n\b")
   for nfile in $(find "$bookdir/"*.pdf -type f);
   do
      let "n += 1"
      filepdf="$nfile"
   done
   if [ "$n" -ne 1 ]
   then
      echo "ERROR: expected exactly one PDF file in $bookdir, found $n" >&2
      exit 1
   fi
   
   mkdir "$bookdir""/pdfs"
   cp "$filepdf" "$bookdir""/pdfs"
   cd "$bookdir""/pdfs"
   
   pdftk "$filepdf" burst output pg-%04d.pdf
   
   n=0
   SAVEIFS=$IFS
   IFS=$(echo -en "\n\b")
   for nfile in $(find pg-*.pdf -type f);
   do
      let "n += 1"
      sn=$(printf "%04d" $n)
      filepdf="$nfile"
      echo "$filepdf"" -> ""$sn.tif"
      
      pdftk "$filepdf" output "temp.pdf"
      
      # For PDF from image
      #	convert -density 150 "temp.pdf" "$sn.tif"
      # For PDF from Word
      convert -background white -flatten -density 600 -resize 1200 -border 0.5% -bordercolor LightGray "temp.pdf" "../""$sn.tif"
      rm "temp.pdf"
   done
   # Leave pdfs/ before deleting it
   cd ..
   rm -R "$bookdir""/pdfs"
done
exit
</code>
</WRAP>
The script creates one TIFF file per PDF page (named 0001.tif, 0002.tif, ...) in every book directory.
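
Since both scripts number the pages in the same way, a book can be checked for consistency before ingesting. A minimal sketch, assuming that the OCR text files and the TIFF files use the 4-digit names described on this page; //check_pages// is a hypothetical helper, not part of the original scripts.
<WRAP prewrap center>
<code bash>
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original scripts):
# compares the number of OCR text pages with the number of TIFF pages.
check_pages() {
   local dir="$1" ntxt ntif
   ntxt=$(find "$dir/OCR" -maxdepth 1 -type f -name '[0-9][0-9][0-9][0-9].txt' | wc -l | tr -d ' ')
   ntif=$(find "$dir" -maxdepth 1 -type f -name '[0-9][0-9][0-9][0-9].tif' | wc -l | tr -d ' ')
   if [ "$ntxt" -gt 0 ] && [ "$ntxt" -eq "$ntif" ]; then
      echo "OK $dir: $ntxt pages"
   else
      echo "MISMATCH $dir: $ntxt txt, $ntif tif" >&2
      return 1
   fi
}
</code>
</WRAP>
A call such as //check_pages /srv/data/bookforingest/openbess_TO082-00001// returns a non-zero status when the two counts differ or when no pages are found.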
\\
\\
The book is now ready for ingesting.