====== Generate BOOK files from PDF (Word) ======
Each book starts as a directory (e.g. ~/Pubb2012/openbess_TO082-00001) that contains 001.pdf and DublinCore.txt.
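Both scripts on this page stop when a book directory does not contain exactly one PDF, so it can help to check the starting layout first. The following is a minimal sketch, not part of the original workflow; the function name check_book_dir is hypothetical, and it assumes the PDF uses a lowercase .pdf extension.

```bash
# Hypothetical check: succeed only if a book directory holds exactly
# one PDF file and a DublinCore.txt file.
check_book_dir() {
    local dir="$1" count=0 f
    for f in "$dir"/*.pdf; do
        # The glob stays literal when nothing matches, so test each entry.
        [ -f "$f" ] && count=$((count + 1))
    done
    [ "$count" -eq 1 ] && [ -f "$dir/DublinCore.txt" ]
}
```

For example, check_book_dir ~/Pubb2012/openbess_TO082-00001 returns a non-zero status if the PDF is missing, duplicated, or DublinCore.txt is absent.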
* **On Desktop** with LibreOffice and Docsplit ([[http://documentcloud.github.com/docsplit/]])
Call the script //pdfTOtxt.sh// with the directory that contains the book directories as its parameter:
giancarlo@ubuntud:~$./pdfTOtxt.sh Pubb2012/
#!/bin/bash
bdir=$1
cd "$bdir"
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for bookdir in $(find openbess* -maxdepth 0 -type d);
do
echo "$bookdir"
cd "$bookdir"
n=0
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for nfile in $(find *.pdf -type f);
do
let "n += 1"
filedoc="$nfile"
done
if [ "$n" -ne 1 ]
then
echo "ERROR: PDF file is not unique"
exit 1
fi
IFS=$SAVEIFS
# Rename in one step; unlike cp followed by rm, this cannot delete a file that is already named doc.pdf
mv "$filedoc" doc.pdf
docsplit text --pages all --no-ocr --no-clean --output OCR/ doc.pdf
cd OCR
n=0
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for nfile in $(find *.txt -type f);
do
numer=${nfile#doc_}
numero=${numer%\.txt}
sn=$(printf "%04d" $numero)
tr -d '\f' < "$nfile" > "$sn".txt
rm "$nfile"
echo "$sn"" DONE"
done
IFS=$SAVEIFS
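# Return from OCR/ to the directory of book directories, so the next relative cd "$bookdir" works
cd ../..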
echo "DONE **************************""$bookdir"
done
exit
The script creates an OCR directory in every book directory, containing one text file per PDF page (0001.txt, 0002.txt, ...).
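The renaming step in the script relies on shell parameter expansion. The sketch below isolates that step, assuming Docsplit names its per-page output doc_N.txt with an unpadded page number, as the script expects. The function name page_name is hypothetical.

```bash
# Hypothetical helper: map a Docsplit page file name (doc_N.txt)
# to the zero-padded name used in the OCR directory (000N.txt).
page_name() {
    local base="${1##*/}"     # drop any leading directory
    local num="${base#doc_}"  # strip the "doc_" prefix
    num="${num%.txt}"         # strip the ".txt" suffix
    printf '%04d.txt\n' "$num"  # pad the page number to four digits
}
```

Four digits keep the files in page order when sorted by name, for books of up to 9999 pages.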
\\
\\
Copy the book directories (e.g. openbess_TO082-00001) to the back-end server, e.g. into the /srv/data/bookforingest directory.
* **On Back-end server** with ImageMagick and pdftk
Call the script //pdfatiff.sh// with the directory that contains the book directories as its parameter:
#./pdfatiff.sh /srv/data/bookforingest
#!/bin/bash
bdir=$1
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for bookdir in $(find "$bdir/"openbess* -maxdepth 0 -type d );
do
echo "$bookdir"
n=0
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for nfile in $(find "$bookdir/"*.pdf -type f);
do
let "n += 1"
filepdf="$nfile"
done
if [ "$n" -ne 1 ]
then
echo "ERROR: PDF file is not unique"
exit 1
fi
mkdir "$bookdir""/pdfs"
cp "$filepdf" "$bookdir""/pdfs"
cd "$bookdir""/pdfs"
pdftk "$filepdf" burst output pg-%04d.pdf
n=0
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for nfile in $(find pg-*.pdf -type f);
do
let "n += 1"
sn=$(printf "%04d" $n)
filepdf="$nfile"
echo "$filepdf"" -> ""$sn.tif"
pdftk "$filepdf" output "temp.pdf"
# For PDF from image
# convert -density 150 "temp.pdf" "$sn.tif"
# For PDF from Word
convert -background white -flatten -density 600 -resize 1200 -border 0.5% -bordercolor LightGray "temp.pdf" "../""$sn.tif"
rm "temp.pdf"
done
# Return to the directory the loop started from instead of a hard-coded path
cd "$OLDPWD"
rm -R "$bookdir""/pdfs"
done
exit
The script creates one TIFF file per PDF page (0001.tif, 0002.tif, ...) in every book directory.
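Since both scripts produce one file per PDF page, the number of NNNN.tif files in a book directory should equal the number of NNNN.txt files in its OCR subdirectory. The following is a minimal sanity check under that assumption; it is not part of the original workflow, and the function names count_files and pages_match are hypothetical.

```bash
# Hypothetical helper: count files named NNNN.<ext> directly inside a directory.
# $1 = directory, $2 = extension without the dot
count_files() {
    find "$1" -maxdepth 1 -type f -name "[0-9][0-9][0-9][0-9].$2" | wc -l | tr -d ' '
}

# Hypothetical check: succeed when a book directory has as many
# TIFF pages as OCR text pages.
pages_match() {
    local book="$1"
    [ "$(count_files "$book" tif)" -eq "$(count_files "$book/OCR" txt)" ]
}
```

For example, pages_match /srv/data/bookforingest/openbess_TO082-00001 returns a non-zero status when the two counts differ, which suggests that one of the conversion steps was interrupted.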
\\
\\
The book is now ready for ingesting.