Thanks..
I will investigate the Lucene project.
Just one more thing about why the ability to handle PDF is so high on my wish list.
When I make a new design of a manifold, for example, I often download data sheets for cartridges, valves, etc. They are nearly 100% PDF. The big distributors of hydraulic components have such complex web pages that it is nearly impossible to find what you are looking for. Even the salesmen don't use them; they search their catalogs instead.
Take for example Parker Hydraulics:
http://www.parker.com/portal/site/PARKER/menuitem.b90576e27a4d71ae1bfcc510237ad1ca/?vgnextoid=c38888b5bd16e010VgnVCM1000000308a8c0RCRD&vgnextfmt=default.
Try to find a data sheet for the solenoid of a direct-acting on/off valve, size NG6.
So when you have found it you don't want to lose it. It always comes with some cryptic file name based on something I don't understand, so to find it again you must rename it with some description of its content.
Anyway, some tips/suggestions for our other readers (if we have any)...
If you are only interested in a method to import text from your word processor (Word/Writer), you could get some inspiration from this link.
http://www.dokuwiki.org/tips:doc_to_wiki_syntax?&#dokuwiki__top
It's about OpenOffice in headless mode.
But if you are a newbie like me and don't want to spend several hours on how to install it, why it won't work, and how to fix it, continue reading.
If you only have text and only a few docs, the easiest way is with a macro. I would suggest you look at WriterTools for Open/LibreOffice.
http://code.google.com/p/writertools/
But if you have embedded pictures....
In this link http://www.dokuwiki.org/tips:openofficemacro you get the feeling that you could edit this http://www.ooowiki.de/Writer2DokuWiki and get the picture handling working. That isn't the case. The script will not export the pictures from your document; you will only get a link to a picture named something like grafik1.png, and you won't find that file if you look inside your .odt file with your archive manager (Windows users: use WinRAR).
So instead, maybe the following approach could work:
* Drop your document into a folder.
* Run a script that fixes the file name (lowercase, spaces removed). The script then makes a new folder based on the new file name and copies the file to that location.
* With this batch converter macro http://oooconv.free.fr/batchconv/batchconv_en.html you could traverse the folder tree and save each document as HTML.
* Copy the folder tree into your folder for import to DW. The best location would be under data/media/your import folder.
* Run another script that traverses that folder, runs each HTML file through the tool html2wiki, and saves the resulting .txt file under data/pages/docs.
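The second step above (fix the file name, make a folder, copy the file) could be sketched roughly like this; the function name and the paths are made up, and the tr/cp calls follow the same pattern as the pdf script further down:

```shell
#!/bin/bash
# Rough sketch of the rename-and-copy step (hypothetical helper name).
# Lowercases the file name, turns spaces into underscores, creates a
# folder named after the file (extension stripped) and copies it there.
normalize_and_copy () {
    local file=$1
    local dir old new
    dir=$(dirname "$file")
    old=$(basename "$file")
    # lowercase + spaces -> underscores
    new=$(echo "$old" | tr 'A-Z' 'a-z' | tr ' ' '_')
    # folder named after the file, extension chopped off
    mkdir -p "$dir/${new%%.*}"
    cp "$file" "$dir/${new%%.*}/$new"
    # print the new location so a caller can pick it up
    echo "$dir/${new%%.*}/$new"
}
```

So a file "My Doc.odt" dropped into /tmp/import would end up as /tmp/import/my_doc/my_doc.odt.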
OpenOffice will not produce strange HTML the way pdftohtml does, so I think it will be okay; I haven't tested it myself yet.
To write those two scripts you could, with little effort, tweak my script below.
It's useful to use bash's built-in execution trace, which is activated with set -x.
Or you could use
http://bashdb.sourceforge.net/
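As a tiny illustration (not part of the import script), everything between set -x and set +x is echoed to stderr with its variables expanded before it runs; the file name here is made up:

```shell
#!/bin/bash
# Toy example of bash's execution trace
name="My Valve.pdf"
set -x                                  # start tracing to stderr
lower=$(echo "$name" | tr 'A-Z' 'a-z')  # this line is shown expanded
set +x                                  # stop tracing
echo "$lower"
```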
#!/bin/bash
# file pdfimport.sh
# Convert pdf to wiki syntax
# Take a pdf file, romanize its name and make a dir based on the pdf file name.
# Execute pdftohtml and html2wiki on the pdf file
# Fix image url to dokuwiki syntax
# Replace the \\ with linefeed
#####################################
# LOGGFILE: full path and file name, e.g. /var/www/dokuwiki/data/pages/config/logg.txt
# Remember dir must end with /
LOGGFILE=/var/www/dokuwiki/data/pages/logg.txt
IMPORT_ROOT=/var/www/dokuwiki/data/media/pdf_import/
PAGES_DIR=/var/www/dokuwiki/data/pages/pdf_imported/
DOKUWIKI_MEDIA_ROOT=/var/www/dokuwiki/data/media/
# pdftohtml -c is for "complex" output
PDFTOHTML="/var/www/pdftohtml -c -noframes -enc UTF-8"
HTML2WIKI="html2wiki --dialect DokuWiki"
# Set to true for a config check (dirs exist and pdftohtml and html2wiki can be executed)
RUN_CFG_CHK=true
clear
#set -x
cfgchk () {
# checking if I can write to $LOGGFILE, $IMPORT_ROOT and $PAGES_DIR ..
if [ ! -w "$LOGGFILE" ]; then echo "<H1>Can't write to loggfile"; exit 1; fi
if [ ! -w "$IMPORT_ROOT" ]; then echo "<H1>Can't write to $IMPORT_ROOT EXIT.."; exit 1; fi
if [ ! -w "$PAGES_DIR" ]; then echo "<H1>Can't write to $PAGES_DIR EXIT... "; exit 1; fi
#if [ "$IMPORT_ROOT" != "$DOKUWIKI_MEDIA_ROOT" ]; then echo "<H1>I can't find $DOKUWIKI_MEDIA_ROOT inside $IMPORT_ROOT EXIT.. "; exit 1; fi
# checking if pdftohtml and html2wiki can be executed ...
# (command -v must only get the first word, i.e. the program name without its options)
command -v "${PDFTOHTML%% *}" >/dev/null || { echo "<H1>Can't execute $PDFTOHTML EXIT.."; exit 1;}
command -v "${HTML2WIKI%% *}" >/dev/null || { echo "<H1>Can't execute $HTML2WIKI EXIT.."; exit 1;}
echo 'You can turn the cfgchk off now, e.g. RUN_CFG_CHK=false' >> "$LOGGFILE"
}
# Comment out the row below or set RUN_CFG_CHK=false after you have checked that everything is okay
if $RUN_CFG_CHK ; then cfgchk; fi
# loggfile: `basename $0` gives the running script's name
# writing the script name plus date and time to the logg
echo "==== Script \"`basename $0`\" is started ... $(date -u) ====" >> "$LOGGFILE"
# Main loop
find "$IMPORT_ROOT" -maxdepth 1 -iname '*.pdf' | while IFS= read -r file;
do
# instead of string chopping you could use basename and dirname
oldfilename=$(basename "$file")
# write to logg what we have found (only .pdf files )
echo "== Found file ... \"${file##*/}\" ... ==" >> "$LOGGFILE"
# Romanize the file name
newfilename=$(echo "$oldfilename" | tr 'A-Z' 'a-z' | tr ' ' '_' |
tr 'Å' 'å' | tr 'Ä' 'ä' | tr 'Ö' 'ö' |
sed 's/_-_/-/g');
if [ "$newfilename" != "$oldfilename" ]; then
# Write the new name to the logg
# (note: echo -n "text" >> file.txt appends "text" at the END of file.txt, with no line feed)
# the sed line below would instead append the text to the logg's last line; your sed implementation must support the -i option
# sed -i.bck '$s/$/After romanize...... '"$newfilename"'/ ' "$LOGGFILE"
echo "== After romanize .... $newfilename" >> "$LOGGFILE"
fi
# make dir based on newfilename
newdir="$IMPORT_ROOT""${newfilename%%.*}"
# create new dir
mkdir -p "$newdir"
echo "== Have created a new dir .. $newdir ==" >> "$LOGGFILE"
# copy the file to the new dir with the new name (new if romanized)
cp "$IMPORT_ROOT$oldfilename" "$newdir/$newfilename"
echo "== Copied $IMPORT_ROOT$oldfilename to $newdir/$newfilename ==" >> "$LOGGFILE"
# remove the old file
rm "$IMPORT_ROOT$oldfilename"
echo "== Removed $IMPORT_ROOT$oldfilename ==" >> "$LOGGFILE"
# convert from pdf to html using the -c option (complex)
# note that -enc is not shown in the help for pdftohtml; it is documented in xpdf
# Exec pdftohtml and write its output to $LOGGFILE
$PDFTOHTML "$newdir/$newfilename" "$newdir/${newfilename%%.*}.html" >> "$LOGGFILE"
# Convert html to wiki syntax and fix the right path for pictures
# first create a dummy url so we can find and replace that with correct path to pictures in wiki syntax
dummy=/dummy/
# Convert to wiki syntax in the same dir as rominized pdf file
echo "$HTML2WIKI --base-uri /dummy/ $newdir/${newfilename%%.*}.html $newdir/${newfilename%%.*}.txt" >> "$LOGGFILE"
$HTML2WIKI --base-uri /dummy/ "$newdir/${newfilename%%.*}.html" > "$newdir/${newfilename%%.*}.txt"
# We take $IMPORT_ROOT and chop off $DOKUWIKI_MEDIA_ROOT and add $newdir with the $IMPORT_ROOT chopped off
mediapath="${IMPORT_ROOT##"$DOKUWIKI_MEDIA_ROOT"}${newdir##"$IMPORT_ROOT"}"/
# Replacing all the / with wiki's :
mediapath=":$(echo "$mediapath" | tr '/' ':')"
echo "$mediapath"
# Replacing the dummy string "/dummy/" with the correct wiki syntax path -> $mediapath in the .txt file
# sed -i (-i in place= "on the spot")
sed -i "s/\/dummy\//$mediapath/g" "$newdir/${newfilename%%.*}".txt
# Replace the literal \\ with a line feed (the \n in the replacement needs GNU sed)
sed -i 's/\\\\/\n/g' "$newdir/${newfilename%%.*}".txt
cp "$newdir/${newfilename%%.*}".txt "$PAGES_DIR/${newfilename%%.*}".txt
done
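To see what the two sed substitutions at the end do, here they are in isolation on a made-up one-line sample of html2wiki output (the media path and file name are invented; as above, the \n replacement needs GNU sed):

```shell
#!/bin/bash
# Made-up sample line roughly as html2wiki might emit it
txt='{{/dummy/grafik1.png}}First line\\Second line'
mediapath=':pdf_import:my_file:'
# swap the dummy path for the wiki media namespace
txt=$(echo "$txt" | sed "s/\/dummy\//$mediapath/g")
# turn the literal \\ into a real line feed (GNU sed)
txt=$(echo "$txt" | sed 's/\\\\/\n/g')
echo "$txt"
```

After both substitutions the sample becomes a proper picture link {{:pdf_import:my_file:grafik1.png}} followed by the two text lines on separate rows.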