Thanks..
I will investigate the Lucene project.
Just one more thing about why the ability to handle PDF is so high on my wish list.
When I make a new design of a manifold, for example, I often download data sheets for cartridges, valves, etc. They are nearly 100% PDF. The big distributors of hydraulic components have such complex web pages that it is nearly impossible to find what you are looking for. Even the salesmen don't use them; they search their catalogs instead.
Take for example Parker Hydraulics:
http://www.parker.com/portal/site/PARKER/menuitem.b90576e27a4d71ae1bfcc510237ad1ca/?vgnextoid=c38888b5bd16e010VgnVCM1000000308a8c0RCRD&vgnextfmt=default.
Try to find a data sheet for the solenoid of a direct-acting on/off valve, size NG6.
So when you have found it you don't want to lose it. It always comes with some cryptic file name based on something I don't understand, so to find it again you must rename it with some description of its content.
Anyway, some tips/suggestions for our other readers (if we have any)...
If you are only interested in a method to import text from your word processor (Word/Writer), you could get some inspiration from this link.
http://www.dokuwiki.org/tips:doc_to_wiki_syntax?&#dokuwiki__top
It's about OpenOffice in headless mode.
But if you are a newbie like me and don't want to spend several hours on how to install it, why it won't work, and how to fix it, continue reading.
If you only have text and only a few docs, the easiest way is with a macro. I would suggest you look at WriterTools for Open/LibreOffice.
http://code.google.com/p/writertools/
But if you have embedded pictures....
In this link http://www.dokuwiki.org/tips:openofficemacro you get the feeling that you could edit this http://www.ooowiki.de/Writer2DokuWiki and get the picture handling working. That isn't the case. The script will not export the pictures from your document; you will only get a link to a picture named something like grafik1.png, and you won't find that file if you look inside your .odt file with your archive manager (Windows users: use WinRAR).
So instead, maybe the following approach could work:
* Drop your document into a folder.
* Run a script that fixes the file name (lowercase, spaces removed). The script then makes a new folder based on the new file name and copies the file to that location.
* With this batch converter macro http://oooconv.free.fr/batchconv/batchconv_en.html you could traverse the folder tree and save each document as HTML.
* Copy the folder tree into your folder for import to DW. The best location would be under data/media/your import folder.
* Run another script that traverses that folder, runs each HTML file through the tool html2wiki, and saves the resulting .txt file under data/pages/docs.
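The second step above (fix the file name, make a folder, copy the file) could be sketched roughly like this; the function name and the paths are made up, and the tr/cp calls follow the same pattern as the pdf script further down:

```shell
#!/bin/bash
# Rough sketch of the rename-and-copy step (hypothetical helper name).
# Lowercases the file name, turns spaces into underscores, creates a
# folder named after the file (extension stripped) and copies it there.
normalize_and_copy () {
    local file=$1
    local dir old new
    dir=$(dirname "$file")
    old=$(basename "$file")
    # lowercase + spaces -> underscores
    new=$(echo "$old" | tr 'A-Z' 'a-z' | tr ' ' '_')
    # folder named after the file, extension chopped off
    mkdir -p "$dir/${new%%.*}"
    cp "$file" "$dir/${new%%.*}/$new"
    # print the new location so a caller can pick it up
    echo "$dir/${new%%.*}/$new"
}
```

So a file "My Doc.odt" dropped into /tmp/import would end up as /tmp/import/my_doc/my_doc.odt.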
OpenOffice will not produce strange HTML the way pdftohtml does, so I think it will be okay; I haven't tested it myself yet.
To write those two scripts you could, with little effort, tweak my script below.
It's useful to use bash's built-in execution trace, which is activated with set -x.
Or you could use
http://bashdb.sourceforge.net/
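As a tiny illustration (not part of the import script), everything between set -x and set +x is echoed to stderr with its variables expanded before it runs; the file name here is made up:

```shell
#!/bin/bash
# Toy example of bash's execution trace
name="My Valve.pdf"
set -x                                  # start tracing to stderr
lower=$(echo "$name" | tr 'A-Z' 'a-z')  # this line is shown expanded
set +x                                  # stop tracing
echo "$lower"
```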
#!/bin/bash
# file pdfimport.sh
# Convert pdf to wiki syntax
# Take a pdf file, romanize its name and make a dir based on the pdf file name.
# Execute pdftohtml and html2wiki on the pdf file
# Fix image url to dokuwiki syntax
# Replace the \\ with linefeed
#####################################
# LOGGFILE: full path and file name, e.g. /var/www/dokuwiki/data/pages/config/logg.txt
# Remember dir must end with /
LOGGFILE=/var/www/dokuwiki/data/pages/logg.txt
IMPORT_ROOT=/var/www/dokuwiki/data/media/pdf_import/
PAGES_DIR=/var/www/dokuwiki/data/pages/pdf_imported/
DOKUWIKI_MEDIA_ROOT=/var/www/dokuwiki/data/media/
# pdftohtml -c is for "complex" output
PDFTOHTML="/var/www/pdftohtml -c -noframes -enc UTF-8"
HTML2WIKI="html2wiki --dialect DokuWiki"
# Set to true for a config check (dirs exist and pdftohtml and html2wiki can be executed)
RUN_CFG_CHK=true
clear
#set -x
cfgchk () {
# checking if I can write to $LOGGFILE, $IMPORT_ROOT and $PAGES_DIR ..
if [ ! -w "$LOGGFILE" ]; then echo "<H1>Can't write to loggfile"; exit 1; fi
if [ ! -w "$IMPORT_ROOT" ]; then echo "<H1>Can't write to $IMPORT_ROOT EXIT.."; exit 1; fi
if [ ! -w "$PAGES_DIR" ]; then echo "<H1>Can't write to $PAGES_DIR EXIT... "; exit 1; fi
#if [ "$IMPORT_ROOT" != "$DOKUWIKI_MEDIA_ROOT" ]; then echo "<H1>I can't find $DOKUWIKI_MEDIA_ROOT inside $IMPORT_ROOT EXIT.. "; exit 1; fi
# checking if pdftohtml and html2wiki can be executed ...
# (command -v must only get the first word, i.e. the program name without its options)
command -v "${PDFTOHTML%% *}" >/dev/null || { echo "<H1>Can't execute $PDFTOHTML EXIT.."; exit 1;}
command -v "${HTML2WIKI%% *}" >/dev/null || { echo "<H1>Can't execute $HTML2WIKI EXIT.."; exit 1;}
echo 'You can turn the cfgchk off now, e.g. RUN_CFG_CHK=false' >> "$LOGGFILE"
}
# Comment out the row below or set RUN_CFG_CHK=false after you have checked that everything is okay
if $RUN_CFG_CHK ; then cfgchk; fi
# loggfile: `basename $0` gives the running script's name
# writing the script name plus date and time to the logg
echo "==== Script \"`basename $0`\" is started ... $(date -u) ====" >> "$LOGGFILE"
# Main loop
find "$IMPORT_ROOT" -maxdepth 1 -iname '*.pdf' | while IFS= read -r file;
do
# instead of string chopping you could use basename and dirname
oldfilename=$(basename "$file")
# write to logg what we have found (only .pdf files )
echo "== Found file ... \"${file##*/}\" ... ==" >> "$LOGGFILE"
# Romanize the file name
newfilename=$(echo "$oldfilename" | tr 'A-Z' 'a-z' | tr ' ' '_' |
tr 'Å' 'å' | tr 'Ä' 'ä' | tr 'Ö' 'ö' |
sed 's/_-_/-/g');
if [ "$newfilename" != "$oldfilename" ]; then
# Write the new name to the logg
# (note: echo -n "text" >> file.txt appends "text" at the END of file.txt, with no line feed)
# the sed line below would instead append the text to the logg's last line; your sed implementation must support the -i option
# sed -i.bck '$s/$/After romanize...... '"$newfilename"'/ ' "$LOGGFILE"
echo "== After romanize .... $newfilename" >> "$LOGGFILE"
fi
# make dir based on newfilename
newdir="$IMPORT_ROOT""${newfilename%%.*}"
# create new dir
mkdir -p "$newdir"
echo "== Have created a new dir .. $newdir ==" >> "$LOGGFILE"
# copy the file to the new dir with the new name (new if romanized)
cp "$IMPORT_ROOT$oldfilename" "$newdir/$newfilename"
echo "== Copied $IMPORT_ROOT$oldfilename to $newdir/$newfilename ==" >> "$LOGGFILE"
# remove the old file
rm "$IMPORT_ROOT$oldfilename"
echo "== Removed $IMPORT_ROOT$oldfilename ==" >> "$LOGGFILE"
# convert from pdf to html using the -c option (complex)
# note that -enc is not shown in the help for pdftohtml; it is documented in xpdf
# Exec pdftohtml and write its output to $LOGGFILE
$PDFTOHTML "$newdir/$newfilename" "$newdir/${newfilename%%.*}.html" >> "$LOGGFILE"
# Convert html to wiki syntax and fix the right path for pictures
# first create a dummy url so we can find and replace that with correct path to pictures in wiki syntax
dummy=/dummy/
# Convert to wiki syntax in the same dir as rominized pdf file
echo "$HTML2WIKI --base-uri /dummy/ $newdir/${newfilename%%.*}.html $newdir/${newfilename%%.*}.txt" >> "$LOGGFILE"
$HTML2WIKI --base-uri /dummy/ "$newdir/${newfilename%%.*}.html" > "$newdir/${newfilename%%.*}.txt"
# We take $IMPORT_ROOT and chop off $DOKUWIKI_MEDIA_ROOT and add $newdir with the $IMPORT_ROOT chopped off
mediapath="${IMPORT_ROOT##"$DOKUWIKI_MEDIA_ROOT"}${newdir##"$IMPORT_ROOT"}"/
# Replacing all the / with wiki's :
mediapath=":$(echo "$mediapath" | tr '/' ':')"
echo "$mediapath"
# Replacing the dummy string "/dummy/" with the correct wiki syntax path -> $mediapath in the .txt file
# sed -i (-i in place= "on the spot")
sed -i "s/\/dummy\//$mediapath/g" "$newdir/${newfilename%%.*}".txt
# Replace the literal \\ with a line feed (the \n in the replacement needs GNU sed)
sed -i 's/\\\\/\n/g' "$newdir/${newfilename%%.*}".txt
cp "$newdir/${newfilename%%.*}".txt "$PAGES_DIR/${newfilename%%.*}".txt
done
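To see what the two sed substitutions at the end do, here they are in isolation on a made-up one-line sample of html2wiki output (the media path and file name are invented; as above, the \n replacement needs GNU sed):

```shell
#!/bin/bash
# Made-up sample line roughly as html2wiki might emit it
txt='{{/dummy/grafik1.png}}First line\\Second line'
mediapath=':pdf_import:my_file:'
# swap the dummy path for the wiki media namespace
txt=$(echo "$txt" | sed "s/\/dummy\//$mediapath/g")
# turn the literal \\ into a real line feed (GNU sed)
txt=$(echo "$txt" | sed 's/\\\\/\n/g')
echo "$txt"
```

After both substitutions the sample becomes a proper picture link {{:pdf_import:my_file:grafik1.png}} followed by the two text lines on separate rows.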