PDF extract text [message #185508]
Mon, 07 April 2014 04:53
Philipp Kraus
Hello,
how can I extract text, images and other structures can be ignored,
with PHP from a PDF file?
We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
extract only the text content
to create a text analysis of the content eg for LaTeX scripts we would
like the chapter structure as well.
Is there any solution to do this with build-in PHP functions?
Thanks
Phil

Re: PDF extract text [message #185511 is a reply to message #185508]
Mon, 07 April 2014 11:44
Thomas 'PointedEars' Lahn
Philipp Kraus wrote:
> how can I extract text, images and other structures can be ignored,
> with PHP from a PDF file?
For example with “PDF Parser”. You cannot have searched before posting; it
took me less than a minute to find that out with the Google keywords “pdf
php read”.
<http://www.catb.org/~esr/faqs/smart-questions.html>
> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
> extract only the text content
> to create a text analysis of the content eg for LaTeX scripts we would
> like the chapter structure as well.
PDF files generated with pdflatex usually contain that as TOC metadata.
> Is there any solution to do this with build-in PHP functions?
                                            ^t
No.
--
PointedEars
Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.

Re: PDF extract text [message #185515 is a reply to message #185508]
Mon, 07 April 2014 15:53
Michael Vilain
In article <lhtavi$dh$1(at)online(dot)de>,
Philipp Kraus <philipp(dot)kraus(at)flashpixx(dot)de> wrote:
> Hello,
>
> how can I extract text, images and other structures can be ignored,
> with PHP from a PDF file?
> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
> extract only the text content
> to create a text analysis of the content eg for LaTeX scripts we would
> like the chapter structure as well.
>
> Is there any solution to do this with build-in PHP functions?
>
> Thanks
>
> Phil
I tried a bunch of stuff to read some bank statements that were in PDF
format so I could import them via CSV. Didn't work out so well. Adobe's
OCR feature only works if the PDFs are unlocked to allow it. I found an
application that would do that but the OCRed text was unusable.
So, my question is "what's generating the PDF files?" Can you get
whomever to do it in text or some other format? If they're encrypted
images, then you've got a lot of work to do in order to get some output.
Maybe.
Good luck with this...
--
DeeDee, don't press that button! DeeDee! NO! Dee...
[I filter all Goggle Groups posts, so any reply may be automatically ignored]

Re: PDF extract text [message #185516 is a reply to message #185515]
Mon, 07 April 2014 17:38
Thomas 'PointedEars' Lahn
Michael Vilain wrote:
> Philipp Kraus <philipp(dot)kraus(at)flashpixx(dot)de> wrote:
>> how can I extract text, images and other structures can be ignored,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> with PHP from a PDF file?
>> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
                    ^^^^^          ^^^^^^^^^^
>> extract only the text content
>> to create a text analysis of the content eg for LaTeX scripts we would
>> like the chapter structure as well.
>>
>> Is there any solution to do this with build-in PHP functions?
>
> I tried a bunch of stuff to read some bank statements that were in PDF
> format so I could import them via CSV. Didn't work out so well. Adobe's
> OCR feature only works if the PDFs are unlocked to allow it. I found an
> application that would do that but the OCRed text was unusable.
>
> So, my question is "what's generating the PDF files?"
The ability to read can be of advantage sometimes …
> Can you get whomever to do it in text or some other format?
OMG. One can leave it to you to give the worst possible technical advice.
> If they're encrypted images, then you've got a lot of work to do in order
> to get some output. Maybe.
Nobody but you is talking about images and OCR. You really don't have a
clue what PDF is, do you?
--
PointedEars
Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.

Re: PDF extract text [message #185517 is a reply to message #185511]
Mon, 07 April 2014 19:52
Christoph Michael Becker
Thomas 'PointedEars' Lahn wrote:
> Philipp Kraus wrote:
>
>> Is there any solution to do this with build-in PHP functions?
>                                            ^t
> No.
Well, there may not be a solution to do this with built-in PHP functions
(whatever a built-in PHP function might be; actually (almost) all PHP
functions are part of an extension), but at least *theoretically* it
would be possible by processing the PDF file "bytewise". (The PDF
specification is available online for free.)
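To make the "bytewise" idea concrete, here is a minimal sketch, not a real PDF parser. It assumes an unencrypted file, content streams that are either plain or FlateDecode-compressed, literal (...) strings, and fonts whose character codes happen to match ASCII. It needs only the zlib and PCRE extensions. All function names, and the -200 word-gap threshold, are made up for illustration.

```php
<?php
// Naive "bytewise" text extraction: a sketch under the assumptions above.

const WORD_GAP = -200.0; // assumed TJ kerning value that counts as a word space

// A literal PDF string without unescaped nested parentheses.
const PDF_STRING = '\((?:\\\\.|[^\\\\()])*\)';

// Return the raw bytes between every "stream" and "endstream" keyword.
function pdfStreams(string $pdf): array
{
    $streams = [];
    $pos = 0;
    while (($start = strpos($pdf, 'stream', $pos)) !== false) {
        $start += 6;
        // The keyword is followed by CRLF or LF before the data begins.
        if (substr($pdf, $start, 2) === "\r\n") {
            $start += 2;
        } elseif (substr($pdf, $start, 1) === "\n") {
            $start += 1;
        }
        $end = strpos($pdf, 'endstream', $start);
        if ($end === false) {
            break;
        }
        $streams[] = substr($pdf, $start, $end - $start);
        $pos = $end + 9; // skip past "endstream"
    }
    return $streams;
}

// Try zlib inflation (FlateDecode); fall back to the raw bytes.
function inflateOrRaw(string $body): string
{
    // The data may be followed by an end-of-line marker of one or two bytes.
    foreach ([0, 1, 2] as $trim) {
        $candidate = $trim ? substr($body, 0, -$trim) : $body;
        $data = @gzuncompress($candidate);
        if ($data !== false) {
            return $data;
        }
    }
    return $body;
}

// Resolve backslash escapes inside a literal string, parentheses included.
function unescapePdfString(string $literal): string
{
    $map = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\f"];
    return preg_replace_callback(
        '/\\\\([0-7]{1,3}|\r\n|.)/s',
        static function (array $m) use ($map): string {
            $e = $m[1];
            if (strspn($e, '01234567') === strlen($e)) {
                return chr(octdec($e) & 0xFF);   // octal character code
            }
            if ($e === "\n" || $e === "\r" || $e === "\r\n") {
                return '';                        // line continuation
            }
            return $map[$e] ?? $e;
        },
        substr($literal, 1, -1)
    );
}

// Collect the operands of the Tj and TJ text-showing operators.
function textFromContentStream(string $content): string
{
    $pattern = '/\[((?:' . PDF_STRING . '|[^\]()])*)\]\s*TJ|(' . PDF_STRING . ')\s*Tj/s';
    if (!preg_match_all($pattern, $content, $matches, PREG_SET_ORDER)) {
        return '';
    }
    $pieces = [];
    foreach ($matches as $m) {
        if (isset($m[2])) {                       // (string) Tj
            $pieces[] = unescapePdfString($m[2]);
            continue;
        }
        $line = '';                               // [ (a) -333 (b) ] TJ
        preg_match_all('/(' . PDF_STRING . ')|(-?\d*\.?\d+)/s', $m[1], $tokens, PREG_SET_ORDER);
        foreach ($tokens as $t) {
            if (isset($t[2])) {
                if ((float) $t[2] < WORD_GAP) {
                    $line .= ' ';                 // large gap: treat as a space
                }
            } else {
                $line .= unescapePdfString($t[1]);
            }
        }
        $pieces[] = $line;
    }
    return implode("\n", $pieces);
}

function extractPdfTextNaive(string $pdf): string
{
    $parts = [];
    foreach (pdfStreams($pdf) as $body) {
        $text = textFromContentStream(inflateOrRaw($body));
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode("\n", $parts);
}
```

It would be called as extractPdfTextNaive(file_get_contents('script.pdf')), where the file name is a placeholder. Expect it to return little or nothing for files that use hex strings, object streams, encryption, or re-encoded and subsetted fonts. Those are the cases where reading the specification or using an existing parser library becomes unavoidable.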
--
Christoph M. Becker

Re: PDF extract text [message #185518 is a reply to message #185517]
Mon, 07 April 2014 20:12
Thomas 'PointedEars' Lahn
Christoph Michael Becker wrote:
> Thomas 'PointedEars' Lahn wrote:
>> Philipp Kraus wrote:
>>> Is there any solution to do this with build-in PHP functions?
>>                                            ^t
>> No.
>
> Well, there may not be a solution to do this with built-in PHP functions
> (whatever a built-in PHP function might be; actually (almost) all PHP
> functions are part of an extension), but at least *theoretically* it
> would be possible by processing the PDF file "bytewise". (The PDF
> specification is available online for free.)
*rolls eyes*
*bags collected trolls’ eyes*
--
PointedEars
Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.