FUDforum: comp.lang.php » PDF extract text

PDF extract text [message #185508]

Mon, 07 April 2014 04:53

Philipp Kraus
Messages: 14
Registered: December 2010

Karma: 0

Junior Member

Hello,

how can I extract text, images and other structures can be ignored,
with PHP from a PDF file?
We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
extract only the text content
to create a text analysis of the content eg for LaTeX scripts we would
like the chapter structure as well.

Is there any solution to do this with build-in PHP functions?

Thanks

Phil

Re: PDF extract text [message #185511 is a reply to message #185508]

Mon, 07 April 2014 11:44

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Philipp Kraus wrote:

> how can I extract text, images and other structures can be ignored,
> with PHP from a PDF file?

For example with “PDF Parser”. You cannot have searched before posting; it
took me less than a minute to find that out with the Google keywords “pdf
php read”.

<http://www.catb.org/~esr/faqs/smart-questions.html>
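For readers who land here later, a minimal sketch of what using such a library looks like. It assumes that “PDF Parser” means the Composer package smalot/pdfparser (the post does not say which project is meant) and that Composer's autoloader has already been loaded. The helper name pdfToText() is made up for this example.

```php
<?php
declare(strict_types=1);

// Assumption: the library was installed with `composer require smalot/pdfparser`
// and the caller has already required Composer's vendor/autoload.php.

/**
 * Returns the text of all pages of a PDF file as one string.
 * pdfToText() is a hypothetical helper name, not part of the library.
 */
function pdfToText(string $path): string
{
    if (!class_exists(\Smalot\PdfParser\Parser::class)) {
        throw new RuntimeException('smalot/pdfparser is not installed');
    }

    $parser = new \Smalot\PdfParser\Parser();
    $document = $parser->parseFile($path); // reads and parses the whole file

    return $document->getText();
}
```

Called as `echo pdfToText('script.pdf');` this prints the plain text. For per-page output, `$document->getPages()` returns page objects that each offer `getText()`. Images are simply not part of the result, which matches the original request.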

> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
> extract only the text content
> to create a text analysis of the content eg for LaTeX scripts we would
> like the chapter structure as well.

PDF files generated with pdflatex usually contain that as TOC metadata.

> Is there any solution to do this with build-in PHP functions?
                                            ^t
No.

--
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.

Re: PDF extract text [message #185515 is a reply to message #185508]

Mon, 07 April 2014 15:53

Michael Vilain
Messages: 88
Registered: September 2010

Karma: 0

Member

In article <lhtavi$dh$1(at)online(dot)de>,
Philipp Kraus <philipp(dot)kraus(at)flashpixx(dot)de> wrote:

> Hello,
>
> how can I extract text, images and other structures can be ignored,
> with PHP from a PDF file?
> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
> extract only the text content
> to create a text analysis of the content eg for LaTeX scripts we would
> like the chapter structure as well.
>
> Is there any solution to do this with build-in PHP functions?
>
> Thanks
>
> Phil

I tried a bunch of stuff to read some bank statements that were in PDF
format so I could import them via CSV. Didn't work out so well. Adobe's
OCR feature only works if the PDFs are unlocked to allow it. I found an
application that would do that but the OCRed text was unusable.

So, my question is "what's generating the PDF files?" Can you get
whomever to do it in text or some other format? If they're encrypted
images, then you've got a lot of work to do in order to get some output.
Maybe.

Good luck with this...

--
DeeDee, don't press that button! DeeDee! NO! Dee...
[I filter all Goggle Groups posts, so any reply may be automatically ignored]

Re: PDF extract text [message #185516 is a reply to message #185515]

Mon, 07 April 2014 17:38

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Michael Vilain wrote:

> Philipp Kraus <philipp(dot)kraus(at)flashpixx(dot)de> wrote:
>> how can I extract text, images and other structures can be ignored,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> with PHP from a PDF file?
>> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
                    ^^^^^          ^^^^^^^^^^
>> extract only the text content
>> to create a text analysis of the content eg for LaTeX scripts we would
>> like the chapter structure as well.
>>
>> Is there any solution to do this with build-in PHP functions?
>
> I tried a bunch of stuff to read some bank statements that were in PDF
> format so I could import them via CSV. Didn't work out so well. Adobe's
> OCR feature only works if the PDFs are unlocked to allow it. I found an
> application that would do that but the OCRed text was unusable.
>
> So, my question is "what's generating the PDF files?"

The ability to read can be of advantage sometimes …

> Can you get whomever to do it in text or some other format?

OMG. One can leave it to you to give the worst possible technical advice.

> If they're encrypted images, then you've got a lot of work to do in order
> to get some output. Maybe.

Nobody but you is talking about images and OCR. You really don't have a
clue what PDF is, do you?

--
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.

Re: PDF extract text [message #185517 is a reply to message #185511]

Mon, 07 April 2014 19:52

Christoph Michael Becker
Messages: 207
Registered: June 2013

Karma: 0

Senior Member

Thomas 'PointedEars' Lahn wrote:

> Philipp Kraus wrote:
>
>> Is there any solution to do this with build-in PHP functions?
>                                            ^t
> No.

Well, there may not be a solution to do this with built-in PHP functions
(whatever a built-in PHP function might be; actually (almost) all PHP
functions are part of an extension), but at least *theoretically* it
would be possible by processing the PDF file "bytewise". (The PDF
specification is available online for free.)
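To make the "bytewise" idea concrete, here is a rough sketch, not a general solution. It assumes a small, unencrypted PDF whose content streams are either unfiltered or FlateDecode-compressed, whose text is written as literal strings `(...)` inside BT ... ET text objects, and whose fonts use an ASCII-compatible encoding. Hex strings, unescaped nested parentheses, object streams and custom font encodings are ignored, and word gaps that a generator expresses only as kerning offsets are lost. The function names are invented for this example, and the zlib extension is required for gzuncompress().

```php
<?php
declare(strict_types=1);

/** Decodes the backslash escapes of a PDF literal string. */
function unescapePdfString(string $s): string
{
    $named = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\f"];
    $decoded = preg_replace_callback(
        '/\\\\([0-7]{1,3}|\r?\n|.)/s',
        static function (array $m) use ($named): string {
            if (preg_match('/^[0-7]+$/', $m[1]) === 1) {
                return chr(octdec($m[1]) & 0xFF); // \ddd is an octal character code
            }
            if ($m[1] === "\n" || $m[1] === "\r\n") {
                return '';                         // backslash + EOL joins two lines
            }
            return $named[$m[1]] ?? $m[1];         // \n, \t, ... or \( \) \\
        },
        $s
    );

    return $decoded ?? $s;
}

/** Inflates a FlateDecode stream body; returns it unchanged if it is not zlib data. */
function inflateStream(string $raw): string
{
    // The EOL before "endstream" is not part of the data, but the last data
    // byte may itself be CR or LF, so try the plausible cut points in turn.
    foreach (["\n", "\r\n", "\r", ''] as $eol) {
        $body = ($eol !== '' && substr($raw, -strlen($eol)) === $eol)
            ? substr($raw, 0, -strlen($eol))
            : $raw;
        $data = @gzuncompress($body);
        if ($data !== false) {
            return $data;
        }
    }

    return $raw; // assumption: an unfiltered content stream
}

/** Collects the literal strings of every BT ... ET text object, one line per object. */
function extractPdfText(string $pdf): string
{
    $lines = [];
    preg_match_all('/\bstream\r?\n(.*?)endstream/s', $pdf, $streams);
    foreach ($streams[1] ?? [] as $raw) {
        $content = inflateStream($raw);
        preg_match_all('/\bBT\b(.*?)\bET\b/s', $content, $textObjects);
        foreach ($textObjects[1] ?? [] as $object) {
            preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)/s', $object, $strings);
            $line = implode('', array_map('unescapePdfString', $strings[1] ?? []));
            if ($line !== '') {
                $lines[] = $line;
            }
        }
    }

    return implode("\n", $lines);
}
```

Usage would be `echo extractPdfText(file_get_contents('script.pdf'));`. A real parser has to follow the cross-reference table, honour the /Filter and /Length entries of each stream and map character codes through the font encodings, which is why a maintained library is usually the saner choice.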

--
Christoph M. Becker

Re: PDF extract text [message #185518 is a reply to message #185517]

Mon, 07 April 2014 20:12

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Christoph Michael Becker wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Philipp Kraus wrote:
>>> Is there any solution to do this with build-in PHP functions?
>>                                            ^t
>> No.
>
> Well, there may not be a solution to do this with built-in PHP functions
> (whatever a built-in PHP function might be; actually (almost) all PHP
> functions are part of an extension), but at least *theoretically* it
> would be possible by processing the PDF file "bytewise". (The PDF
> specification is available online for free.)

*rolls eyes*

*bags collected trolls’ eyes*

--
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.
