PDF extract text [message #185508] Mon, 07 April 2014 00:53
Philipp Kraus
Hello,

How can I extract the text from a PDF file with PHP? Images and other
structures can be ignored.
We have a lot of LaTeX PDFs and PowerPoint PDFs and would like to
extract only the text content in order to run a text analysis on it.
For the LaTeX scripts, for example, we would like the chapter structure
as well.

Is there any solution to do this with build-in PHP functions?

Thanks

Phil
Re: PDF extract text [message #185511 is a reply to message #185508] Mon, 07 April 2014 07:44
Thomas 'PointedEars'
Philipp Kraus wrote:

> how can I extract text, images and other structures can be ignored,
> with PHP from a PDF file?

For example with “PDF Parser”. You cannot have searched before posting; it
took me less than a minute to find that out with the Google keywords “pdf
php read”.

<http://www.catb.org/~esr/faqs/smart-questions.html>

> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
> extract only the text content
> to create a text analysis of the content eg for LaTeX scripts we would
> like the chapter structure as well.

PDF files generated with pdflatex usually contain that as TOC metadata.

> Is there any solution to do this with build-in PHP functions?
                                            ^t
No.

--
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.
Re: PDF extract text [message #185515 is a reply to message #185508] Mon, 07 April 2014 11:53
Michael Vilain
In article <lhtavi$dh$1(at)online(dot)de>,
Philipp Kraus <philipp(dot)kraus(at)flashpixx(dot)de> wrote:

> Hello,
>
> how can I extract text, images and other structures can be ignored,
> with PHP from a PDF file?
> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
> extract only the text content
> to create a text analysis of the content eg for LaTeX scripts we would
> like the chapter structure as well.
>
> Is there any solution to do this with build-in PHP functions?
>
> Thanks
>
> Phil

I tried a bunch of stuff to read some bank statements that were in PDF
format so I could import them via CSV. Didn't work out so well. Adobe's
OCR feature only works if the PDFs are unlocked to allow it. I found an
application that would do that but the OCRed text was unusable.

So, my question is "what's generating the PDF files?" Can you get
whomever to do it in text or some other format? If they're encrypted
images, then you've got a lot of work to do in order to get some output.
Maybe.

Good luck with this...

--
DeeDee, don't press that button! DeeDee! NO! Dee...
[I filter all Goggle Groups posts, so any reply may be automatically ignored]
Re: PDF extract text [message #185516 is a reply to message #185515] Mon, 07 April 2014 13:38
Thomas 'PointedEars'
Michael Vilain wrote:

> Philipp Kraus <philipp(dot)kraus(at)flashpixx(dot)de> wrote:
>> how can I extract text, images and other structures can be ignored,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> with PHP from a PDF file?
>> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
                    ^^^^^          ^^^^^^^^^^
>> extract only the text content
>> to create a text analysis of the content eg for LaTeX scripts we would
>> like the chapter structure as well.
>>
>> Is there any solution to do this with build-in PHP functions?
>
> I tried a bunch of stuff to read some bank statements that were in PDF
> format so I could import them via CSV. Didn't work out so well. Adobe's
> OCR feature only works if the PDFs are unlocked to allow it. I found an
> application that would do that but the OCRed text was unusable.
>
> So, my question is "what's generating the PDF files?"

The ability to read can be of advantage sometimes …

> Can you get whomever to do it in text or some other format?

OMG. One can leave it to you to give the worst possible technical advice.

> If they're encrypted images, then you've got a lot of work to do in order
> to get some output. Maybe.

Nobody but you is talking about images and OCR. You really don't have a
clue what PDF is, do you?

--
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.
Re: PDF extract text [message #185517 is a reply to message #185511] Mon, 07 April 2014 15:52
Christoph Michael Bec
Thomas 'PointedEars' Lahn wrote:

> Philipp Kraus wrote:
>
>> Is there any solution to do this with build-in PHP functions?
>                                            ^t
> No.

Well, there may not be a solution to do this with built-in PHP functions
(whatever a built-in PHP function might be; actually (almost) all PHP
functions are part of an extension), but at least *theoretically* it
would be possible by processing the PDF file "bytewise". (The PDF
specification is available online for free.)

--
Christoph M. Becker
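
To make the "bytewise" idea concrete, here is a minimal sketch that uses only functions that normally ship with PHP (the PCRE and zlib extensions). It is an illustration under strong assumptions, not a replacement for a real parser: the function names are hypothetical, only literal strings shown with the Tj and TJ operators are handled, and font encodings, hex strings, object streams, encryption and unescaped nested parentheses are ignored. Treating kerning values below -200 as a word gap is a guess that may need tuning per document.

```php
<?php
// Sketch only: function names and the -200 kerning threshold are assumptions.

/** Resolve backslash escapes in the body of a PDF literal string. */
function decodePdfLiteral($raw)
{
    $map = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\f"];
    return preg_replace_callback(
        '/\\\\([0-7]{1,3}|\r\n|.)/s',
        function ($m) use ($map) {
            $e = $m[1];
            if (preg_match('/^[0-7]+$/', $e)) {
                return chr(octdec($e) & 0xFF);      // octal character code
            }
            if ($e === "\r\n" || $e === "\n" || $e === "\r") {
                return '';                          // escaped line break: continuation
            }
            return isset($map[$e]) ? $map[$e] : $e; // \( \) \\ yield the character itself
        },
        $raw
    );
}

/** Collect the strings shown by Tj / TJ operators in one decoded content stream. */
function extractTextOperators($content)
{
    $literal = '\((?:\\\\.|[^\\\\()])*\)';          // "( ... )" without nested parentheses
    $pattern = '/(' . $literal . ')\s*Tj\b|\[((?:' . $literal . '|[^\]()])*)\]\s*TJ\b/s';
    preg_match_all($pattern, $content, $ops, PREG_SET_ORDER);
    $out = '';
    foreach ($ops as $op) {
        if (!isset($op[2])) {                       // "(text) Tj"
            $out .= decodePdfLiteral(substr($op[1], 1, -1));
        } else {                                    // "[(te)-20(xt)] TJ"
            preg_match_all('/(' . $literal . ')|(-?\d*\.?\d+)/s', $op[2], $parts, PREG_SET_ORDER);
            foreach ($parts as $p) {
                if (!isset($p[2])) {
                    $out .= decodePdfLiteral(substr($p[1], 1, -1));
                } elseif ((float) $p[2] < -200) {
                    $out .= ' ';                    // large negative kerning: treat as a word gap
                }
            }
        }
        $out .= "\n";                               // one output line per text-showing operator
    }
    return $out;
}

/** Run the operator scan over every stream in the file, inflating where possible. */
function extractPdfText($pdfBytes)
{
    $text = '';
    preg_match_all('/stream\r?\n(.*?)endstream/s', $pdfBytes, $streams);
    foreach ($streams[1] as $raw) {
        $body = $raw;                               // fallback: stream was not compressed
        // The end-of-line before "endstream" is not part of the data, so try
        // the capture as-is and with one or two trailing bytes removed.
        foreach ([$raw, substr($raw, 0, -1), substr($raw, 0, -2)] as $candidate) {
            $inflated = @gzuncompress((string) $candidate);
            if ($inflated !== false) {
                $body = $inflated;
                break;
            }
        }
        $text .= extractTextOperators($body);
    }
    return $text;
}

// Usage with a synthetic content stream (no real PDF needed):
$demo = 'BT /F1 12 Tf [(Hel)-20(lo)-333(W)15(orld)] TJ ET BT (Chapter \(1\)) Tj ET';
echo extractTextOperators($demo);
// For a real file: echo extractPdfText(file_get_contents('lecture.pdf'));
```

The regular expressions scan the whole file in memory, so this is only reasonable for small documents; large files can run into PCRE's backtracking limits. It also shows why a dedicated library is the practical choice: mapping the extracted bytes back to readable characters depends on each font's encoding, which this sketch does not attempt.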
Re: PDF extract text [message #185518 is a reply to message #185517] Mon, 07 April 2014 16:12
Thomas 'PointedEars'
Christoph Michael Becker wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Philipp Kraus wrote:
>>> Is there any solution to do this with build-in PHP functions?
>>                                            ^t
>> No.
>
> Well, there may not be a solution to do this with built-in PHP functions
> (whatever a built-in PHP function might be; actually (almost) all PHP
> functions are part of an extension), but at least *theoretically* it
> would be possible by processing the PDF file "bytewise". (The PDF
> specification is available online for free.)

*rolls eyes*

*bags collected trolls’ eyes*

--
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.