An SEO professional asked John Mueller during a hangout about embedded PDF files.
Their question: their site uses iframes and a script to embed PDF files in its pages. Is there any SEO advantage to taking the OCR text of a PDF and pasting it into the HTML document?
Or will Google simply parse the PDF contents and index them with the same weight and relevance?
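For context, the setup described in the question looks something like this (a hypothetical sketch; the file path and dimensions are illustrative):

```html
<!-- Hypothetical embed of the kind described in the question:
     the PDF is rendered inside the page via an iframe -->
<iframe src="/downloads/whitepaper.pdf" width="100%" height="600"
        title="Embedded whitepaper PDF"></iframe>
```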
John explained that the question momentarily threw him off, because it sounded like they wanted to take the text of the PDF and simply hide it in the HTML for SEO reasons, which is something he would not recommend.
If you want to have the content indexable, he said, then you should make it visible on the page.
So that’s the first thing he would say regarding PDFs.
He did confirm that Google does try to take the text out of the PDFs and index that for the PDFs themselves.
From a practical perspective, what happens with a PDF is, as one of the first steps, Google will convert it into an HTML page, and they try to index that like an HTML page.
So, essentially, by iframing the PDF, you are embedding what amounts to an indirect HTML page.
And, when it comes to iframes, Google does take this content into account for indexing within that primary page.
But, it can also happen that they index the PDF separately anyway. From that point of view, it’s really difficult to say exactly what will happen.
John would turn the question around and frame it as “what do you want to have happen?” And if you want your normal web pages to be indexed with the content of the PDF file, then make it so that the content is immediately visible on the HTML page.
Instead of embedding the PDF as the primary piece of content, make the HTML content the primary piece and then link to the PDF file.
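That approach might look like the following sketch, where the filenames and text are hypothetical: the readable content lives in the HTML itself, and the PDF is offered as a plain link rather than embedded as the primary content.

```html
<!-- Hypothetical page: the HTML text is the primary, indexable content,
     with a plain link to the PDF instead of an embedded iframe -->
<article>
  <h1>2024 Product Guide</h1>
  <p>The full guide text is rendered directly here in the HTML,
     so Google can index it as part of this page.</p>
  <p><a href="/downloads/product-guide.pdf">Download this guide as a PDF</a></p>
</article>
```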
Then there is the question of whether or not you want those PDFs indexed separately.
Sometimes you do want to have PDFs indexed separately. And if you want to have them indexed separately, then linking to them is great.
If you don’t want them indexed separately, you can use robots.txt to block Google from crawling them, which keeps their content out of the index.
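A minimal robots.txt along those lines might look like this, assuming a hypothetical /downloads/ directory for the PDFs (the `$` end-of-URL wildcard is supported by Googlebot, though not by every crawler):

```
# Hypothetical robots.txt: block crawling of PDF files
User-agent: *
Disallow: /downloads/
Disallow: /*.pdf$
```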
You can also use the X-Robots-Tag HTTP header with a noindex directive. This is a bit more complicated, because you have to serve that header with the PDF files themselves.
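On an Apache server with mod_headers enabled, for example, serving that header for PDFs could be sketched like this (a hypothetical config snippet, not a recommendation for any particular setup):

```apache
# Hypothetical Apache config: attach an X-Robots-Tag noindex header
# to every PDF response, so Google drops the PDFs from its index
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```

Unlike robots.txt blocking, this approach still lets Google crawl the PDF; the noindex header then tells it not to index the file.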
This happens at approximately the 17:30 mark in the video.