An SEO professional asked John Mueller during a hangout about embedded PDF files.
Their question: their site uses iframes and a script to embed PDF files in its pages. Is there any SEO advantage to taking the OCR text of a PDF and pasting it into the HTML document?
Or will Google simply parse the PDF contents and index them with the same weight and relevance?
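For context, the setup described in the question looks something like this (a hypothetical sketch; the file path and dimensions are illustrative):

```html
<!-- Hypothetical embed of the kind described in the question:
     the PDF is rendered inside the page via an iframe -->
<iframe src="/downloads/whitepaper.pdf" width="100%" height="600"
        title="Embedded whitepaper PDF"></iframe>
```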
John explained that the question momentarily threw him off, because it sounded like they wanted to take the text of the PDF and simply hide it in the HTML for SEO reasons, which is something he would not recommend.
If you want to have the content indexable, he said, then you should make it visible on the page.
So that’s the first thing he would say regarding PDFs.
He did confirm that Google does try to take the text out of the PDFs and index that for the PDFs themselves.
From a practical perspective, what happens with a PDF is, as one of the first steps, Google will convert it into an HTML page, and they try to index that like an HTML page.
So, essentially, by iframing the PDF, you are embedding what amounts to an indirect HTML page.
And, when it comes to iframes, Google does take this content into account for indexing within that primary page.
But, it can also happen that they index the PDF separately anyway. From that point of view, it’s really difficult to say exactly what will happen.
John would turn the question around and frame it as “what do you want to have happen?” And if you want your normal web pages to be indexed with the content of the PDF file, then make it so that the content is immediately visible on the HTML page.
Instead of embedding the PDF as the primary piece of content, make the HTML content the primary piece and then link to the PDF file.
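That approach might look like the following sketch, where the filenames and text are hypothetical: the readable content lives in the HTML itself, and the PDF is offered as a plain link rather than embedded as the primary content.

```html
<!-- Hypothetical page: the HTML text is the primary, indexable content,
     with a plain link to the PDF instead of an embedded iframe -->
<article>
  <h1>2024 Product Guide</h1>
  <p>The full guide text is rendered directly here in the HTML,
     so Google can index it as part of this page.</p>
  <p><a href="/downloads/product-guide.pdf">Download this guide as a PDF</a></p>
</article>
```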
Then there is the question of whether or not you want those PDFs indexed separately.
Sometimes you do want to have PDFs indexed separately. And if you want to have them indexed separately, then linking to them is great.
If you don’t want them indexed separately, you can use robots.txt to block Google from crawling them, which keeps their content out of the index.
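A minimal robots.txt along those lines might look like this, assuming a hypothetical /downloads/ directory for the PDFs (the `$` end-of-URL wildcard is supported by Googlebot, though not by every crawler):

```
# Hypothetical robots.txt: block crawling of PDF files
User-agent: *
Disallow: /downloads/
Disallow: /*.pdf$
```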
You can also use the X-Robots-Tag HTTP header with a noindex directive. This is a bit more complicated, because you have to serve that header with the PDF files themselves.
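On an Apache server with mod_headers enabled, for example, serving that header for PDFs could be sketched like this (a hypothetical config snippet, not a recommendation for any particular setup):

```apache
# Hypothetical Apache config: attach an X-Robots-Tag noindex header
# to every PDF response, so Google drops the PDFs from its index
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```

Unlike robots.txt blocking, this approach still lets Google crawl the PDF; the noindex header then tells it not to index the file.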
This happens at approximately the 17:30 mark in the video.