Function to Parse URLs From a PDF Document

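You don’t need a full PDF parser to pull the links out of a PDF. The simple routine below reads the file as raw bytes, matches every URI entry it finds, and returns an array of the unique URLs/URIs. Note that it only sees links stored as plain text in the file; links that sit inside compressed object streams will not be matched.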

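function getLinksFromPDF($pdfFileLocation){
	// Read the raw bytes of the PDF; no PDF library is needed.
	$pdfContent = file_get_contents($pdfFileLocation);
	if ($pdfContent === false) {
		return array();
	}
	// Link annotations store their target as "/URI (http://...)".
	preg_match_all('/\/URI\s*\((.*?)\)/', $pdfContent, $matches);
	// Drop duplicates and reindex so the keys run 0..n-1.
	return array_values(array_unique($matches[1]));
}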
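
One case the routine above does not cover is a URL that itself contains parentheses, which a PDF writes with backslash escapes, such as `Foo_\(bar\)`. The following is a minimal sketch of a variant that tolerates those escapes. The function name `extractUrisFromPdfBytes` and the sample data are made up for illustration, and it takes the PDF content as a string rather than a file path. It still assumes the links are stored uncompressed, and it does not handle unescaped nested parentheses.

```php
<?php
// Hypothetical helper (not part of the original routine): extracts URI
// strings from raw PDF bytes that are already loaded into memory.
function extractUrisFromPdfBytes(string $pdfContent): array
{
    // Match "/URI", optional whitespace, then a literal string in parentheses.
    // Inside the string, accept either an escaped character (backslash plus
    // any byte) or any byte that is not a backslash or a closing parenthesis.
    $pattern = '/\/URI\s*\(((?:\\\\.|[^\\\\)])*)\)/s';
    preg_match_all($pattern, $pdfContent, $matches);

    $links = array();
    foreach ($matches[1] as $raw) {
        // Undo the escapes for parentheses and backslashes only.
        $links[] = preg_replace('/\\\\([()\\\\])/', '$1', $raw);
    }

    // Drop duplicates and reindex the result.
    return array_values(array_unique($links));
}

// Made-up sample resembling uncompressed link annotations.
$sample = "<< /S /URI /URI (https://example.com/a) >>\n"
        . "<< /S /URI /URI(https://example.com/wiki/Foo_\\(bar\\)) >>\n"
        . "<< /S /URI /URI (https://example.com/a) >>";

print_r(extractUrisFromPdfBytes($sample));
```

To use it on a file, pass in the result of `file_get_contents()` after checking that the read did not return `false`.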
