A Flexible Text Parser Without Regular Expressions

Regular expressions are great, but sometimes plain string functions are the faster route (faster to write, that is), depending on what you're trying to do. When you want to quickly parse repetitive text and get the matches back as an array, this function can come in handy.

Example 1) Grab text from an HTML table into an array

$html = '
<table>
	<tr>
		<td>1 a</td>
		<td class="middle">1 b</td>
		<td>1 c</td>
	</tr>
	<tr>
		<td>2 a</td>
		<td class="middle">2 b</td>
		<td>2 c</td>
	</tr>
	<tr>
		<td>3 a</td>
		<td class="middle">3 b</td>
		<td>3 c</td>
	</tr>
</table>';

print_r(parseBetweenText($html, 'class="middle">','<'));

output:

Array
(
	[0] => 1 b
	[1] => 2 b
	[2] => 3 b
)
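For comparison, here is roughly what the same extraction might look like with a regular expression. This is only a sketch: the pattern and the shortened two-row table are my own assumptions for illustration, not something the function uses internally.

```php
<?php
// Shortened sample table (two rows), made up for this comparison.
$html = '
<table>
	<tr>
		<td>1 a</td>
		<td class="middle">1 b</td>
		<td>1 c</td>
	</tr>
	<tr>
		<td>2 a</td>
		<td class="middle">2 b</td>
		<td>2 c</td>
	</tr>
</table>';

// Capture everything after class="middle"> up to (not including) the next "<".
// The "i" flag mirrors the case-insensitive search that stripos() performs.
preg_match_all('/class="middle">([^<]*)</i', $html, $matches);

// $matches[1] holds the captured group of every match.
$cells = array_map('trim', $matches[1]);
print_r($cells);
```

With the string-function approach you pass the two markers as they appear in the text and don't have to think about delimiters, escaping or character classes, which is the "easier" part.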

Example 2) Parse subject from email headers

In this example we're not expecting, or even wanting, an array, just the first matching result.

$email = '
Delivered-To: ****@bitbook.io
Received: by 10.37.95.130 with SMTP id t124csp1610937ybb;
Mon, 28 Mar 2016 13:01:55 -0700 (PDT)
X-Received: by 10.140.90.106 with SMTP id w97mr9006787qgd.14.1459195315410;
Mon, 28 Mar 2016 13:01:55 -0700 (PDT)
Return-Path: <****@bubbleup.net>
Received: from relay4.bubbleup.net (relay4.bubbleup.net. [54.152.96.196])
by mx.google.com with ESMTP id q205si22435938qhq.67.2016.03.28.13.01.55
for <****@bitbook.io>;
Mon, 28 Mar 2016 13:01:55 -0700 (PDT)
Received-SPF: pass (google.com: domain of ****@bubbleup.net designates 54.152.96.196 as permitted sender) client-ip=54.152.96.196;
Authentication-Results: mx.google.com;
spf=pass (google.com: domain of *****@bubbleup.net designates 54.152.96.196 as permitted sender) smtp.mailfrom=****@bubbleup.net
Message-Id: <56f98db3.d6aa8c0a.8339f.ffffb8cfSMTPIN_ADDED_MISSING@mx.google.com>
To: ****@bitbook.io
Subject: Welcome to the Online Congress Community.
X-PHP-Originating-Script: 12386:functions.php
FROM: ****@bubbleup.net
Date: Mon, 28 Mar 2016 15:02:02 -0500 (CDT)
';
 
print_r(parseBetweenText($email, "Subject:", "\n", true, false, true));

output:

Welcome to the Online Congress Community.
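Stripped down to the single-match case, the idea is just two string searches and a substring. Here is a minimal sketch of that; the helper name firstBetween and the sample headers are hypothetical and only there to illustrate the steps.

```php
<?php
// Hypothetical helper (not the post's function): returns the text between
// the first $begin marker and the next $end marker, or '' if either is missing.
function firstBetween($text, $begin, $end) {
	$start = stripos($text, $begin);        // case-insensitive search for the begin marker
	if ($start === false) return '';
	$start += strlen($begin);               // skip past the begin marker itself
	$stop = stripos($text, $end, $start);   // the end marker must come after it
	if ($stop === false) return '';
	return trim(substr($text, $start, $stop - $start));
}

// Made-up headers for illustration.
$headers = "To: someone@example.com\nSubject: Hello there\nDate: Mon, 28 Mar 2016\n";
echo firstBetween($headers, "Subject:", "\n");
```

Two things to keep in mind with this kind of search: it is case-insensitive and not anchored to the start of a line, so an earlier header that merely contains "Subject:" would match first, and a subject on the very last line with no trailing newline would not match at all.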

Parse Function

function parseBetweenText(
	$text,
	$beginText,
	$endText,
	$removeSpace=true,
	$removeHtmlTags=true,
	$firstResultOnlyNoArray=false,
	$includeBeginEndText=false) {
	$results = array();
	// Empty markers cannot delimit anything and could cause an endless loop.
	if($beginText==='' || $endText==='') return $firstResultOnlyNoArray ? '' : $results;
	$endPos = 0;
	while(true) {
		// Find the next begin marker (case-insensitive), starting where the last match ended.
		$beginPos = stripos($text, $beginText, $endPos);
		if($beginPos===false) break;
		// Search for the end marker after the begin marker, so an end marker
		// that also appears inside the begin marker is not matched too early.
		$contentPos = $beginPos+strlen($beginText);
		$endPos = stripos($text, $endText, $contentPos);
		if($endPos===false) break;
		if($includeBeginEndText) $endPos = $endPos+strlen($endText);
		else $beginPos = $contentPos;
		$result = substr($text, $beginPos, $endPos-$beginPos);
		// Strip tags first so the whitespace cleanup also covers what the tags leave behind.
		if($removeHtmlTags){
			$result = strip_tags($result);
		}
		if($removeSpace){
			// Drop tabs and newlines, collapse runs of whitespace, trim the ends.
			$result = str_replace(array("\t","\n"),"",$result);
			$result = preg_replace("/\s{2,}/"," ",$result);
			$result = trim($result);
		}
		if($firstResultOnlyNoArray) return $result;
		if($result != '') $results[] = $result;
	}
	// No match: an empty string in first-result mode, otherwise the (possibly empty) array.
	return $firstResultOnlyNoArray ? '' : $results;
}
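To see what the $removeSpace and $removeHtmlTags options do to a single match, here is a small stand-alone sketch of that kind of cleanup. The sample string is made up, and the steps are a simplified approximation rather than a copy of the function body.

```php
<?php
// Made-up fragment with tags, a tab, newlines and repeated spaces.
$raw = "\n\t<b>Hello</b>   world \n";

$text = strip_tags($raw);                            // remove the <b>...</b> tags
$text = str_replace(array("\t", "\n"), '', $text);   // delete tabs and newlines
$text = preg_replace('/\s{2,}/', ' ', $text);        // collapse runs of whitespace
$text = trim($text);                                 // trim both ends

var_dump($text); // string(11) "Hello world"
```

Note that tabs and newlines are deleted rather than turned into spaces, so words separated only by a line break end up joined together. Pass false for $removeSpace if you need the original spacing.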
