PHP Function to extract anything between two tags

Here’s a small snippet of code that I use very frequently when parsing webpages or other content for specific items. For example, suppose a webpage contains the data you want in markup like this:

<html>
<body>
<h1>ABC</h1>
.... <!-- A lot of code -->
<div id="myNewsItem">This is my news, and I am interested in extracting this out</div>
.... <!-- and the HTML code continues on -->
</body>
</html>

I would like to extract the text inside the DIV tag with the id “myNewsItem”.

Here’s the PHP function to do the extraction:

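function SimMyExtract($string, $openingTag, $closingTag)
{
    $string = trim($string);

    // strpos() returns false when the tag is missing, so test for that
    // before adding the tag length.
    $start = strpos($string, $openingTag);
    if ($start === false) {
        return false; // opening tag not found
    }
    $start += strlen($openingTag);

    // Search for the closing tag only after the opening tag.
    $end = strpos($string, $closingTag, $start);
    if ($end === false) {
        return false; // closing tag not found
    }

    return substr($string, $start, $end - $start);
}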

Usage for the above example:

SimMyExtract( $content, '<div id="myNewsItem">', '</div>' );

You can call it repeatedly (recursively or in a loop) to extract each item from a list of similar tags, i.e. when the same tag is used a number of times on the same page. For more power I use it in conjunction with regular expressions. I will spare you the details of RegEx here, but it is extremely powerful, and I love the way it is implemented in PHP (both the Perl-compatible preg functions and the POSIX ereg functions, although the ereg family was deprecated in PHP 5.3 and removed in PHP 7).
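One way to collect every occurrence is to keep searching from the end of the previous match. The sketch below does this with a loop rather than recursion; the name extractAllBetween and the sample markup are hypothetical and not part of the original snippet.

```php
<?php
// Hypothetical helper (not from the original post): returns every
// substring found between $openingTag and $closingTag, in order.
function extractAllBetween(string $string, string $openingTag, string $closingTag): array
{
    $results = [];
    if ($openingTag === '' || $closingTag === '') {
        return $results; // nothing sensible to search for
    }

    $offset = 0;
    while (($start = strpos($string, $openingTag, $offset)) !== false) {
        $start += strlen($openingTag);

        // Look for the closing tag only after the current opening tag.
        $end = strpos($string, $closingTag, $start);
        if ($end === false) {
            break; // unclosed tag: stop searching
        }

        $results[] = substr($string, $start, $end - $start);

        // Continue after the closing tag that was just consumed.
        $offset = $end + strlen($closingTag);
    }

    return $results;
}

$html  = '<ul><li>One</li><li>Two</li><li>Three</li></ul>';
$items = extractAllBetween($html, '<li>', '</li>');
// $items is ['One', 'Two', 'Three']
print_r($items);
```

Like the single-match function, this treats the tags as plain strings, so it does not cope with nested tags of the same name.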

For instance the same function could be reduced to:

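// ereg() no longer exists in PHP 7+, so this uses preg_match() instead.
// preg_quote() escapes the tags; (.*?) captures the shortest text between them.
$pattern = '/' . preg_quote($openingTag, '/') . '(.*?)' . preg_quote($closingTag, '/') . '/s';
return preg_match($pattern, $content, $result) ? $result[1] : false;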

The point is that RegEx can capture many occurrences and extract them in one go, so it is worth mastering. An interesting exercise would be to extract all URLs (the contents of HREF attributes) from a webpage.
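As a starting point for that exercise, here is a minimal sketch using preg_match_all(). The function name extractHrefs and the sample markup are hypothetical, and the pattern only handles href values wrapped in single or double quotes; it is not a full HTML parser.

```php
<?php
// Hypothetical helper: returns the value of every quoted href attribute
// found on an <a> tag in $html.
function extractHrefs(string $html): array
{
    // (["\']) remembers which quote opened the value, (.*?) captures the
    // value itself, and \1 requires the same quote to close it.
    $pattern = '/<a\s+(?:[^>]*?\s)?href\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);

    return $matches[2]; // second capture group: the URL text
}

$html = '<a href="https://example.com/">Home</a> '
      . '<a class="nav" href=\'/about\'>About</a>';
$urls = extractHrefs($html);
// $urls is ['https://example.com/', '/about']
print_r($urls);
```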


    6 Responses to PHP Function to extract anything between two tags

    1. greyfade says:

      You could simplify this even more by not passing the closing tag at all:

      $closingTag = preg_replace("/\\<\\s*(\\S*)[^>]*\\>/", "</\\1>", $openingTag);
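      As a quick check of that suggestion, the sketch below applies the same pattern to the opening tag used in the post. The variable names are only illustrative.

```php
<?php
// Derive the closing tag from the opening tag by capturing the tag name.
$openingTag = '<div id="myNewsItem">';
$closingTag = preg_replace('/<\s*(\S*)[^>]*>/', '</\1>', $openingTag);

echo $closingTag; // </div>
```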

    2. Dave says:

      It would seem that this is not going to work with nested tags … as in the case of
      <html>
      <body>
      <h1>ABC</h1>
      …. <!-- A lot of code -->
      <div id="myNewsItem">This is my news,
      <div class="pullquote">the rain in spain</div>
      and I am
      interested in extracting this out</div>
      …. <!-- and the HTML code continues on -->
      </body>
      </html>

      It would appear that the regex will extract
      This is my news,
      <div class="pullquote">the rain in spain</div>
      and miss
      and I am
      interested in extracting this out

      Am I mistaken? If not, is there a way to make this work as intended?

    3. Asim says:

      Perhaps. But when you have a list of nodes that you want to traverse and extract, it's better to use domXML with xquery.
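      A minimal sketch of that approach for the nested example above. It assumes the DOM extension bundled with PHP 5 and later, and XPath through DOMXPath, rather than the older domxml extension named in the comment. The markup is a shortened copy of the nested case.

```php
<?php
// Shortened copy of the nested markup from the comment above.
$html = '<html><body><h1>ABC</h1>'
      . '<div id="myNewsItem">This is my news, '
      . '<div class="pullquote">the rain in spain</div>'
      . ' and I am interested in extracting this out</div>'
      . '</body></html>';

// Collect parser warnings instead of printing them; real pages are
// rarely valid HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the div by its id; the parser tracks nesting for us.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="myNewsItem"]');

// textContent includes the text of nested elements such as the pullquote.
$text = $nodes->length > 0 ? $nodes->item(0)->textContent : false;
echo $text;
```

      Because the parser matches each closing tag to its own opening tag, the inner </div> no longer cuts the result short.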

    4. ishika says:

      Please tell me how you used it recursively. I am using it that way, but it returns only one result. I need all the data that appears between the tags each time they occur.

      Please help

      below is my code :

      <?php

      function SimMyExtract($string, $openingTag, $closingTag)
      {
      $string = trim($string);
      $start = intval(strpos($string,$openingTag)
      + strlen($openingTag));
      $end = intval(strpos($string,$closingTag));

      if($start == 0 || $end ==0)
      return false; // not found

      $mytext = substr($string,$start, $end - $start);
      return $mytext;
      }

      $ch = curl_init();
      curl_setopt($ch, CURLOPT_URL,"http://www.lonare.com");
      curl_setopt($ch, CURLOPT_TIMEOUT, 30); //timeout after 30 seconds
      curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
      $result = curl_exec ($ch);
      curl_close ($ch);

      $text = strip_tags($result);

      $text = str_replace("Today,", "<div class=\"mydata\">", $text);

      $text = str_replace("FML#", " </div> ", $text);

      $text1 = SimMyExtract($text, '<div class="mydata">', '</div>');

      echo $text1."<br>";

      ?>

    5. Dan says:

      I can't seem to make this piece of code work…
      this is my PHP file

      <?PHP
      function SimMyExtract($string, $openingTag, $closingTag)
      {
      $string = trim($string);
      $start = intval(strpos($string,$openingTag) + strlen($openingTag));
      $end = intval(strpos($string,$closingTag));

      if($start == 0 || $end ==0)
      return false; // not found

      $mytext = substr($string,$start, $end - $start);
      return $mytext;
      }
      ?>

      <html>
      <body>
      <h1>ABC</h1>
      …. <!-- A lot of code -->
      <div id="myNewsItem">This is my news, and I am
      interested in extracting this out</div>
      …. <!-- and the HTML code continues on -->

      <?PHP
      $exdata = SimMyExtract($text, '<div id="myNewsItem">', '</div>');
      echo $exdata;
      ?>

      </body>
      </html>

      Any ideas?

