HTML Excerpts Split By Lines/Words Via PHP

I recently wrote a PHP function that trims a long chunk of HTML down to an approximate number of lines. You pass in the characters-per-line figure to use in the calculation (it varies with paragraph width and font size), and the function splits the excerpt at the last convenient word within that limit. It closes any open HTML tags, replaces images with their alt text, and counts line-breaking tags (like p and br) as extra lines.

It is, of course, only approximate: font size is easily changed in browsers (see my earlier complaint about pixel fonts), and margins, varying character widths, and so on all change the amount of space used. But it can create nice-looking chunks of text of roughly similar size.
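To make the characters-per-line idea concrete, here is a minimal sketch of the underlying arithmetic for tag-free text. The helper name `estimate_lines` is hypothetical and is not part of the function in this post; it simply divides the visible character count by the assumed line width and rounds up, ignoring the extra lines that tags like p and br add.

```php
<?php
// Hypothetical helper for illustration only: estimate how many rendered lines
// a string needs, assuming every line holds $chars_per_line characters.
function estimate_lines(string $text, int $chars_per_line = 110): int
{
    // Count only visible characters; markup takes up no space on the line.
    $length = strlen(strip_tags($text));
    // A partly filled last line still counts as a line, so round up.
    return (int) ceil($length / $chars_per_line);
}

echo estimate_lines(str_repeat('a', 250), 110), "\n"; // prints 3
```

Note that strlen counts bytes, so multibyte text would be overestimated; that is in keeping with the approximate nature of the whole approach.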

function limit_lines($text, $line_limit=10, $end = '...', $chars_per_line=110)
{
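  //quick exit: short text with no html at all fits within the limit as-is
  if (strlen($text)<$line_limit*$chars_per_line && strpos($text,'<')===false)
    return $text;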
    
  $split_at=$line_limit*$chars_per_line;//default, if no html at all
  $total_chars=$chunk_length=0; //character tallies
  $lines=$curr_line_length=0; //line tallies
  $over_limit=false; //stop flag
  $tags = array(); //to track closing tags needed
  
  //first, remove any images (which are too variable in height) and replace with their alt text:
  preg_match_all('/<img\b(?:[^>]*?\salt=(\"[^\">]*\"|\'[^\'>]*\'|[^\s>]+))?[^>]*>/i', $text, $matches,PREG_SET_ORDER);
  foreach($matches as $index=>$match){
    $alt=isset($match[1])?$match[1]:''; //no alt attribute means no captured group
    $image_text=preg_replace('/(^[\"\']|[\"\']$)/','',$alt);
    $image_text=($image_text)?" [img: $image_text] ":" ";
    $text=str_replace($match[0],$image_text,$text);
  }
  
  //locate the character limit at which we want to split, by adding up the non-html text:
  preg_match_all('/<[^>]+>[^<]*/', $text, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER); 
  
  //we have a one-off case if there is any text BEFORE the first html tag. Deal with it here:
  if ($matches && $matches[0][0][1]>0) {
    //add up tallies, check to see if we've reached the limit:
    $split_at=$total_chars=$matches[0][0][1]; //update split marker and character tally
    $lines=floor($total_chars/$chars_per_line); //update line tally
    $curr_line_length=$total_chars%$chars_per_line;//hold onto modulo/leftover chars for next check
    if ($lines>$line_limit) { //over limit!
      $split_at-=($lines-$line_limit)*$chars_per_line;//move split marker back the amount we went over
      $over_limit=true;//set flag to stop counting
    }
  }
  
  //loop through each chunk of text found between html tags:
  foreach($matches as $index=>$match){
    if ($over_limit) break;

    //tag magic to make sure we add closing tags back in later without losing any:
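    $tag = rtrim(substr(strtok($match[0][0], " \t\n\r\0\x0B>"), 1), '/'); //rtrim copes with self-closed tags like <br/>
    if($tag[0] != '/') {
      //some tags add to line length:
      if (in_array($tag,array("br","hr"))) {
        //void tags break the line but never get a closing tag, so keep them off the stack
        $lines++;
        $curr_line_length=0;
      }
      else {
        $tags[] = $tag;
        if ($index>0 && in_array($tag,array("p","h1","h2","h3","h4","h5","blockquote"))) {
          $lines++;
          $curr_line_length=0;
        }
      }
    }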
    elseif(end($tags) == substr($tag, 1)) {
      array_pop($tags);
      //some closing tags add to line length:
      if (in_array(substr($tag, 1),array("p","h1","h2","h3","h4","h5","blockquote"))) { //drop the leading slash before comparing
        $lines++;
        $curr_line_length=0;
      }
    }

    //add up text, and check to see if we've reached the limit
    $total_chars+=$chunk_length=strlen(strip_tags($match[0][0]));//update character tally
    $split_at=$match[0][1]+strlen($match[0][0]); //update split marker
    if (($curr_line_length+=$chunk_length) > $chars_per_line){
      $lines+=floor($curr_line_length/$chars_per_line); //update line tally
      $curr_line_length=$curr_line_length%$chars_per_line;//hold onto modulo/leftover chars for next check
    }
    //check on every pass, since tags alone (br, p, etc.) can push us past the limit:
    if ($lines>$line_limit) { //over limit!
      $split_at-=min($chunk_length,($lines-$line_limit)*$chars_per_line);//move split marker back the amount we went over
      $over_limit=true;//set flag to stop counting
    }
  }

  //so now we chop the text at the character limit decided:
  $text=substr($text, 0, $split_at);

  //and if possible, split at first space found before end, not inside a tag
  // (if no spaces before the tag, then just stop with the chop above)
  if (preg_match('/\s[^\s>]*$/',$text,$match,PREG_OFFSET_CAPTURE)){
    $text=substr($text, 0, $match[0][1]);
  }
  
  //add closing tags and concluding ... :
  $text.= (($over_limit)?$end:"").(count($tags = array_reverse($tags)) ? '</' . implode('></', $tags) . '>' : '');
  
  return $text;
}
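
The last step of the function appends closing tags for whatever is still open when the text gets chopped. As a rough sketch of that idea in isolation, the hypothetical helper below (`close_open_tags` is my name for it here, not something the function above defines) takes a stack of open tag names, outermost first, and returns the closing tags in the right order.

```php
<?php
// Hypothetical helper for illustration only: build closing tags for a stack
// of open tag names, where the last array entry is the innermost tag.
function close_open_tags(array $open_tags): string
{
    if (!$open_tags) {
        return ''; // nothing left open, nothing to close
    }
    // Reverse the stack so the innermost tag is closed first.
    return '</' . implode('></', array_reverse($open_tags)) . '>';
}

echo close_open_tags(array('p', 'strong')), "\n"; // prints </strong></p>
```

This only works if void tags such as br and hr are kept off the stack, since they never have a closing tag.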