Learn How to Convert HTML Into an RSS Feed

www.techsunk.com

Do you want your website's content to be available as an RSS feed? Or do you want to follow a site that doesn't offer one? This article explains what RSS is and how to convert a page's HTML into an RSS feed.

Understanding RSS

  • RSS stands for Really Simple Syndication
  • RSS allows you to syndicate your site content
  • RSS defines a simple way to share and view headlines and content
  • RSS files can be automatically updated
  • RSS allows personalized views for different sites
  • RSS is written in XML

Sites without RSS feeds

Many websites provide RSS feeds, which make it easy to find out when something new has been published without actually visiting the site. Isn't that cool? But some websites don't provide an RSS feed.

More and more websites are likely to offer RSS feeds in the future. Before building your own, check whether the site already has one by looking in the page's HTML for a link element like this:

  <link rel="alternate" type="application/rss+xml"
    href="http://that_rss_url" >

You can also search Google for “sitename RSS” or “sitename RDF”.
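The check for a link element can also be automated. The following is a minimal sketch, not a full HTML parser: find_feed_links is a hypothetical helper name, and the sample page is made up for illustration. It scans each link tag and keeps the href of any tag whose type is application/rss+xml.

```perl
use strict;
use warnings;

# Hypothetical helper: return the href of every <link> element that
# advertises an RSS feed in the given HTML string.
sub find_feed_links {
    my ($html) = @_;
    my @feeds;
    # Look at each <link ...> tag in turn.
    while ($html =~ m{<link\b([^>]*)>}gi) {
        my $attrs = $1;
        # Keep only tags whose type attribute is application/rss+xml.
        next unless $attrs =~ m{\btype\s*=\s*["']application/rss\+xml["']}i;
        # Capture the href value, whichever quote style is used.
        push @feeds, $2 if $attrs =~ m{\bhref\s*=\s*(["'])(.*?)\1}i;
    }
    return @feeds;
}

# Made-up sample page: one stylesheet link and one feed link.
my $sample = <<'HTML';
<html><head>
<link rel="stylesheet" type="text/css" href="/site.css">
<link rel="alternate" type="application/rss+xml"
  href="http://that_rss_url" >
</head></html>
HTML

print "$_\n" for find_feed_links($sample);   # prints http://that_rss_url
```

A regular expression is enough for a quick check like this, but it assumes quoted attribute values; a page with unusual markup may need a real HTML parser.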

If none of these approaches turns up a feed, it's time to write an RSS generator that extracts content from the site's HTML. That is how you convert an HTML page to RSS.

Scanning HTML

Processing HTML to find just the bits of content you want is one of the black arts of modern programming. Not only is each web page different, but as its content changes from day to day, its template may change in unexpected ways, and your program should be robust enough to deal with that.

The basic approach is this:

  use LWP::Simple;
  my $content_url = 'http://whatever.url.to/get/from.html';
  my $content = get($content_url);  # get() returns undef on failure
  die "Can't get $content_url" unless defined $content;
  # …then extract things from $content…

For example, consider freshair.npr.org, the website for National Public Radio's interview program Fresh Air. Each segment appears in the page's HTML like this:

  <A HREF="http://www.npr.org/ramfiles/fa/20020920.fa.01.ram">Listen to
    <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
    <B> John Lasseter </B>
  </FONT></A>
  …
  <A HREF="http://www.npr.org/ramfiles/fa/20020920.fa.02.ram">Listen to
    <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
    <B> Singer and guitarist Jon Langford </B>
  </FONT></A>
  … plus any other segments ...

The parts we want to extract are these titles, along with the URL in each HREF:

  John Lasseter

  Singer and guitarist Jon Langford

We can get the page and match the content with this bit of code, whose regular expression we arrive at through a bit of trial-and-error:

  use LWP::Simple;
  my $content_url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
  my $content = get($content_url);
  die "Can't get $content_url" unless defined $content;
  $content =~ s/(\cm\cj|\cj|\cm)/\n/g; # nativize newlines
  my @items;
  # each match captures the link URL ($1) and the segment title ($2)
  while($content =~ m{
  \s+<A HREF="([^"\s]+)">Listen to
  \s+<FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
  \s+<B>(.*?)</B>
  }g) {
    my($url, $title) = ($1,$2);
    print "url: {$url}\ntitle: {$title}\n\n";
    push @items, $title, $url;
  }

When you run this program, it prints three segments, like this:

  url: {http://www.npr.org/ramfiles/fa/20020920.fa.01.ram}
  title: {John Lasseter}

  url: {http://www.npr.org/ramfiles/fa/20020920.fa.02.ram}
  title: {Singer and guitarist Jon Langford}

  url: {http://www.npr.org/ramfiles/fa/20020920.fa.03.ram}
  title: {Film critic David Edelstein}

Later we can comment out that print statement and add some code to write @items to an RSS file. First, here is the same approach applied to a second site, the Guardian's world news page:

  use LWP::Simple;
  my $content_url = 'http://www.guardian.co.uk/worldlatest/';
  my $content = get($content_url);
  die "Can't get $content_url" unless defined $content;
  $content =~ s/(\cm\cj|\cj|\cm)/\n/g; # nativize newlines
  my @items;
  # each match captures the story's relative URL ($1) and headline ($2)
  while($content =~
   m{<A HREF="(/worldlatest/.*?)">(.*?)</A><BR><B>.*?</B><P>}g
  ) {
    my($url, $title) = ($1,$2);
    print "url: {$url}\ntitle: {$title}\n\n";
    push @items, $title, $url;
  }

When we run it, that code correctly produces this list of items:

  url: {/worldlatest/story/0,1280,-2035841,00.html}
  title: {Unsolved Crimes Vex Afghanistan}

  url: {/worldlatest/story/0,1280,-2035838,00.html}
  title: {Christians Show Support For Israel}

  url: {/worldlatest/story/0,1280,-2035794,00.html}
  title: {Schroeder’s Party Wins 2nd Term}

  …and a dozen more items…

We’re ready to make both of these programs write their @items to an RSS feed — except for one thing: URLs in an RSS feed should really be absolute (starting with “http://…”), and not relative URLs like the “/worldlatest/story/0,1280,-2035794,00.html” we got from the Guardian page. Luckily the URI.pm class provides a simple way to turn a relative URL to an absolute one, given a base URL:

 URI->new_abs($rel_url => $base_url)->as_string

We can use this by adding a “use URI;” line to the start of our program and changing the end of our while loop to read like so:

    $url = URI->new_abs($url => $content_url)->as_string;
    print "url: {$url}\ntitle: {$title}\n\n";
    push @items, $title, $url;
  }

With that change made, our program emits absolute URLs, like this:

  url: {http://www.guardian.co.uk/worldlatest/story/0,1280,-2035841,00.html}
  title: {Unsolved Crimes Vex Afghanistan}

  url: {http://www.guardian.co.uk/worldlatest/story/0,1280,-2035838,00.html}
  title: {Christians Show Support For Israel}

  url: {http://www.guardian.co.uk/worldlatest/story/0,1280,-2035794,00.html}
  title: {Schroeder’s Party Wins 2nd Term}

  …and a dozen more items…
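Because a live page can change its template at any time, it may help to put the matching step in a routine that can be run against a saved snippet, with no network access. The sketch below does that for the Fresh Air markup shown earlier. The name extract_items is hypothetical, and the pattern is a looser variant of the one above (it uses the /x flag and accepts any FONT attributes), so treat it as one possible approach rather than the article's exact expression.

```perl
use strict;
use warnings;

# Hypothetical helper: pull (title, url) pairs out of Fresh Air-style HTML.
sub extract_items {
    my ($content) = @_;
    my @items;
    while ($content =~ m{
        <A \s+ HREF="([^"\s]+)"> \s* Listen \s+ to   # link URL
        \s* <FONT [^>]*>                              # any FONT attributes
        \s* <B> \s* (.*?) \s* </B>                    # segment title
    }gxis) {
        push @items, $2, $1;    # title first, then URL
    }
    return @items;
}

# Saved snippet copied from the example markup.
my $sample = <<'HTML';
  <A HREF="http://www.npr.org/ramfiles/fa/20020920.fa.01.ram">Listen to
    <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
    <B> John Lasseter </B>
  </FONT></A>
  <A HREF="http://www.npr.org/ramfiles/fa/20020920.fa.02.ram">Listen to
    <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
    <B> Singer and guitarist Jon Langford </B>
  </FONT></A>
HTML

my @items = extract_items($sample);
# An empty result usually means the page template has changed.
warn "No items matched; the page template may have changed\n" unless @items;
printf "%d items found\n", @items / 2;   # prints "2 items found"
```

Warning on an empty result is a cheap way to notice a template change instead of silently publishing an empty feed.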

Basic Syntax of RSS

As the list at the start of this article notes, RSS is written in XML. A minimal RSS file starts with an XML header, an appropriate doctype, and some metadata elements, like this:

  <?xml version="1.0"?>
  <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
    "http://my.netscape.com/publish/formats/rss-0.91.dtd">
  <rss version="0.91"><channel>
    <title> title of the site </title>
    <description> description of the site </description>
    <link> URL of the site </link>
    <language> the RFC 3066 language tag for this feed's content </language>

Then come a number of item elements, like this:

  <item><title>…headline…</title><link>…url…</link></item>

And then the document ends like this:

 </channel></rss>
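Text placed inside elements such as title and link has to be escaped, so that characters like & and < don't break the XML. The routine below is a minimal sketch of such an escaper. It is named xml_string to match the routine used in the next section, but its body is an assumption written for illustration, and it does not decode HTML entities (such as &frac12;) that are already present in the input.

```perl
use strict;
use warnings;

# Hypothetical sketch: escape a plain-text string for use as XML text.
# Assumes $text is a decoded character string, not raw bytes.
sub xml_string {
    my ($text) = @_;
    $text =~ s/\s+/ /g;          # collapse runs of whitespace
    $text =~ s/^ | $//g;         # trim a leading or trailing space
    # Replace XML-special and non-ASCII characters with numeric references.
    $text =~ s/([&<>"]|[^\x20-\x7E])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print xml_string("  Fish & Chips <now>  open "), "\n";
# prints: Fish &#38; Chips &#60;now&#62; open
```

Numeric character references are used here because they are valid in any XML document without extra entity declarations.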

Hooking it all together

Given an xml_string routine that escapes a string for use as XML text, we can use it in a routine that takes the contents of our @items (alternating title and URL) and returns them as XML, as a series of <item>…</item> elements, like so:

  sub rss_body {
    my $out = '';
    while(@_) {
      $out .= sprintf
       "  <item>\n\t<title>%s</title>\n\t<link>%s</link>\n  </item>\n",
       map xml_string($_),
           splice(@_,0,2); # get the first two each time
    }
    return $out;
  }

We can test that routine by doing this:

  print rss_body("Bogodyne rockets > 250&frac12;/share!", "http://test");

Its output is this:

  <item>
        <title>Bogodyne rockets > 250½/share!</title>
        <link>http://test</link>
  </item>
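To finish, the header from the Basic Syntax of RSS section, the items, and the closing tags can be joined into one document. The sketch below is one way to do that under a few assumptions: rss_document and its metadata hash are hypothetical names, the xml_string here is a simplified stand-in that only numeric-escapes special characters, and the channel values are placeholders.

```perl
use strict;
use warnings;

# Simplified stand-in for xml_string: numeric-escape XML-special
# and non-ASCII characters.
sub xml_string {
    my ($text) = @_;
    $text =~ s/([&<>"]|[^\x20-\x7E])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# Same idea as rss_body above: consume (title, url) pairs.
sub rss_body {
    my @pairs = @_;
    my $out = '';
    while (@pairs) {
        $out .= sprintf
            "  <item>\n\t<title>%s</title>\n\t<link>%s</link>\n  </item>\n",
            map { xml_string($_) } splice(@pairs, 0, 2);
    }
    return $out;
}

# Hypothetical wrapper: channel metadata, then the items, then closing tags.
sub rss_document {
    my ($meta, @items) = @_;
    my @fields = map { xml_string($meta->{$_}) }
                 qw(title description link language);
    my $head = sprintf(<<'XML', @fields);
<?xml version="1.0"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
  "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91"><channel>
  <title>%s</title>
  <description>%s</description>
  <link>%s</link>
  <language>%s</language>
XML
    return $head . rss_body(@items) . "</channel></rss>\n";
}

# Placeholder channel values; the item comes from the Guardian example.
my $feed = rss_document(
    { title => 'Example feed', description => 'Items scraped from a page',
      link  => 'http://example.com/', language => 'en' },
    'Unsolved Crimes Vex Afghanistan',
    'http://www.guardian.co.uk/worldlatest/story/0,1280,-2035841,00.html',
);
print $feed;
```

To publish the feed, the same string could be written to a file with open and print instead of being sent to standard output.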

With these pieces in place, you can extract content from an HTML page and publish it as an RSS feed.

This is one way to convert HTML to RSS. It takes some time to set up, but it's worth doing for the benefit of your website. We hope this article was helpful to you.
