RSS reader – Part One


All right, so I’ve already figured out how to write an executable script that prints “Hello World!” to the command line. Now I need to figure out how to do something interesting.

So I was surfing the Internet and reading some Python documentation, trying to come up with something to do, but nothing seemed interesting enough or easy enough until I visited one of my favorite websites: slashdot.org.

It was there that I noticed the RSS icon that Firefox always shows me whenever I come across a website with an RSS feed (like this one).

So I thought to myself, ‘Hmm, I wonder how hard it would be to create a simple RSS reader in Python?’ Read on to discover the results.

First I had to look at Slashdot’s RSS XML to know what it looks like and what my code should be looking for. If you copy and paste http://rss.slashdot.org/Slashdot/slashdot into your web browser, you will be able to take a look at the XML format. The format is basically as follows:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF>
	
	<item rdf:about="http://someurl">
		<title>The Title</title>
		<link>The Link</link>
		<description>The Description</description>
		
	</item>
	<item rdf:about="http://someurl">
		<title>The Title</title>
		<link>The Link</link>
		<description>The Description</description>
		
	</item>
</rdf:RDF>

For this simple RSS reader I’m only going to pay attention to each <item> and ignore all other tags. In each <item> I’m going to want to display the <title> and the <description>. That way, when we run this program, we’ll see the titles and descriptions of all the items on Slashdot’s main page.
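The skeleton above leaves out the namespace declarations that the real feed carries on its root element. Here is a minimal sketch of a complete, parseable version, assuming the standard RSS 1.0 namespaces; the example.com URLs, the titles, and the name SAMPLE_FEED are made up for illustration. It is written in Python 3 syntax and needs no network access.

```python
from xml.dom import minidom

# Hypothetical stand-in for the Slashdot feed. Without the xmlns:rdf
# declaration the "rdf:" prefix is unbound and the parser rejects the document.
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.com/1">
    <title>First story</title>
    <link>http://example.com/1</link>
    <description>First description</description>
  </item>
  <item rdf:about="http://example.com/2">
    <title>Second story</title>
    <link>http://example.com/2</link>
    <description>Second description</description>
  </item>
</rdf:RDF>
""".encode("utf-8")

doc = minidom.parseString(SAMPLE_FEED)
root_name = doc.documentElement.nodeName
items = doc.getElementsByTagName("item")

print(root_name)   # rdf:RDF
print(len(items))  # 2
```

A small sample like this is handy for trying out the parsing code without hitting Slashdot on every run.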

Now that I knew what I wanted to do I fired up my trusty copy of gedit and began searching the web for information. By far the best site I’ve found for Python-based information is the source, python.org. There are tons of documentation and information on this site, everything a new programmer needs. There may be sites out there that are better, but I haven’t found them yet.

Now that I knew what I was going to do, I needed to find some libraries to help me with my task, since there was no way I was going to do all of this myself. Fortunately for me, Python has many built-in libraries that make doing complicated things a lot easier.

The first thing I needed to do was create a blank shell of a script based on my first executable Python script. I called my new file pythonRSS.py and edited it so that it only contained the following:

#! /usr/bin/env python

After that I needed to include (sorry for the C++ terminology) the libraries that I needed to make my RSS reader:

import urllib2
from xml.dom import minidom, Node

The first line gives me access to the library urllib2, and the second line imports the minidom submodule and the Node class from the xml.dom package and makes them available without the package prefix. Had I just used the following line:

import xml.dom

I would have had to reference Node using the package prefix, as xml.dom.Node. Instead I am able to reference it directly as Node.
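To make the difference between the two import styles concrete, here is a small sketch in Python 3 syntax. One detail to be aware of: as far as I can tell, a bare import xml.dom does not load the minidom submodule by itself, so the prefixed form of minidom needs import xml.dom.minidom.

```python
import xml.dom            # gives access to xml.dom.Node
import xml.dom.minidom    # needed before xml.dom.minidom can be used
from xml.dom import minidom, Node

# Both spellings refer to the very same objects; the "from" form
# just saves typing the package prefix.
same_node = xml.dom.Node is Node
same_minidom = xml.dom.minidom is minidom

print(same_node)       # True
print(same_minidom)    # True
print(Node.TEXT_NODE)  # 3, the DOM constant for text nodes
```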

Now to get down to the programming business. The first piece of the puzzle that I needed was Slashdot’s RSS XML, which is pretty easy to get using the urllib2 library. We download the RSS feed using the urlopen function:

url_info = urllib2.urlopen('http://rss.slashdot.org/Slashdot/slashdot')

This gives us a “file-like” object stored in url_info. If we wanted to run through the returned XML line by line, we could actually use the following:

for line in url_info:
	print line

But since we have a fancy XML library at our disposal, manually running through the file wouldn’t make much sense.
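To see what “file-like” means without touching the network, here is a small sketch that uses io.BytesIO as a stand-in for the object urlopen returns. The name fake_response and its contents are made up for illustration, and the code is in Python 3 syntax.

```python
import io

# Stand-in for the response object: anything that can be read and
# iterated line by line behaves the same way in a for loop.
fake_response = io.BytesIO(b"<title>One</title>\n<title>Two</title>\n")

# Each iteration yields one line as bytes, newline included.
lines = [raw.decode("utf-8").rstrip("\n") for raw in fake_response]

for line in lines:
    print(line)
```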

Now that we have the RSS XML we are going to pass it to minidom and have it parsed into a document:

xmldoc = minidom.parse(url_info)

The parse function parses up the XML into the Document Object Model, or DOM (as in minidom or xml.dom), and returns the document to us. For more information on this please see the Document Object Model specification.
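As a quick illustration of the parse step, the sketch below feeds the same XML to minidom in two ways: parse for a file-like object and parseString for data already in memory. The tiny feed document is invented for the example, and the code is in Python 3 syntax.

```python
import io
from xml.dom import minidom

xml_bytes = b"<feed><item><title>Hello</title></item></feed>"

# parse() takes a filename or a file-like object, such as a urlopen response.
doc_from_file = minidom.parse(io.BytesIO(xml_bytes))

# parseString() takes the XML data directly.
doc_from_string = minidom.parseString(xml_bytes)

print(doc_from_file.documentElement.nodeName)    # feed
print(doc_from_string.documentElement.nodeName)  # feed
```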

Now that we have the document, we’re going to get the root node (<rdf:RDF>) and then loop through all of its child nodes looking for <item> nodes. This is actually really simple and intuitive:

rootNode = xmldoc.documentElement
for node in rootNode.childNodes:
	if (node.nodeName == "item"):

The above is something that I am definitely learning to love about Python: the ability to iterate through things easily. Instead of having to redo the same iteration code for different classes and different types (like I have to in C++), Python makes this easy and intuitive using for loops or while loops.
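One thing worth knowing about childNodes is that it contains more than elements: the whitespace between tags shows up as text nodes, which is why the loop checks nodeName before doing anything. A small sketch on a made-up document, in Python 3 syntax:

```python
from xml.dom import minidom, Node

doc = minidom.parseString(b"<root>\n  <item/>\n  <other/>\n  <item/>\n</root>")
root = doc.documentElement

# Text nodes report the name "#text"; elements report their tag name.
names = [child.nodeName for child in root.childNodes]
print(names)

# Keeping only <item> elements skips both the whitespace and <other>.
items = [child for child in root.childNodes
         if child.nodeType == Node.ELEMENT_NODE and child.nodeName == "item"]
print(len(items))  # 2
```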

Once we’ve found an <item> node, we’re just going to do the exact same thing we did above, except this time we are going to look for <title> and <description> nodes:

for item_node in node.childNodes:
	if item_node.nodeName == "title":
		# Collect the text nodes that make up the title
		title = ""
		for text_node in item_node.childNodes:
			if text_node.nodeType == Node.TEXT_NODE:
				title += text_node.nodeValue
		if title:
			print title

	if item_node.nodeName == "description":
		# Collect the text nodes that make up the description
		description = ""
		for text_node in item_node.childNodes:
			if text_node.nodeType == Node.TEXT_NODE:
				description += text_node.nodeValue
		if description:
			print description + "\n"

The above code is basically just a variation of the code that we used to iterate through all the children of the root node. The only difference in this case is what we do when we find them: we iterate through all of their child nodes, collect the value of each text node, and store the result for printing. This is the only part of the script that felt a bit weird to me; shouldn’t it be more like node.Text or something like that? Either way it works, and that is basically the end of the RSS reader. Just feed it into Python or make it executable and that’s it:

$ python pythonRSS.py
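On the question of whether there is something closer to node.Text: as far as I can tell, minidom has no single attribute on an element that returns its text, but the text node itself exposes its content as data (the same value as nodeValue). When an element holds exactly one text child, firstChild.data is a common shortcut. A sketch on a made-up <item>, in Python 3 syntax:

```python
from xml.dom import minidom

doc = minidom.parseString(b"<item><title>The Title</title></item>")
title_element = doc.getElementsByTagName("title")[0]

# The element's text lives in a child text node, not on the element itself.
text_node = title_element.firstChild

print(text_node.data)       # The Title
print(text_node.nodeValue)  # The Title
```

The shortcut assumes the element is not empty (firstChild is None for an empty element) and that its text is not split across several child nodes, which is why looping over all the text nodes is the safer approach.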

All in all, this code didn’t take that long to write; a few Google searches and I was on my way. Truthfully, writing this blog post took longer than writing the code did.

Please keep in mind that there are many things wrong with this code; the most obvious is that it makes no use of functions to simplify the code. But this is only part one of my RSS reader, and with time I will fix up the code.
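To give an idea of what using functions could look like, here is one possible sketch, not necessarily where later parts of this series end up. The names get_text, read_items, and SAMPLE are hypothetical, the sample feed is invented, and the code is in Python 3 syntax.

```python
from xml.dom import minidom, Node


def get_text(element):
    """Join the values of all text nodes directly under element."""
    return "".join(child.nodeValue for child in element.childNodes
                   if child.nodeType == Node.TEXT_NODE)


def read_items(xml_data):
    """Return a list of (title, description) pairs, one per <item>."""
    doc = minidom.parseString(xml_data)
    results = []
    for node in doc.documentElement.childNodes:
        if node.nodeName != "item":
            continue
        title = description = ""
        for child in node.childNodes:
            if child.nodeName == "title":
                title = get_text(child)
            elif child.nodeName == "description":
                description = get_text(child)
        results.append((title, description))
    return results


# Invented sample data so the sketch runs without network access.
SAMPLE = b"""<feed>
  <item><title>First</title><description>First description</description></item>
  <item><title>Second</title><description>Second description</description></item>
</feed>"""

for title, description in read_items(SAMPLE):
    print(title)
    print(description + "\n")
```

Pulling the text-collecting loop into get_text removes the duplicated title and description code, and returning the items instead of printing them keeps the parsing separate from the display.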

Here is the script in its entirety with some comments thrown in. Alternatively, you could also download the code as a text file:

#! /usr/bin/env python

import urllib2
from xml.dom import minidom, Node

# Get the XML
url_info = urllib2.urlopen('http://rss.slashdot.org/Slashdot/slashdot')

if url_info:
	# We have the RSS XML, let's try to parse it
	xmldoc = minidom.parse(url_info)
	if xmldoc:
		# We have the doc, get the root node
		rootNode = xmldoc.documentElement
		# Iterate over the child nodes
		for node in rootNode.childNodes:
			# We only care about "item" entries
			if node.nodeName == "item":
				# Now iterate through all of the <item>'s children
				for item_node in node.childNodes:
					if item_node.nodeName == "title":
						# Loop through the title's text nodes to get
						# the actual title
						title = ""
						for text_node in item_node.childNodes:
							if text_node.nodeType == Node.TEXT_NODE:
								title += text_node.nodeValue
						# Now print the title if we have one
						if title:
							print title

					if item_node.nodeName == "description":
						# Loop through the description's text nodes to get
						# the actual description
						description = ""
						for text_node in item_node.childNodes:
							if text_node.nodeType == Node.TEXT_NODE:
								description += text_node.nodeValue
						# Now print the description if we have one.
						# Add a blank line with \n so that it looks better
						if description:
							print description + "\n"
	else:
		print "Error getting XML document!"
else:
	print "Error! Getting URL"
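A note on the two else branches above: urlopen and minidom.parse signal failure by raising exceptions rather than by returning an empty value, so those branches are unlikely to ever run. Here is a sketch of the same checks written with try/except, using Python 3 names (urllib.request instead of urllib2). The function name load_feed and the temporary files are made up for illustration; local file:// URLs stand in for the real feed so the example runs offline.

```python
import tempfile
import urllib.error
import urllib.request
from pathlib import Path
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def load_feed(url):
    """Return the parsed document, or None if fetching or parsing fails."""
    try:
        with urllib.request.urlopen(url) as response:
            return minidom.parse(response)
    except urllib.error.URLError as err:
        print("Error getting URL:", err)
    except ExpatError as err:
        print("Error getting XML document:", err)
    return None


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.xml"
    good.write_text("<feed><item/></feed>", encoding="utf-8")
    bad = Path(tmp) / "bad.xml"
    bad.write_text("<feed><item></feed>", encoding="utf-8")  # mismatched tags
    missing = Path(tmp) / "missing.xml"                      # never created

    good_doc = load_feed(good.as_uri())
    bad_doc = load_feed(bad.as_uri())
    missing_doc = load_feed(missing.as_uri())
```

With this shape, a download problem and a malformed feed each produce their own message, and the caller only has to check for None.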

Links that helped me get through this post:

selsine


5 Responses to “RSS reader – Part One”

  1. learning python » Blog Archive » RSS reader - Part Three - Generator Class
    Says:

    [...] Please remember to read part one and part two. [...]

  2. wardley
    Says:

    Perfect pages… tnx

  3. garvyn
    Says:

    Very needed information found here, thank you for your work

  4. Cayson
    Says:

    You should update this for Python 3.0.
    Not much needs to be changed, just note that instead of;

    || import urllib2

    you should use this;

    || import urllib.request

    and instead of;

    || url_info = urllib2.urlopen(URL)

    you should use this;

    || url_info = urllib.request.urlopen(URL)

  5. romaklimenko
    Says:

    Thank you! It’s very very very helpful )
