github

mmcdole / gofeed

  • Saturday, April 16, 2016, 03:11:29
https://github.com/mmcdole/gofeed

Go
Parse RSS and Atom feeds in Go



gofeed

The gofeed library is a robust feed parser that supports parsing both RSS and Atom feeds. The universal gofeed.Parser will parse and convert all feed types into a hybrid gofeed.Feed model. You also have the option of parsing them into their respective atom.Feed and rss.Feed models using the feed specific atom.Parser or rss.Parser.

Supported feed types:
  • RSS 0.90
  • Netscape RSS 0.91
  • Userland RSS 0.91
  • RSS 0.92
  • RSS 0.93
  • RSS 0.94
  • RSS 1.0
  • RSS 2.0
  • Atom 0.3
  • Atom 1.0

It also provides support for parsing several popular extension modules, including Dublin Core and Apple’s iTunes extensions. See the Extensions section for more details.

Overview

Universal Feed Parser

The universal gofeed.Parser works in 3 stages: detection, parsing and translation. It first detects the type of the feed it is parsing. Then it uses a feed specific parser to parse the feed into its true representation, which will be either an rss.Feed or an atom.Feed. These models cover every field possible for their respective feed types. Finally, they are translated into a gofeed.Feed model that is a hybrid of both feed types. Performing the universal feed parsing in these 3 stages allows for more flexibility and keeps the code base more maintainable by separating RSS and Atom parsing into separate packages.

Diagram

The translation step is done by anything that adheres to the gofeed.Translator interface. The DefaultRSSTranslator and DefaultAtomTranslator are used behind the scenes when you use the gofeed.Parser with its default settings. You can see how they translate fields from atom.Feed or rss.Feed to the universal gofeed.Feed struct in the Default Mappings section. However, should you disagree with the way certain fields are translated, you can easily supply your own gofeed.Translator and override this behavior. See the Advanced Usage section for an example of how to do this.
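The three stages and the pluggable translation step described above can be illustrated with a minimal, standard-library-only sketch. This is not gofeed's implementation: every type and function name below (rssFeed, atomFeed, hybridFeed, translator, detectFeedType, parse) is hypothetical, and the models carry only two fields each.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"strings"
)

// All names in this file are hypothetical and exist only for this sketch.

// rssFeed and atomFeed stand in for the format-specific models.
type rssFeed struct {
	Title       string `xml:"channel>title"`
	Description string `xml:"channel>description"`
}

type atomFeed struct {
	Title    string `xml:"title"`
	Subtitle string `xml:"subtitle"`
}

// hybridFeed stands in for the unified model.
type hybridFeed struct {
	Title       string
	Description string
}

// translator converts a format-specific model into the hybrid model.
type translator interface {
	Translate(feed interface{}) (*hybridFeed, error)
}

type rssTranslator struct{}

func (rssTranslator) Translate(feed interface{}) (*hybridFeed, error) {
	f, ok := feed.(*rssFeed)
	if !ok {
		return nil, errors.New("expected *rssFeed")
	}
	return &hybridFeed{Title: f.Title, Description: f.Description}, nil
}

type atomTranslator struct{}

func (atomTranslator) Translate(feed interface{}) (*hybridFeed, error) {
	f, ok := feed.(*atomFeed)
	if !ok {
		return nil, errors.New("expected *atomFeed")
	}
	return &hybridFeed{Title: f.Title, Description: f.Subtitle}, nil
}

// detectFeedType is stage 1: it reads tokens up to the root element and
// classifies the document by that element's local name.
func detectFeedType(data string) string {
	dec := xml.NewDecoder(strings.NewReader(data))
	for {
		tok, err := dec.Token()
		if err != nil {
			return "unknown"
		}
		if start, ok := tok.(xml.StartElement); ok {
			switch start.Name.Local {
			case "rss", "RDF":
				return "rss"
			case "feed":
				return "atom"
			}
			return "unknown"
		}
	}
}

// parse chains detection, format-specific parsing (stage 2) and
// translation (stage 3).
func parse(data string) (*hybridFeed, error) {
	var (
		model interface{}
		tr    translator
	)
	switch detectFeedType(data) {
	case "rss":
		model, tr = &rssFeed{}, rssTranslator{}
	case "atom":
		model, tr = &atomFeed{}, atomTranslator{}
	default:
		return nil, errors.New("unrecognized feed type")
	}
	if err := xml.Unmarshal([]byte(data), model); err != nil {
		return nil, err
	}
	return tr.Translate(model)
}

func main() {
	feed, err := parse(`<feed xmlns="http://www.w3.org/2005/Atom"><title>Example</title><subtitle>An Atom feed</subtitle></feed>`)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(feed.Title, "-", feed.Description)
}
```

Because parse only talks to the translator interface, swapping in a different translation policy in this sketch means changing one assignment, which mirrors the flexibility the staged design is meant to provide.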

Feed Specific Parsers

The gofeed library provides two feed specific parsers: atom.Parser and rss.Parser. If the hybrid gofeed.Feed model that the universal gofeed.Parser produces does not contain a field from the atom.Feed or rss.Feed model that you require, it might be beneficial to use the feed specific parsers. When using the atom.Parser or rss.Parser directly, you can access all of the fields found in the atom.Feed and rss.Feed models. It is also marginally faster because you are able to skip the translation step.

However, for the vast majority of users, the universal gofeed.Parser is the best way to parse feeds. It lets users of the gofeed library ignore the differences between RSS and Atom feeds.

Basic Usage

Universal Feed Parser

The most common usage scenario will be to use gofeed.Parser to parse an arbitrary RSS or Atom feed into the hybrid gofeed.Feed model. This hybrid model allows you to treat RSS and Atom feeds the same.

Parse a feed from a URL:
fp := gofeed.NewParser()
feed, _ := fp.ParseURL("http://feeds.twit.tv/twit.xml")
fmt.Println(feed.Title)
Parse a feed from a string:
feedData := `<rss version="2.0">
<channel>
<title>Sample Feed</title>
</channel>
</rss>`
fp := gofeed.NewParser()
feed, _ := fp.ParseString(feedData)
fmt.Println(feed.Title)
Parse a feed from an io.Reader:
file, _ := os.Open("/path/to/a/file.xml")
defer file.Close()
fp := gofeed.NewParser()
feed, _ := fp.Parse(file)
fmt.Println(feed.Title)

Feed Specific Parsers

You can easily use the rss.Parser and atom.Parser directly if you have a usage scenario that requires it:

Parse an RSS feed into an rss.Feed:
feedData := `<rss version="2.0">
<channel>
<webMaster>example@site.com (Example Name)</webMaster>
</channel>
</rss>`
fp := rss.Parser{}
rssFeed, _ := fp.Parse(strings.NewReader(feedData))
fmt.Println(rssFeed.WebMaster)
Parse an Atom feed into an atom.Feed:
feedData := `<feed xmlns="http://www.w3.org/2005/Atom">
<subtitle>Example Atom</subtitle>
</feed>`
fp := atom.Parser{}
atomFeed, _ := fp.Parse(strings.NewReader(feedData))
fmt.Println(atomFeed.Subtitle)

Advanced Usage

Parse a feed while using a custom translator

The mappings and precedence order that are outlined in the Default Mappings section are provided by the following two structs: DefaultRSSTranslator and DefaultAtomTranslator. If you have fields that you think should have a different precedence, or if you want to make a translator that is aware of an unsupported extension you can do this by specifying your own RSS or Atom translator when using the gofeed.Parser.

Here is a simple example of creating a custom Translator that makes the /rss/channel/itunes:author field have a higher precedence than the /rss/channel/managingEditor field in RSS feeds. We will wrap the existing DefaultRSSTranslator since we only want to change the behavior for a single field.

First we must define a custom translator:

type MyCustomTranslator struct {
    defaultTranslator *gofeed.DefaultRSSTranslator
}

func NewMyCustomTranslator() *MyCustomTranslator {
    t := &MyCustomTranslator{}

    // We create a DefaultRSSTranslator internally so we can wrap its Translate
    // call since we only want to modify the precedence for a single field.
    t.defaultTranslator = &gofeed.DefaultRSSTranslator{}
    return t
}

func (ct *MyCustomTranslator) Translate(feed interface{}) (*gofeed.Feed, error) {
    // Named rssFeed so the variable does not shadow the rss package.
    rssFeed, found := feed.(*rss.Feed)
    if !found {
        return nil, fmt.Errorf("feed did not match expected type of *rss.Feed")
    }

    // Delegate to the wrapped default translator. Calling ct.Translate here
    // would recurse forever.
    f, err := ct.defaultTranslator.Translate(rssFeed)
    if err != nil {
        return nil, err
    }

    // Prefer itunes:author and fall back to managingEditor.
    if rssFeed.ITunesExt != nil && rssFeed.ITunesExt.Author != "" {
        f.Author = rssFeed.ITunesExt.Author
    } else {
        f.Author = rssFeed.ManagingEditor
    }
    return f, nil
}

Next you must configure your gofeed.Parser to utilize the new gofeed.Translator:

feedData := `<rss version="2.0">
<channel>
<managingEditor>Ender Wiggin</managingEditor>
<itunes:author>Valentine Wiggin</itunes:author>
</channel>
</rss>`

fp := gofeed.NewParser()
fp.RSSTrans = NewMyCustomTranslator()
feed, _ := fp.ParseString(feedData)
fmt.Println(feed.Author) // Valentine Wiggin

Extensions

Every element which does not belong to the feed's default namespace is considered an extension by gofeed. These are parsed and stored in a tree-like structure located at Feed.Extensions and Item.Extensions. These fields should allow you to access and read any custom extension elements.

In addition to the generic handling of extensions, gofeed also has built-in support for parsing certain popular extensions into their own structs for convenience. It currently supports the Dublin Core and Apple iTunes extensions, which you can access at Feed.ITunesExt, Feed.DublinCoreExt, Item.ITunesExt and Item.DublinCoreExt.
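To make the idea of generic extension handling concrete, here is a rough standard-library sketch that collects every element outside the feed's default namespace into a nested map. It is not how gofeed stores extensions: the extension type and collectExtensions function are hypothetical, the default namespace is assumed to be the namespace of the root element, the map is keyed by namespace URI (encoding/xml resolves declared prefixes to URIs), and child elements of an extension are not kept.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// extension is a hypothetical, flattened node type for this sketch.
type extension struct {
	Name  string
	Value string
}

// collectExtensions groups elements that are outside the root element's
// namespace, first by namespace URI and then by local element name.
func collectExtensions(data string) (map[string]map[string][]extension, error) {
	exts := map[string]map[string][]extension{}
	dec := xml.NewDecoder(strings.NewReader(data))
	defaultNS, seenRoot := "", false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return exts, nil
		}
		if err != nil {
			return nil, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok {
			continue
		}
		if !seenRoot {
			// Assumption: the root element defines the default namespace.
			defaultNS, seenRoot = start.Name.Space, true
			continue
		}
		if start.Name.Space == defaultNS {
			continue // a core feed element, not an extension
		}
		// Read the element's character data and skip to its end tag.
		var value string
		if err := dec.DecodeElement(&value, &start); err != nil {
			return nil, err
		}
		ns, name := start.Name.Space, start.Name.Local
		if exts[ns] == nil {
			exts[ns] = map[string][]extension{}
		}
		exts[ns][name] = append(exts[ns][name], extension{Name: name, Value: strings.TrimSpace(value)})
	}
}

func main() {
	data := `<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Sample Feed</title>
<dc:creator>Ender Wiggin</dc:creator>
</channel>
</rss>`
	exts, err := collectExtensions(data)
	if err != nil {
		fmt.Println("failed:", err)
		return
	}
	for ns, byName := range exts {
		for name, list := range byName {
			fmt.Println(ns, name, len(list))
		}
	}
}
```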

Invalid Feeds

A best-effort attempt is made at parsing broken and invalid XML feeds. Currently, gofeed can successfully parse feeds with the following issues:

  • Unescaped/Naked Markup in feed elements
  • Undeclared namespace prefixes
  • Missing closing tags on certain elements
  • Illegal tags within feed elements without namespace prefixes
  • Missing "required" elements as specified by the respective feed specs.
  • Incorrect date formats
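As an illustration of the last item, one common way to tolerate incorrect date formats is to try a list of candidate layouts in order until one parses. The sketch below shows that general technique using only the standard library; it is not gofeed's date parser, and the parseFeedDate function and the particular layout list are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// layouts lists candidate formats in the order they are tried.
// The selection is illustrative, not exhaustive.
var layouts = []string{
	time.RFC1123Z,
	time.RFC1123,
	time.RFC822Z,
	time.RFC822,
	time.RFC3339,
	"Mon, 2 Jan 2006 15:04:05 -0700", // day of month without zero padding
	"2006-01-02 15:04:05",
	"2006-01-02",
}

// parseFeedDate (hypothetical) returns the first successful parse,
// normalized to UTC, or an error if no layout matches.
func parseFeedDate(s string) (time.Time, error) {
	s = strings.TrimSpace(s)
	for _, layout := range layouts {
		if parsed, err := time.Parse(layout, s); err == nil {
			return parsed.UTC(), nil
		}
	}
	return time.Time{}, fmt.Errorf("unrecognized date format: %q", s)
}

func main() {
	for _, raw := range []string{
		"Sat, 16 Apr 2016 03:11:29 +0000",
		"2016-04-16T03:11:29Z",
		"2016-04-16",
		"yesterday",
	} {
		parsed, err := parseFeedDate(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(parsed.Format(time.RFC3339))
	}
}
```

Ordering matters in this approach: stricter, standard layouts come first so that a loose fallback layout never shadows a more precise match.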

Default Mappings

The DefaultRSSTranslator and the DefaultAtomTranslator map the following rss.Feed and atom.Feed fields to their respective gofeed.Feed fields. They are listed in order of precedence (highest to lowest):

| gofeed.Feed | RSS | Atom |
| --- | --- | --- |
| Title | /rss/channel/title, /rdf:RDF/channel/title, /rss/channel/dc:title, /rdf:RDF/channel/dc:title | /feed/title |
| Description | /rss/channel/description, /rdf:RDF/channel/description, /rss/channel/itunes:subtitle | /feed/subtitle, /feed/tagline |
| Link | /rss/channel/link, /rdf:RDF/channel/link | /feed/link[@rel="alternate"]/@href, /feed/link[not(@rel)]/@href |
| FeedLink | /rss/channel/atom:link[@rel="self"]/@href, /rdf:RDF/channel/atom:link[@rel="self"]/@href | /feed/link[@rel="self"]/@href |
| Updated | /rss/channel/lastBuildDate, /rss/channel/dc:date, /rdf:RDF/channel/dc:date | /feed/updated, /feed/modified |
| Published | /rss/channel/pubDate | |
| Author | /rss/channel/managingEditor, /rss/channel/webMaster, /rss/channel/dc:author, /rdf:RDF/channel/dc:author, /rss/channel/dc:creator, /rdf:RDF/channel/dc:creator, /rss/channel/itunes:author | /feed/author |
| Language | /rss/channel/language, /rss/channel/dc:language, /rdf:RDF/channel/dc:language | /feed/@xml:lang |
| Image | /rss/channel/image, /rdf:RDF/image, /rss/channel/itunes:image | /feed/logo |
| Copyright | /rss/channel/copyright, /rss/channel/dc:rights, /rdf:RDF/channel/dc:rights | /feed/rights, /feed/copyright |
| Generator | /rss/channel/generator | /feed/generator |
| Categories | /rss/channel/category, /rss/channel/itunes:category, /rss/channel/itunes:keywords, /rss/channel/dc:subject, /rdf:RDF/channel/dc:subject | /feed/category |

| gofeed.Item | RSS | Atom |
| --- | --- | --- |
| Title | /rss/channel/item/title, /rdf:RDF/item/title, /rdf:RDF/item/dc:title, /rss/channel/item/dc:title | /feed/entry/title |
| Description | /rss/channel/item/description, /rdf:RDF/item/description, /rss/channel/item/dc:description, /rdf:RDF/item/dc:description | /feed/entry/summary |
| Content | | /feed/entry/content |
| Link | /rss/channel/item/link, /rdf:RDF/item/link | /feed/entry/link[@rel="alternate"]/@href, /feed/entry/link[not(@rel)]/@href |
| Updated | /rss/channel/item/dc:date, /rdf:RDF/rdf:item/dc:date | /feed/entry/modified, /feed/entry/updated |
| Published | /rss/channel/item/pubDate | /feed/entry/published, /feed/entry/issued |
| Author | /rss/channel/item/author, /rss/channel/item/dc:author, /rdf:RDF/item/dc:author, /rss/channel/item/dc:creator, /rdf:RDF/item/dc:creator, /rss/channel/item/itunes:author | /feed/entry/author |
| Guid | /rss/channel/item/guid | /feed/entry/id |
| Image | /rss/channel/item/itunes:image, /rss/channel/item/media:image | |
| Categories | /rss/channel/item/category, /rss/channel/item/dc:subject, /rss/channel/item/itunes:keywords, /rdf:RDF/channel/item/dc:subject | /feed/entry/category |
| Enclosures | /rss/channel/item/enclosure | /feed/entry/link[@rel="enclosure"] |
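
The precedence rule used by these mappings (take the first source field that has a value) can be sketched in a few lines. This is only an illustration of the rule, not the translators' actual code: the channel struct, its field names, and the firstNonEmpty and translateAuthor helpers are hypothetical, and the field order follows the Author row for RSS feeds.

```go
package main

import (
	"fmt"
	"strings"
)

// firstNonEmpty returns the first candidate that is not blank,
// or "" when every candidate is blank.
func firstNonEmpty(candidates ...string) string {
	for _, c := range candidates {
		if strings.TrimSpace(c) != "" {
			return c
		}
	}
	return ""
}

// channel is a hypothetical holder for the RSS author-like fields.
type channel struct {
	ManagingEditor string
	WebMaster      string
	DCAuthor       string
	DCCreator      string
	ITunesAuthor   string
}

// translateAuthor applies the precedence order, highest first.
func translateAuthor(ch channel) string {
	return firstNonEmpty(
		ch.ManagingEditor,
		ch.WebMaster,
		ch.DCAuthor,
		ch.DCCreator,
		ch.ITunesAuthor,
	)
}

func main() {
	ch := channel{WebMaster: "webmaster@example.com", ITunesAuthor: "Valentine Wiggin"}
	fmt.Println(translateAuthor(ch))
}
```

A custom translator that wants a different precedence only needs to reorder the arguments, which is the kind of change the Advanced Usage section describes.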

Dependencies

License

This project is licensed under the MIT License

Credits

  • Mark Pilgrim for his work on the excellent Universal Feed Parser Python library. This library was referenced several times during the development of gofeed, and many of its unit test cases were ported to the gofeed project.
  • Dan MacTough for his work on node-feedparser. It provided inspiration for the set of fields that should be covered in the hybrid gofeed.Feed model.
  • Matt Jibson for his date parsing function in the goread project.
  • Jim Teeuwen for his method of representing arbitrary feed extensions in the go-pkg-rss library.