redundant sections are marked out.
This article is based on a document export taken from the free hosting service WordPress.com. The service uses some plugins and creates some tags that may not be included in the WordPress core downloaded from WordPress.org.
Two of the great things about WordPress are its portability and its popularity. It is extremely easy for a WordPress owner to move their entire site, comments and all, between different hosting providers without the use of complex database languages such as SQL.
Every WordPress site provides the option to import and export data between WordPress servers. This is not restricted to the site entries themselves but can also include the post categories, tags, comments, drafts and even spam! It does all this with the WordPress Extended Rss document format, WXR.
The WXR format is based on the Really Simple Syndication (Rss) specification, which is a very popular dialect of XML. It was designed as a syndication format for websites that wish to share and serialise some of their data. http://www.rssboard.org/
A web syndication specification might seem an odd choice for a site exporting tool, but the popularity of Rss on today’s Internet, its simplicity and its extensibility through 3rd party extensions make it a great choice. Being an XML dialect also means you can open the file in any text editor and have complete access to all data in a human-readable mark-up format, in a layout not too different from an HTML file.
To create a WXR export file you need to log in to your WordPress Dashboard, scroll down to Tools and select Export.
A filter option allows you to drill down to specific data to trim your export file size. If you are exporting the complete site I’d recommend changing the Statuses filter to ‘Published’. If left as ‘All Statuses’, the blog’s redundant auto-saved entries will be included, which needlessly duplicate the published articles. Once you have pressed the Download Export File button and the download has finished, you should have an XML document with the name wordpress-[yyyy]-[mm]-[dd].xml. You can open this with any text editor, even Windows Notepad, but it is preferable to use an editor with XML syntax colouring as it makes the document much easier to read. Notepad++ http://notepad-plus-plus.org/ is a good choice for Windows users while TextMate http://macromates.com/ was probably the best choice for OS X.
As the title suggests, in this post I will attempt to decode the content of the WordPress Extended Rss document. This means I will list, in published order, the Rss elements contained within a standard export and briefly describe their purpose.
This will not be a tutorial on XML or Rss and I will assume you have some understanding of both. However, if this is not the case, things should not be too hard to follow, especially for people familiar with HTML documents.
<!-- This is a WordPress eXtended RSS file generated by WordPress as an export of your site. -->
At the top of the WXR file there is a large commented section explaining the purpose of the document and, in case you have forgotten, instructions on how to import the file to a WordPress site.
Beyond the comments is the required <rss> element containing 5 namespace extensions as well as the Rss numeric version. The extensions include the RDF site summary content module, the well-formed web comment API, the Dublin Core metadata element set and 2 WordPress extensions. If this isn’t making too much sense then don’t worry, as it is not really important unless you are developing an Rss parser.
The namespaces listed are unique, with each serving specific functions that the base Rss specification does not cover. Each XML namespace declaration starts with xmlns: and is followed by an abbreviated title for the namespace, which is usually an acronym. The URL that follows each title is required and uniquely identifies the namespace; by convention it points to a webpage that provides further information on the namespace.
xmlns:dc="http://purl.org/dc/elements/1.1/" Is an example of the Dublin Core element set namespace.
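To see which namespaces an export actually declares, a short Python sketch along these lines could be used. The sample header below is a hypothetical, trimmed-down stand-in for a real export; the namespace URLs and version numbers in your own file may differ, so treat them as assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical WXR header for illustration only; a real export may
# declare different URLs or version numbers.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:excerpt="http://wordpress.org/export/1.0/excerpt/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wp="http://wordpress.org/export/1.0/">
  <channel/>
</rss>"""


def declared_namespaces(xml_bytes):
    """Return {prefix: url} for every xmlns declaration in the document."""
    namespaces = {}
    # The "start-ns" event fires once per xmlns declaration encountered.
    for _event, (prefix, url) in ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",)):
        namespaces[prefix] = url
    return namespaces


NAMESPACES = declared_namespaces(SAMPLE)
for prefix, url in NAMESPACES.items():
    print(f"xmlns:{prefix} -> {url}")
```

Reading the declarations from the file, rather than hard-coding them, avoids surprises if the export uses a different version of a namespace URL.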
<![CDATA[ ]]> Some tags in an Rss or XML document contain unparsed character data (CDATA) enclosures. These tell XML parsers not to process the text contained within. It is a safety measure against illegal characters that would normally generate errors. http://www.w3schools.com/xml/xml_cdata.asp
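As a small illustration of what the enclosure does, the Python sketch below parses a made-up category name containing characters that would otherwise be illegal in XML text. The fragment and its namespace URL are assumptions for the example, not taken from a real export.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: "&" and "<" would break the parser outside CDATA.
FRAGMENT = (
    '<wp:cat_name xmlns:wp="http://wordpress.org/export/1.0/">'
    "<![CDATA[Tips & <Tricks>]]>"
    "</wp:cat_name>"
)

element = ET.fromstring(FRAGMENT)
# The parser hands back the enclosed text untouched, without the CDATA wrapper.
cat_name = element.text
print(cat_name)
```

The parser strips the enclosure itself, so code reading a WXR file normally never needs to handle the CDATA markers directly.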
Beneath the <rss> element is the <channel> container element. This holds all the child elements and data related to the WordPress site. You can find the closing </rss> element at the bottom of the Rss document. At the top of the <channel> we have the elements that are associated with the WordPress metadata.
<title> Contains the title of the site.
<link> Is the URL of the site as determined by WordPress.
<description> Is a tagline that can be modified in the Dashboard under General Settings, Tagline.
<pubDate> Is the time and date that the WXR document was created. It is in the RFC-822 format http://asg.web.cmu.edu/rfc/rfc822.html as required by the Rss standard. The format should be self-explanatory except for the last numeric value, which represents the local offset from GMT using a +/-hhmm format. Plus 2 hours from GMT would be represented as +0200. The WordPress time zone can be changed in the Dashboard under General Settings, Timezone.
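For anyone reading these dates in code, Python's standard library can parse the RFC-822 style directly. The date string below is a made-up example value, not one taken from a real export.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Made-up <pubDate> value with a +0200 offset from GMT.
pub_date = parsedate_to_datetime("Sat, 15 Jan 2011 09:30:00 +0200")

print(pub_date.isoformat())
# The trailing +0200 becomes a timezone offset of two hours.
print(pub_date.utcoffset() == timedelta(hours=2))
```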
<language> Is the primary language the site is written in, as determined by General Settings, Language in the WordPress Dashboard. A list of valid codes used to represent the language can be found at http://www.rssboard.org/rss-language-codes.
<wp:wxr_version> This is our first example of an extended Rss element. We can recognise that it does not belong to the Rss specification as the element name contains a colon. The part left of the colon is the element’s extension (its namespace), while the part to the right is the element name. wp:wxr_version is the version number of the WordPress eXtended Rss format.
<wp:base_site_url> Is the root URL of the WordPress hosting provider.
<wp:base_blog_url> Is the root URL of the WordPress site.
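The channel metadata described so far can be read with a few lines of Python. This is only a sketch: the sample document, its values and the wp namespace URL are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URL for the wp extension; check your own export's <rss> element.
NS = {"wp": "http://wordpress.org/export/1.0/"}

# Hypothetical, heavily trimmed channel.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <title>Example Blog</title>
    <link>http://example.wordpress.com</link>
    <description>Just another WordPress.com site</description>
    <language>en</language>
    <wp:wxr_version>1.0</wp:wxr_version>
    <wp:base_site_url>http://wordpress.com/</wp:base_site_url>
    <wp:base_blog_url>http://example.wordpress.com</wp:base_blog_url>
  </channel>
</rss>"""

channel = ET.fromstring(SAMPLE).find("channel")

# Plain Rss elements need no prefix; extended ones are looked up via the NS map.
site = {
    "title": channel.findtext("title"),
    "link": channel.findtext("link"),
    "language": channel.findtext("language"),
    "wxr_version": channel.findtext("wp:wxr_version", namespaces=NS),
    "base_blog_url": channel.findtext("wp:base_blog_url", namespaces=NS),
}
print(site)
```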
<wp:category> Contains a complete collection of categories associated with the blog. You can view and edit the list within the WordPress Dashboard under Posts, Categories. Each category is given its own <wp:category> element, which contains the following 3 child elements.
<wp:category_nicename> Is the category name in a URL friendly format.
<wp:category_parent> If the category belongs to a hierarchy then the parent category is listed.
<wp:cat_name> The original name of the category contained within an unparsed character data enclosure.
<wp:tag> Contains a complete collection of the tags assigned to posts. You can view and edit the tags within the Dashboard under Posts, Post Tags. It contains the following 2 child elements.
<wp:tag_slug> Is the URL friendly name of the tag.
<wp:tag_name> Is the original name of the tag contained within an unparsed character data enclosure.
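The category and tag collections lend themselves to simple lookups. The Python sketch below builds them from a hypothetical sample; the wp namespace URL, the sample values, and the idea that the parent is stored as a nicename are all assumptions for the example.

```python
import xml.etree.ElementTree as ET

NS = {"wp": "http://wordpress.org/export/1.0/"}  # assumed namespace URL

# Hypothetical sample with two categories (one nested) and one tag.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <wp:category>
      <wp:category_nicename>how-to</wp:category_nicename>
      <wp:category_parent></wp:category_parent>
      <wp:cat_name><![CDATA[How To]]></wp:cat_name>
    </wp:category>
    <wp:category>
      <wp:category_nicename>xml</wp:category_nicename>
      <wp:category_parent>how-to</wp:category_parent>
      <wp:cat_name><![CDATA[XML]]></wp:cat_name>
    </wp:category>
    <wp:tag>
      <wp:tag_slug>rss</wp:tag_slug>
      <wp:tag_name><![CDATA[Rss]]></wp:tag_name>
    </wp:tag>
  </channel>
</rss>"""

channel = ET.fromstring(SAMPLE).find("channel")

categories = [
    {
        "nicename": c.findtext("wp:category_nicename", namespaces=NS),
        # An empty parent element means the category sits at the top level.
        "parent": c.findtext("wp:category_parent", namespaces=NS) or None,
        "name": c.findtext("wp:cat_name", namespaces=NS),
    }
    for c in channel.findall("wp:category", NS)
]

# Map each URL friendly tag slug to its original display name.
tags = {
    t.findtext("wp:tag_slug", namespaces=NS): t.findtext("wp:tag_name", namespaces=NS)
    for t in channel.findall("wp:tag", NS)
}

print(categories)
print(tags)
```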
<generator> Is the name or a URL pointing to the homepage of the application that was used to create the Rss document.
<cloud> Is a pointer to the RssCloud API which is a blog monitoring service supported by WordPress.com. It enables a supporting client to receive instant notification when the blog is updated. http://www.rssboard.org/rsscloud-interface
<image> Is a logo belonging to the site that can be displayed by Rss clients. You can modify the logo under the General Settings, Blog Picture / Icon dialog in the Dashboard. There are strict size and image format requirements imposed by the Rss standard. http://www.rssboard.org/rss-specification#ltimagegtSubelementOfLtchannelgt
<atom:link rel="search"> Is a URL pointing to the Open Search description document supplied by WordPress. It gives supporting Rss clients and web browsers an easy means to submit search terms to the blog and receive results in a standardised XML format. http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document
<atom:link rel="hub"> Is a URL pointing to the Google-designed PubSubHubbub notification service that is supported by WordPress. In my opinion this is easier to implement and use than the alternative <cloud> service that offers similar functionality. http://code.google.com/p/pubsubhubbub/
That is the end of the Rss metadata related elements. Below is the list of child elements contained within each <item> element. Items are repeated multiple times, as each item holds a single blog post, article or page.
<title> Title of the blog post or page.
<link> URL to the blog post or page.
<pubDate> Time and date that the post was posted online.
<dc:creator> Lists the author of the post. The element is a Dublin Core Rss extension as the Rss specification doesn’t contain any suitable elements for this role.
<guid> Is the globally unique identifier used for the identification of the blog post item by Rss and WordPress clients. The isPermaLink="false" attribute just means that this identifier is not a legitimate website URL and is not usable in a web browser.
<description> In Rss documents this element contains the synopsis of the item but in WXR it is left blank.
<content:encoded> Is the replacement for the restrictive Rss <description> element. Enclosed within a character data enclosure is the complete WordPress formatted blog post, HTML tags and all.
<excerpt:encoded> Is a summary or description of the post, often used by RSS/Atom feeds.
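Pulling the basic fields out of each item might look something like the Python sketch below. The sample item, its values and the isPermaLink attribute on <guid> are assumptions for illustration; the namespace URLs are the commonly published ones for Dublin Core and the content module.

```python
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Hypothetical single-post export, trimmed to the plain Rss and content fields.
SAMPLE = """<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Hello world</title>
      <link>http://example.wordpress.com/2011/01/15/hello-world/</link>
      <pubDate>Sat, 15 Jan 2011 09:30:00 +0000</pubDate>
      <dc:creator>alice</dc:creator>
      <guid isPermaLink="false">http://example.wordpress.com/?p=1</guid>
      <description></description>
      <content:encoded><![CDATA[<p>First <em>post</em>.</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

posts = []
for item in ET.fromstring(SAMPLE).iter("item"):
    guid = item.find("guid")
    posts.append({
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "author": item.findtext("dc:creator", namespaces=NS),
        "guid": guid.text,
        # Rss treats a missing isPermaLink attribute as "true".
        "guid_is_permalink": guid.get("isPermaLink", "true") == "true",
        # The full post body, HTML tags included.
        "html": item.findtext("content:encoded", namespaces=NS),
    })

print(posts[0]["title"], "by", posts[0]["author"])
```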
<wp:post_id> This is an auto-incremental, numeric, unique identification number given to each post, article or page.
<wp:post_date> Time and date that the post was published.
<wp:post_date_gmt> Time and date in GMT that the post was published.
<wp:comment_status> A value stating whether public access for posting comments is open or closed.
<wp:post_name> Is a unique, URL friendly nicename based on the post title.
<wp:status> Publish status of the post, with options including publish, draft, pending and private.
<wp:post_parent> The numeric identification number of the post’s parent. I think this is applicable to WordPress pages, which can be nested within each other.
<wp:menu_order> I assume this is related to menu navigation of nested pages.
<wp:post_type> The post type, such as post or page.
<wp:post_password> A non-encrypted password used by WordPress to restrict reading access to the post.
<wp:is_sticky> A numeric Boolean value (0 is false, 1 is true) indicating whether the post is sticky. A sticky post is displayed before all other non-sticky posts.
<category> Each category associated with the item is given 2 category elements. The first element contains just the category as a name, while the second element contains both the category name and the URL friendly nicename attribute.
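The wp: post fields above make it straightforward to filter an export in code. The Python sketch below keeps only published entries and reads the sticky flag and category nicenames. The sample items and the wp namespace URL are hypothetical, and wp_text is a helper name invented for this example.

```python
import xml.etree.ElementTree as ET

NS = {"wp": "http://wordpress.org/export/1.0/"}  # assumed namespace URL

# Hypothetical export containing one published post and one draft.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <item>
      <title>Hello world</title>
      <wp:post_id>1</wp:post_id>
      <wp:post_name>hello-world</wp:post_name>
      <wp:status>publish</wp:status>
      <wp:post_parent>0</wp:post_parent>
      <wp:post_type>post</wp:post_type>
      <wp:is_sticky>1</wp:is_sticky>
      <category><![CDATA[How To]]></category>
      <category nicename="how-to"><![CDATA[How To]]></category>
    </item>
    <item>
      <title>Unfinished idea</title>
      <wp:post_id>2</wp:post_id>
      <wp:post_name>unfinished-idea</wp:post_name>
      <wp:status>draft</wp:status>
      <wp:post_parent>0</wp:post_parent>
      <wp:post_type>post</wp:post_type>
      <wp:is_sticky>0</wp:is_sticky>
    </item>
  </channel>
</rss>"""


def wp_text(item, name):
    """Hypothetical helper: text of a wp: child element, or '' if absent."""
    return item.findtext(f"wp:{name}", default="", namespaces=NS)


published = []
for item in ET.fromstring(SAMPLE).iter("item"):
    if wp_text(item, "status") != "publish":
        continue  # skip drafts, pending and private entries
    published.append({
        "id": int(wp_text(item, "post_id")),
        "slug": wp_text(item, "post_name"),
        "sticky": wp_text(item, "is_sticky") == "1",
        # Only the second <category> of each pair carries the nicename attribute.
        "category_slugs": sorted(
            {c.get("nicename") for c in item.findall("category") if c.get("nicename")}
        ),
    })

print(published)
```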
<wp:postmeta> Are containers for newer additions to the WXR document format that have been introduced after the original WXR specification. Each <wp:postmeta> element contains 2 child elements.
<wp:meta_key> Is the reference key for the meta data element.
<wp:meta_value> Is the value for the meta data element contained within a character data enclosure.
Below are some of the <wp:meta_key> references currently used by WXR.
delicious; is data related to the Delicious social bookmarking web service. http://www.delicious.com/
geo_latitude; is the positioning location of the author when they submitted the post. The value is the latitude in degrees using the World Geodetic System 1984 (WGS84) datum. It seems to be based on the Google Gears Geolocation API. http://code.google.com/apis/gears/api_geolocation.html
geo_longitude; is the positioning location of the author when they submitted the post. The value is the longitude coordinates.
geo_accuracy; is the horizontal accuracy of the above positioning values in metres.
geo_address; is the address determined by the above geolocation data.
geo_public; is a Boolean numeric value that determines if the geolocation data should be displayed in the post.
_wpas_; related tags may have something to do with the WordPress Sharing services.
reddit; is data related to the reddit social news web service. http://www.reddit.com/
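Because every <wp:postmeta> element is just a key and value pair, a post's meta data collapses naturally into a dictionary. The Python sketch below shows one way to do that with made-up geolocation values; the wp namespace URL and the post_meta helper name are assumptions for the example.

```python
import xml.etree.ElementTree as ET

NS = {"wp": "http://wordpress.org/export/1.0/"}  # assumed namespace URL

# Hypothetical item carrying three geolocation meta entries.
SAMPLE = """<item xmlns:wp="http://wordpress.org/export/1.0/">
  <wp:postmeta>
    <wp:meta_key>geo_latitude</wp:meta_key>
    <wp:meta_value><![CDATA[51.5074]]></wp:meta_value>
  </wp:postmeta>
  <wp:postmeta>
    <wp:meta_key>geo_longitude</wp:meta_key>
    <wp:meta_value><![CDATA[-0.1278]]></wp:meta_value>
  </wp:postmeta>
  <wp:postmeta>
    <wp:meta_key>geo_public</wp:meta_key>
    <wp:meta_value><![CDATA[1]]></wp:meta_value>
  </wp:postmeta>
</item>"""


def post_meta(item):
    """Hypothetical helper: collapse an item's wp:postmeta pairs into a dict."""
    meta = {}
    for entry in item.findall("wp:postmeta", NS):
        key = entry.findtext("wp:meta_key", namespaces=NS)
        meta[key] = entry.findtext("wp:meta_value", default="", namespaces=NS)
    return meta


meta = post_meta(ET.fromstring(SAMPLE))

# Only expose the coordinates when the author marked them as public.
location = None
if meta.get("geo_public") == "1":
    location = (float(meta["geo_latitude"]), float(meta["geo_longitude"]))

print(location)
```

Note that all meta values arrive as text, so numeric ones such as the coordinates need converting before use.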
<wp:comment> Is a child element of the post item that contains the 12 sub-elements listed below. Each <wp:comment> element set holds the sub-elements belonging to a single post comment.
<wp:comment_id> This is an auto-incremental, numeric, unique identification number given to each comment.
<wp:comment_author> The name of the author who submitted the comment. The name value is contained within an unparsed character data enclosure.
<wp:comment_author_email> An e-mail address provided by the author of the comment.
<wp:comment_author_url> The URL of the author’s website provided by the author of the comment.
<wp:comment_author_IP> The IP address belonging to the author of the comment. The IP address is automatically recorded by WordPress.
<wp:comment_date> The date and time local to the blog that the comment was posted.
<wp:comment_date_gmt> The date and time at GMT that the comment was posted.
<wp:comment_content> The comment text enclosed within a character data enclosure.
<wp:comment_approved> A numeric Boolean value to determine if the comment is displayed.
<wp:comment_type> The type of comment. If left blank it is classed as a normal comment. A value of trackback means it is a notification that another site has linked to the post http://en.wikipedia.org/wiki/Trackback.
<wp:comment_parent> The numeric identification of the parent comment used when the comment is a response to a pre-existing comment.
<wp:comment_user_id> A numeric identification belonging to the author if they were logged in when they submitted the comment.
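The comment sub-elements carry enough information to rebuild a threaded discussion. The Python sketch below groups approved comments by their parent and prints them indented. The sample data and the wp namespace URL are hypothetical, and it assumes that a <wp:comment_parent> of 0 marks a top-level comment.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"wp": "http://wordpress.org/export/1.0/"}  # assumed namespace URL

# Hypothetical item with a comment, a reply to it, and an unapproved comment.
SAMPLE = """<item xmlns:wp="http://wordpress.org/export/1.0/">
  <wp:comment>
    <wp:comment_id>1</wp:comment_id>
    <wp:comment_author><![CDATA[Bob]]></wp:comment_author>
    <wp:comment_content><![CDATA[Great post]]></wp:comment_content>
    <wp:comment_approved>1</wp:comment_approved>
    <wp:comment_parent>0</wp:comment_parent>
  </wp:comment>
  <wp:comment>
    <wp:comment_id>2</wp:comment_id>
    <wp:comment_author><![CDATA[Alice]]></wp:comment_author>
    <wp:comment_content><![CDATA[Thanks Bob]]></wp:comment_content>
    <wp:comment_approved>1</wp:comment_approved>
    <wp:comment_parent>1</wp:comment_parent>
  </wp:comment>
  <wp:comment>
    <wp:comment_id>3</wp:comment_id>
    <wp:comment_author><![CDATA[Spammer]]></wp:comment_author>
    <wp:comment_content><![CDATA[Buy now]]></wp:comment_content>
    <wp:comment_approved>0</wp:comment_approved>
    <wp:comment_parent>0</wp:comment_parent>
  </wp:comment>
</item>"""


def wp_text(node, name):
    """Hypothetical helper: text of a wp: child element, or '' if absent."""
    return node.findtext(f"wp:{name}", default="", namespaces=NS)


# Group approved comments under the id of the comment they reply to.
replies = defaultdict(list)
for comment in ET.fromstring(SAMPLE).findall("wp:comment", NS):
    if wp_text(comment, "comment_approved") != "1":
        continue
    parent_id = int(wp_text(comment, "comment_parent") or 0)
    replies[parent_id].append({
        "id": int(wp_text(comment, "comment_id")),
        "author": wp_text(comment, "comment_author"),
        "text": wp_text(comment, "comment_content"),
    })


def thread_lines(parent_id=0, depth=0):
    """Return the discussion as indented lines, replies nested under parents."""
    lines = []
    for c in replies.get(parent_id, []):
        lines.append(f"{'  ' * depth}{c['author']}: {c['text']}")
        lines.extend(thread_lines(c["id"], depth + 1))
    return lines


thread = thread_lines()
print("\n".join(thread))
```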
Hopefully that extensive list helps you out.
It should be current with all the main elements in a standard WordPress Extended Rss document. If you find any mistakes, errors or know the purpose of any of the unknown elements please leave a comment.
The WordPress eXtended Rss (WXR) Export/Import, XML Document Format Decoded and Explained.