How to bring in news from a website or blog that does not have RSS

Unfortunately, not all websites have RSS.

To bring news into ContentBreeze from websites that do not have RSS, we have created a dedicated module. Setting it up is a bit more complex than adding an RSS feed and requires some basic technical (HTML) knowledge.

In this tutorial we give you step-by-step instructions for doing this.

This module creates an RSS feed for a site that does not have one. Once the feed is created, you only need to add it in the admin as if it were an RSS feed from any other website. In this case, it will be an internal feed, generated by ContentBreeze, for a website that does not have RSS.

Note: We are working on a simpler way to bring news from websites that do not have RSS. If you are not familiar with the term “regular expressions”, you may prefer to wait until we roll out a more user-friendly way of doing this. If you cannot wait, pass these instructions to someone who has a little experience with HTML, or let us know.

Brief explanation of how this module works

To bring news from a website that does not have RSS, the first thing to do is find a page on that site where news is listed.
Then you have to analyze the code of the page to find patterns, i.e. the HTML tags that encapsulate the title, description and link of each news item. That way, whenever ContentBreeze visits that page to look for content, it will know which HTML tags to look for in order to find the information it needs to build the feed.
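
As a rough illustration of the idea (not ContentBreeze's actual code), the Python sketch below applies a regular expression to a small, made-up listing page. The HTML sample, the tag layout and the group names are assumptions chosen for this example; a real site will use different tags.

```python
import re

# Hypothetical listing-page HTML; real sites will use different tags.
html = """
<article><h3><a href="/news/one">First headline</a></h3><p>First summary.</p></article>
<article><h3><a href="/news/two">Second headline</a></h3><p>Second summary.</p></article>
"""

# One pattern describing the tags that wrap the link, title and description.
item_pattern = re.compile(
    r'<article>\s*<h3>\s*<a href="(?P<link>[^"]*)">(?P<title>.*?)</a>\s*</h3>'
    r'\s*<p>(?P<description>.*?)</p>\s*</article>',
    re.DOTALL,
)

# Every match on the page becomes one feed item.
items = [m.groupdict() for m in item_pattern.finditer(html)]
for item in items:
    print(item["title"], "->", item["link"])
```

Each match yields one news item, which is why the pattern has to describe the tags around a single item rather than the whole page.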

In programming, this procedure is called pattern matching with regular expressions, and it is quite common. If you do not have basic HTML knowledge, we recommend asking someone who does to help you set this up. If you do not know anyone, we can do it for you.

Note that you only need to set this up once: once a site has been configured and the feed has been created, ContentBreeze will bring in news from that website like any other.

Add a website or blog without RSS

When you find a site that interests you, first check carefully whether it has RSS (which is much easier to set up). If it does not have RSS but has interesting news, you can bring that news into ContentBreeze by following this tutorial.

Keep the page you want to bring news from open in a browser tab, since you will need to inspect its HTML code.

Then log in to the ContentBreeze admin:

1- Go to the tab “Site”

2- Select the option “Blog without RSS”

3- Click on “New”

You will get to a screen that has two text areas that need to be populated. This is where you enter the data about the website you want to bring news from, which lets the module extract the news you want. In this tutorial we will use the URL https://www.nzherald.co.nz/ as an example.

4- Next, select the “Regular Expressions” option in the “Mode” list.

5- In “Name”, write the name of that website.

6- In the “URL” field, write the URL of the specific page you are taking the news from.

7- After this, click the “Create” button and then click the “Edit” button.

Now go back to the tab you have open with the website you want to bring news from. You need to find the HTML tags that contain the news headlines, so that you can tell ContentBreeze where to find them. An HTML tag is opened with “<tag>” and closed with “</tag>”. For example, <head> opens the tag and </head> closes it.

To open the browser's code inspection tool, press the F12 key while viewing the page whose code you want to see.

When you press F12, you will see something similar to the next image (the code inspection tool appears at the bottom of the screen).

Now you can search for the tag that contains the news you are interested in. An easy way to find the tag you need is to right-click on the article title and choose the option “Inspect Element”:

We recommend you do this using Google Chrome. In Chrome, when you move the mouse over an object (in this case, the title), Chrome will find the relevant tag for you.

The following tag was identified:

The relevant HTML tag in this case is <article> because it encapsulates a full news item.

As explained earlier, we need to identify the HTML tag that contains the complete news item. In Chrome, you can tell which tag that is because, when the news item is selected, the tag is highlighted by a shaded area. A different tag will encapsulate the title, the date or other information. Here we identify the tag “bkt02” as the relevant one because this tag encapsulates a full news item. In Firefox, for example, the tag is outlined by a dotted line that surrounds the news item.

Configuring the module regular expressions

Once the main tag of the news item has been identified, return to the regular expression module (the edit screen in the ContentBreeze admin) and complete the “Search by Item” field:

In this example, write the main tag identified in the previous step, <article>, because it is the one that contains the full news item.

After adding the general tag, add the specific tags for each of the fields you want to include in your feed.

The next tag in the code is “<h3>”, which identifies the title of the news item.

The next one is “<a href…>”, which identifies the link. This tag should also be added to the search item.

To form the link tag, you should use a wildcard. In this case, we will use “%” to collect the string of characters for the link.

The “%” character tells the module to take the string of characters found at the position where the “%” is placed. The “%” must be enclosed in “{}”.

By writing “{%}” after <a href=, you tell the module to take (from the page) the characters found after <a href= and bring them in as the desired item. So in the example below, we use “{%}” twice: the first time to bring in the link, and the second time to bring in the title of the news item.

The wildcard goes after the tag and “brings” whatever value appears in that position.

For example, if you had a link to a site with the title “MySite”, the tag would be something like: <a href="www.mysite.com"> MySite </a>. You need to include the wildcard twice, first to bring in “www.mysite.com” (for the link) and then “MySite” (for the title).

It would look like this: <a href="{%}"> {%} </a>
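
For readers who know regular expressions, the {%} wildcard behaves roughly like a capture group. The Python sketch below shows a hand-written regex that is approximately equivalent to the pattern above. It is only an analogy: how ContentBreeze translates the pattern internally is not documented here, and the whitespace handling in particular is an assumption.

```python
import re

# Rough regex equivalent of <a href="{%}"> {%} </a>:
# each {%} becomes a capture group; \s* tolerates the spaces around the title.
link_pattern = re.compile(r'<a href="(.*?)">\s*(.*?)\s*</a>')

match = link_pattern.search('<a href="www.mysite.com"> MySite </a>')
link, title = match.groups()
print(link)   # the text captured by the first wildcard
print(title)  # the text captured by the second wildcard
```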

Now you need to close the tags, i.e. write a closing </tag> for each tag that has been opened.
Tags are closed in reverse order, i.e. the first tag opened is closed last, the second is closed second to last, and so on.
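
If you want to double-check the closing order of a pattern before pasting it into the form, a small stack-based check can help. The sketch below uses Python's standard html.parser module; the class name, the helper function and the sample patterns are made up for this example, and it does not account for tags that never need closing (such as <img>).

```python
from html.parser import HTMLParser

class NestingChecker(HTMLParser):
    """Tracks open tags on a stack and records closes that are out of order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # The tag being closed must be the most recently opened one.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(tag)

def is_well_nested(pattern):
    checker = NestingChecker()
    checker.feed(pattern)
    checker.close()
    return not checker.errors and not checker.open_tags

good = '<article><h3><a href="{%}">{%}</a></h3>{*}</article>'
bad = '<article><h3><a href="{%}">{%}</h3></a></article>'
print(is_well_nested(good), is_well_nested(bad))
```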

We still have more information to gather, so we do not close the <article> tag yet.

The next tag that appears in the code belongs to the image that accompanies the article:

In this case <div> </div> is a tag that contains the image related to the news item. Let’s say that in this example we do not want to bring the image into our feed.

To omit certain tags, use the “*” character and write “{*}” after the last valid tag you have written. This tells the module that whatever lies between that tag and the next tag you write should be ignored.

In this case, we add {*} to the pattern to tell the module to omit the image:

If you wanted to add the image (related to the article) to the feed, you would add this field/tag too. What counts as relevant depends on the information you want in the feed you are creating.

The wildcard “*” tells the module that certain information is not important and should be omitted. Thus, the image will not be included when the module gathers the information to create this feed.
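
To make the difference between the two wildcards concrete, here is a minimal Python sketch that turns a pattern containing {%} and {*} into a regular expression. The function name, the translation rules and the sample HTML are assumptions made for illustration; ContentBreeze's own implementation may work differently.

```python
import re

def template_to_regex(template):
    """Sketch: {%} captures text, {*} skips text without capturing it."""
    tokens = re.split(r"(\{%\}|\{\*\})", template)
    regex = ""
    for token in tokens:
        if token == "{%}":
            regex += "(.*?)"      # keep this text
        elif token == "{*}":
            regex += "(?:.*?)"    # ignore this text
        else:
            regex += re.escape(token)  # literal HTML from the pattern
    return re.compile(regex, re.DOTALL)

# Hypothetical markup: the image <div> sits between the title and the description.
html = ('<article><h3><a href="/news/one">First headline</a></h3>'
        '<div><img src="/img/one.jpg"></div>'
        '<div class="floatright">First summary.</div></article>')

pattern = template_to_regex(
    '<h3><a href="{%}">{%}</a></h3>{*}<div class="floatright">{%}</div>'
)
captured = pattern.search(html).groups()
print(captured)
```

The image markup is matched by {*} but never appears among the captured values, which mirrors how the image is left out of the feed.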

Up to this point, we have extracted the title and the link of the news. We still need to extract the body (description) of the news. We will add the description next.

The tag <div class="floatright"> is the tag containing the news body. We have added {%} to collect the description itself, and we have also closed the tag with </div>. In the image below you’ll see a tag with class="floatright" inside the tag “<div> </div>”, which corresponds to the description:

Now we have all we need: the title, the link and the text for the description of the news item. All the information that comes after this is irrelevant for us. To tell the module to skip it, we use the {*} wildcard again:

Important: as we have reached the end of the news item, we have to close the <article> tag.

Bear in mind that not all sites use the same tags. Each website has its own tag structure and tag names. However, the logic will be the same.

Remember that:

  • <tag> always opens a tag,
  • You can omit certain information and tags by using the wildcard {*},
  • The wildcard {%} is used to collect a string of characters, as in <a href="{%}"> or <tag> {%} </tag>,
  • Every time a tag is opened, it should be closed with </tag>,
  • When you have completed the description of the item, go to the footer of the form and click on “Update”.

The feed is ready, and we are almost ready to bring the news from this website to our ContentBreeze website:

In the image above, you’ll see that each “%” wildcard now has an identification number. {%1} corresponds to the link tag <a href>, {%2} corresponds to the article title and {%3} corresponds to the news description.
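
The numbering simply follows the order in which the {%} wildcards appear in the pattern, from left to right. The Python sketch below reproduces that numbering on an approximation of the pattern built in this tutorial; the exact text shown in the screenshots may differ, so treat the pattern string as an assumption.

```python
import itertools
import re

# Approximate pattern from this tutorial (an assumption, not copied from the admin).
template = ('<article><h3><a href="{%}">{%}</a></h3>{*}'
            '<div class="floatright">{%}</div>{*}</article>')

# Number each {%} in order of appearance; {*} wildcards are left untouched.
counter = itertools.count(1)
numbered = re.sub(r"\{%\}", lambda m: "{%" + str(next(counter)) + "}", template)
print(numbered)
```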

Final steps:

At the bottom of the page in the admin you will find a form that needs to be completed.

In the “RSS title” field, write the site’s title to identify it. In this case, we will call this internal feed “Nz Herald”.
In the “RSS Link” field, write the URL of the website you are extracting the news from. In the description, you can write other information to identify the website better.

In the last text areas, you need to add the following information:

• In “Item Title”, write the identification number of the title. In this case, the title of the story is identified with “%2”.
• In “Item URL”, write the identification number of the link. In this case, it will be “%1”.
• In “Item description”, write the identification number of the description of the news item. In this case, it will be “%3”.

After completing these fields, click the “Update” button:


Now your ContentBreeze account is ready to start bringing news through the internal feeds you just set up.
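
To picture what the resulting internal feed might contain, the Python sketch below builds a small RSS 2.0 document from captured values using the standard xml.etree.ElementTree module. The captured items, the example.com URLs and the channel description are invented for illustration, and the real feed generated by ContentBreeze may include other elements.

```python
import xml.etree.ElementTree as ET

# Captured values per news item, in wildcard order: {%1}, {%2}, {%3} (made-up data).
captured_items = [
    ("https://example.com/news/one", "First headline", "First summary."),
    ("https://example.com/news/two", "Second headline", "Second summary."),
]

# Form settings from this tutorial: which wildcard number fills each item field.
item_fields = {"title": 2, "link": 1, "description": 3}

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Nz Herald"
ET.SubElement(channel, "link").text = "https://www.nzherald.co.nz/"
ET.SubElement(channel, "description").text = "Internal feed for a site without RSS"

for captures in captured_items:
    item = ET.SubElement(channel, "item")
    for field, number in item_fields.items():
        # Wildcard numbers start at 1, list indexes at 0.
        ET.SubElement(item, field).text = captures[number - 1]

feed_xml = ET.tostring(rss, encoding="unicode")
print(feed_xml)
```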

Finish the configuration

What remains is to set up this website that has no RSS in the admin. In the “Web without RSS” option you will see the feed that has been generated. You only need to click the “File” button to display it in XML format.

Finally, add this site (NZ Herald) as a new website in the admin, just as you would add any other. In the Feed URL field, enter the URL of the feed you just created in this module.
