How to overcome the potential struggles when crawling?

Learn how to crawl images, videos

WP Content Crawler makes web crawling really easy, and yet some topics may need clarification. Our articles will help you tackle the problems. Feel free to contribute in the comments below.

What is lazy loading?

When was the last time you waited more than a few seconds for a website to load? You cannot remember, right? It is not shocking. The world moves fast, and so do you.

According to a study done by Akamai, users tend to abandon a site if it is not loaded within 3 seconds. Website developers try their best to reduce the load time of a website. No one likes missing out on traffic.

This is where lazy loading is utilized. Lazy loading is an optimization technique that postpones the loading of non-critical resources at page load time. Since images and videos take much more time to load than text, let’s start by investigating images.

Lazy Loading Images:

Our focus will be how to crawl the lazy loading images.

On any web page, right click anywhere on the page and click on “Inspect”. Under “Elements”, the HTML code of the page appears.

Activate the inspection tool from the top-left corner of the Inspect window.

The inspection tool in the Inspect window.

Click on an image to inspect its HTML code. Usually, images are placed with <img> elements. You will see something similar to this when you inspect the image item:

Attributes of <img> elements appear as src, srcset, data-src, data-srcset and so on.

Attributes such as data-src and data-srcset are the hint that we are looking at a lazy loading image.

Testing the placeholder image CSS Selectors with WP Content Crawler

Did you notice the “placeholder-img.jpg” in the image above? As the name suggests, the src value holds the placeholder, and the real link is kept in some other attribute. In the image above, it is the data-src value. Hence, the real image is not fetched on the first load of the website. JavaScript loads it later, when it is needed.

That is it.

What are the advantages of lazy loading?

  • It saves data. Without lazy loading, a visitor on a limited data plan downloads resources they may never get to see.
  • It saves processing time, battery, and system resources.

How to web crawl lazy loading images?

Pretty easy.

We will swap values of the attributes by using find and replace options in WP Content Crawler.

Manipulate HTML within WPCC
Exchange element attributes feature of WP Content Crawler

Basically, the src attribute values and the data-src attribute values of the img elements are exchanged. Notice how the placeholder has moved to data-src.

We have the link on src attribute on all of our images now. How cool is that?
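To make the swap concrete, here is a minimal Python sketch of the same idea outside of WP Content Crawler. It is not the plugin's implementation; the class and function names are made up for illustration, and it assumes a small HTML fragment (comments and doctype declarations are not re-emitted).

```python
from html import escape
from html.parser import HTMLParser


class AttributeSwapper(HTMLParser):
    """Re-emit an HTML fragment, exchanging two attribute values on <img> tags."""

    def __init__(self, first="src", second="data-src"):
        super().__init__(convert_charrefs=False)
        self.first, self.second = first, second
        self.out = []

    def _emit_tag(self, tag, attrs, closing=""):
        attrs = dict(attrs)
        # Swap only when the <img> carries both attributes.
        if tag == "img" and self.first in attrs and self.second in attrs:
            attrs[self.first], attrs[self.second] = attrs[self.second], attrs[self.first]
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs.items()
        )
        self.out.append(f"<{tag}{rendered}{closing}>")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def swap_lazy_attributes(html, first="src", second="data-src"):
    """Return the fragment with the two attribute values exchanged on every <img>."""
    parser = AttributeSwapper(first, second)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(swap_lazy_attributes('<img src="placeholder-img.jpg" data-src="real.jpg">'))
```

After the swap, src holds the real image link and data-src holds the placeholder, which is the state described above.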

Not happy until you see a video tutorial? Click here for your video guide about how to crawl lazy loading images.

How to crawl videos?

Crawling videos with WP Content Crawler is straightforward, too. Videos require a little bit of work though. Don’t be scared, it is not hard at all.

We will examine video crawling case by case, so you will know what to do when the time comes.

Keep in mind, you will configure this only once for a site. Then you will get to crawl all the videos you like!

iFrame tags:

First of all, what are iframes and what are they used for?

An iframe embeds a page from another website inside your own page. Think of it as a polite way of sharing: you can display the iframe, but you cannot reach the code within it. Websites use iframes like a gift package whose contents are determined by the source.

Another good thing about iframes is that you don’t have to keep the contents in your WordPress. It means people can watch YouTube videos from your website, but you will not be dealing with storage and optimization problems. It is cool.

However, iframes come with security risks. Hence, WordPress removes iframe code from the content even if it is included. That’s why iframe code must be turned into short codes to be displayed in your post. Below, you will find step-by-step explanations.

Let’s move on to how to crawl the videos.

We will use nytimes.com videos as an example.

NY Times uses iframes for its videos.

html video tag inspection
A video inspection at nytimes.com

If you open any video and inspect it, there are no useful video links for us to use. And yet, when you play the video and then right click it, a menu appears. Bingo!

Share menu of a video

We are being offered the video URL, embed code, and video ID. This is more than useful.

Get the embed code. You will receive an iframe.

embed code, iframe

Copy it and keep it somewhere.

From the same menu, get the video ID and copy it. Then search for the video ID in the inspection screen.

Number of appearance of video ID in HTML code.

The video ID appears 30 times in the HTML code. We will find a nice CSS selector that will get us the video ID every time!

asset id attribute with video ID content

Feeling lucky? You should be. We found a meta element whose name is asset_id. This is reliable enough for a CSS selector, since it will be loaded on every page with the related video ID that we need. Awesome.
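If you would like to check such a selector outside the browser, the lookup can be sketched in a few lines of Python. This is only an illustration of what the selector picks up, not how WP Content Crawler works internally. The function name and the sample ID in the usage line are hypothetical.

```python
from html.parser import HTMLParser


class MetaContentFinder(HTMLParser):
    """Collect the content attribute of every <meta> tag with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == self.name and "content" in attrs:
            self.values.append(attrs["content"])


def find_meta_content(html, name):
    """Return the content of the first matching <meta> tag, or None if absent."""
    finder = MetaContentFinder(name)
    finder.feed(html)
    finder.close()
    return finder.values[0] if finder.values else None


# Hypothetical page source; a real page would carry the real video ID.
sample = '<head><meta name="asset_id" content="100000001234567"></head>'
print(find_meta_content(sample, "asset_id"))
```

The tag, the name attribute, and the content attribute checked here correspond to the selector and attribute you will enter in the plugin later on.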

How to use iFrame inside WP Content Crawler?

The answer is by using “Custom Short Codes” under Post.

shortcodes in WPCC
Content Crawler -> Post -> Custom Short Codes

What are short codes in WordPress?

In WordPress, short codes are little pieces of code used to perform dedicated functions. Short codes are written within [brackets].

In WP Content Crawler, you will fill out 3 boxes for custom short codes.

WP content crawler short codes
Under “Custom Short Codes”, selector, attribute, short code

Selector: Remember the meta element with the asset_id name attribute? Time to use it. Our selector is [name="asset_id"].

Attribute of the selector: We are hunting for the video ID, which is kept in the content attribute of the meta element. Therefore, for the attribute we will just type “content”.

Short code without brackets: Name your short code. We will use this name in Templates. We will name it video-id. Do not use brackets here.

HERE is the tricky part.

WP Content Crawler presents: Options specifically for the selectors!

Options box specifically for the related selector

Click on the options as shown in the image. Go to Templates. Paste the embed code you copied earlier. Remember?

Custom Short Codes Selector-> Options -> Templates

We will get rid of the iframe for the reasons mentioned above. WP Content Crawler defines 2 short codes, [wpcc-iframe] and [wpcc-script], to replace iframes and scripts for us.

Transform the iframe to [wpcc-iframe] like in the image below. Do not skip any bracket or colon, this part is sensitive!

iframe to short code [wpcc-iframe] translation

Notice the videoid parameter in the link. This is where we will place a short code, so that even if the video ID changes we will still be able to put the right ID in the right place.

video_id is replaced with [wcc-item] short code

The [wcc-item] short code refers to the item found by your selector. And now our video-id short code is ready to use!

To put the short code we just prepared to use, place the [video-id] short code in the Templates -> Main Post Template space.
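Conceptually, the template works like a simple text substitution. The Python sketch below only illustrates that idea. The embed URL, the width and height values, and the exact attribute layout of [wpcc-iframe] are placeholders invented for this example, so copy the real ones from your own embed code.

```python
def render_template(template: str, found_value: str) -> str:
    """Replace every [wcc-item] in the template with the value found by the selector."""
    return template.replace("[wcc-item]", found_value)


# Hypothetical template; the URL and sizes are placeholders, not the real embed code.
TEMPLATE = (
    '[wpcc-iframe src="https://example.com/embed?videoid=[wcc-item]" '
    'width="480" height="321"]'
)

print(render_template(TEMPLATE, "100000001234567"))
```

Whatever ID the selector finds on a page ends up in the videoid parameter, so the same template works for every video on the site.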

Created short code placed under Templates

Then place the [wcc-main-content] short code under it. Curious how the default short codes work? Read here.

Do not forget to quick save (the green button at the bottom right) and publish your work.

Almost there. One last thing before we go into production. WP Content Crawler needs you to specify which domains are allowed for iframe and script short codes.

Under General Settings -> Post -> Short Codes, enter as many domains as you like.

You can use the wildcard character (*) here.

Here is how it works:

If you just enter nytimes.com, then no subdomain.nytimes.com is allowed.

If you enter *.nytimes.com, then every subdomain.nytimes.com is allowed.

You can read more about how to allow domains from here.
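The two rules above can be modelled with a short Python sketch. It assumes the wildcard behaves like a shell-style glob, which matches the description here but is not taken from the plugin's source code. The function name is hypothetical.

```python
from fnmatch import fnmatchcase


def is_domain_allowed(host, allowed_patterns):
    """Return True if the host matches at least one allowed pattern.

    Assumption: '*' matches any run of characters, as in shell globs.
    """
    host = host.lower()
    return any(fnmatchcase(host, pattern.lower()) for pattern in allowed_patterns)


# A plain entry matches only that exact domain.
print(is_domain_allowed("www.nytimes.com", ["nytimes.com"]))
# A wildcard entry matches the subdomains.
print(is_domain_allowed("www.nytimes.com", ["*.nytimes.com"]))
```

Under this model, *.nytimes.com does not match the bare nytimes.com, so you would list both entries if you need both.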

General Settings -> Short codes allowed domains examples for iframe and script

Save changes, and we are golden.

Time to test our work. Go to Tester. Select your Site, Test Type, and Test URL.

You can check out my settings below.

Created short code appearing at the test results

Notice how our short code appeared in the results. The Tester will not display the video itself, only the code. Hence, we will manually crawl the post just to see if it works.

Go to Tools -> Manual Crawling. Set the Site, the Category, and any Post URL that is set to crawl in your settings.

Tools -> Manual crawling

Click Crawl now. The crawling will finish in a few seconds. Click the link that appears on your screen and check it out!

Crawled video appears on our post. Done!

You can play around with your embed code to adjust the size and other features.

Understanding the lifecycle of events

Knowing the order in which your settings are applied will save you from a lot of confusion, especially when you are using short codes and find and replace settings. Check out the detailed explanation in our documentation.

How to get original content by web crawling?

Crawling is fun, but what about original content?

Even if the material on your website is inspired by other websites, there are ways to differentiate your crawled content. By doing so, you get to improve your search engine optimization and your Google ranking.

WP Content Crawler is integrated with the APIs of numerous translation and spinning services.

Below, you will find details about those services, mostly focused on pricing and features.

If you would like to know how to activate them, or need any other technical information, please refer to WP Content Crawler’s documentation:
Translation and spinning.

Let’s get started!

Automatic Translation Services

Take your API key, activate it in WP Content Crawler, and you are good to go!

Here are your options with brief explanations:

Google Cloud Translation API

What do you get for free: $10 worth of free usage per month is applied to billing accounts for both the Basic and Advanced editions of the Cloud Translation API.

How much: You are charged by the number of characters, at $20 per million characters. Whitespace counts as characters, and empty queries are charged as well.
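As a quick worked example of these numbers, the sketch below estimates a monthly bill from a character count. It uses only the $20 rate and the $10 monthly credit quoted above and assumes the credit is simply subtracted from the charge. The function name is made up, and the real invoice may be calculated differently.

```python
def google_translation_cost(characters: int,
                            rate_per_million: float = 20.0,
                            monthly_credit: float = 10.0) -> float:
    """Estimate the monthly cost in USD after the free usage credit is applied."""
    gross = characters / 1_000_000 * rate_per_million
    # The credit cannot push the bill below zero.
    return round(max(0.0, gross - monthly_credit), 2)


print(google_translation_cost(2_000_000))  # 40 USD of usage minus the 10 USD credit
print(google_translation_cost(400_000))    # usage below the credit costs nothing
```

In other words, under these assumptions the credit covers roughly the first 500,000 characters of each month.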

Basic Translation API includes:

  • Language detection
  • Text translation for NMT (Neural Machine Translation) and PBMT (Phrase-Based Machine Translation) general models

Advanced Translation API includes:

  • Everything from the Basic edition, plus text translation for AutoML (Auto Machine Learning) models

Don’t worry if you are not familiar with the models mentioned above. Your text will be translated anyway. 🙂

Source

Microsoft Translator Text API

What do you get for free: 2 million characters per month

How much: S1, the pay-as-you-go instance, costs $10 per million characters of standard translation.

Standard Translation includes:

  • Text Translation
  • Language Detection
  • Bilingual Dictionary
  • Transliteration

Custom Translation includes:

  • Translation : $40 per million chars of custom translation
  • Training : $10 per million source + target chars of training data (max. $300/training)
  • Custom model hosting : $10 per hosted custom translation model per region, per month

Microsoft also offers discounts for high volumes. They have S2, S3, S4, C2, C3, and C4 packages to suit your needs.

Package details are below:

S2:

  • Standard Translation: $2,055/month, 250M chars per month included. Overage: $8.22 per million chars
  • Custom Translation: S1 rates apply for custom translation, model training and hosting

S3:

  • Standard Translation: $6,000/month, up to 1B chars per month. Overage: $6 per million chars
  • Custom Translation: S1 rates apply for custom translation, model training and hosting

S4:

  • Standard Translation: $45,000/month, up to 10B chars per month. Overage: $4.50 per million chars
  • Custom Translation: S1 rates apply for custom translation, model training and hosting

C2:

  • Standard Translation
    • S1 rates apply for standard (non-custom) translation
  • Custom Translation
    • Translation: $2,055/month, up to 62.5M chars per month. Overage: $32.88 per million chars
    • Training: $8.22 per million source + target chars of training data (max. $300/training)
    • Hosting: $10 per hosted custom translation model per region, per month

C3:

  • Standard Translation
    • S1 rates apply for standard (non-custom) translation
  • Custom Translation
    • Translation: $6,000/month, up to 250M chars per month. Overage: $24 per million chars
    • Training: $6 per million source + target chars of training data (max. $300/training)
    • Hosting: $10 per hosted custom translation model per region, per month

C4:

  • Standard Translation
    • S1 rates apply for standard (non-custom) translation
  • Custom Translation
    • Translation: $45,000/month, up to 2.5B chars per month. Overage: $18 per million chars
    • Training: $4.50 per million source + target chars of training data (you are charged up to 30 million chars and free afterwards)
    • Hosting: $10 per hosted custom translation model per region, per month
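To see which standard translation package pays off at a given volume, the S-tier prices listed above can be put into a small Python sketch. It covers standard translation only, ignores the free monthly allowance, and assumes overage is billed linearly per million characters. The names are invented for this example.

```python
# (monthly base fee in USD, included characters, overage USD per million chars)
STANDARD_TIERS = {
    "S1": (0.0, 0, 10.0),  # pay-as-you-go: every character is billed
    "S2": (2055.0, 250_000_000, 8.22),
    "S3": (6000.0, 1_000_000_000, 6.0),
    "S4": (45000.0, 10_000_000_000, 4.5),
}


def monthly_cost(tier: str, characters: int) -> float:
    """Estimate the monthly standard translation cost for one tier."""
    base, included, overage_rate = STANDARD_TIERS[tier]
    extra = max(0, characters - included)
    return round(base + extra / 1_000_000 * overage_rate, 2)


def cheapest_tier(characters: int) -> str:
    """Return the tier with the lowest estimated cost for the given volume."""
    return min(STANDARD_TIERS, key=lambda tier: monthly_cost(tier, characters))


for tier in STANDARD_TIERS:
    print(tier, monthly_cost(tier, 300_000_000))
print("Cheapest:", cheapest_tier(300_000_000))
```

For 300 million characters a month, S1 comes to $3,000 and S2 to $2,466, so the bigger package starts to pay off somewhere above 200 million characters.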

Yandex Translate API

What do you get for free: Up to 1 million characters per 24 hours, and up to 10 million characters per month

How much: The rate depends on the number of characters in the requests for the reporting period. Rates are in US dollars per 1 million characters:

  • Up to 50 000 000 characters: $15
  • From 50 000 001 to 100 000 000 characters: $12
  • From 100 000 001 to 200 000 000 characters: $10
  • From 200 000 001 to 500 000 000 characters: $8
  • From 500 000 001 to 1 000 000 000 characters: $6
Yandex Translation API Pricing Source

A cost calculation example:

If you translated 55 million characters in a given reporting period:

  • For the first 50 million characters, the rate is $15 per million characters
  • For the remaining 5 million characters, the reduced rate of $12 per million characters applies

Hence, total would be: (50*$15) + (5*$12) = $810
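The same calculation can be written as a short Python sketch that walks through the rate tiers listed above. The function name is made up. The free allowance is not modelled (the example above ignores it as well), and volumes beyond 1 billion characters are rejected because no rate is quoted for them here.

```python
# (upper bound of the tier in characters, USD per 1 million characters)
YANDEX_TIERS = [
    (50_000_000, 15),
    (100_000_000, 12),
    (200_000_000, 10),
    (500_000_000, 8),
    (1_000_000_000, 6),
]


def yandex_cost(characters: int) -> float:
    """Estimate the cost in USD, charging each slice of volume at its own tier rate."""
    if characters > YANDEX_TIERS[-1][0]:
        raise ValueError("No rate is listed above 1 billion characters")
    cost = 0.0
    lower = 0
    for upper, rate in YANDEX_TIERS:
        if characters <= lower:
            break
        # Only the part of the volume that falls inside this tier is billed at its rate.
        billable = min(characters, upper) - lower
        cost += billable / 1_000_000 * rate
        lower = upper
    return round(cost, 2)


print(yandex_cost(55_000_000))  # 810.0, as in the example above
```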

Source

Amazon Translate API

What do you get for free: Amazon offers a Free Tier for translation services. The free tier is available for 12 months, and the limit is 2 million characters per month. If you exceed the limit, you pay the standard pay-as-you-go rates.

How much: $15 per million characters

Source

Automatic Spinning Services:

Spinning is basically paraphrasing: changing the wording of a text while keeping its meaning the same. This is generally used for search engine optimization.

You can use some of the most famous spinner services with WP Content Crawler: SpinRewriter API and Turkce Spin API.

Spin Rewriter

All their plans cover:

  • Unlimited articles
  • ENL Spinning Algorithm
  • Bulk Spinning
  • Mass Export

Payments are as below:

  • Monthly: $47
  • Lifetime, single payment: $497
  • Yearly, at a 60% discount right now: free for 5 days, then $77 a year, which comes with 2 bonuses: a video module and 10 free seed articles.

They also offer a 30-Day Money Back Guarantee.

You can register for Spin Rewriter through the link in WP Content Crawler.

Source

You can register for Spin Rewriter from here or by using the link under General Settings of WP Content Crawler.

Turkce Spin

This spinner is for the Turkish language.

All plans include:

  • Content integrity
  • Customization
  • API Support
  • Root and appendix support
  • No word limit

Plans:

  • SX-Bronze: 100 Articles per day, 30 TL per month
  • SX-Silver: 200 Articles per day, 50 TL per month
  • SX-Gold: 300 Articles per day, 75 TL per month
  • SX-Platinum: 500 Articles per day, 30 TL per month

Source

Disclaimer: All the information here is based on third parties’ websites and is subject to change. WP Content Crawler cannot be held responsible for any changes regarding the translation and spinner services.

We are more than happy to hear your feedback. Let us know what you think in the comments below!
