Day 8 - Scraping data

Sep 07, 2015 at 09:12 am, by Pierre Liard

Downloading content from the web can be accomplished in two different ways: by showing the website directly in the app as a regular browser would, or by downloading only its content in order, for example, to extract some strings or numbers. The second option seems better considering the time it may take to load a complex website, and the fact that an Internet connection may not be available at all times. With the content simply scraped, most of the app would be usable offline.

A website's URL is a specific type of string. To be used as a URL in Swift, it needs to be converted with the help of the class NSURL. The result then represents a URL that can, per the Apple Developer Library, potentially contain the location of a resource on a remote server, the path of a local file on disk, or even an arbitrary piece of encoded data.

let url = NSURL(string: "")! 
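As a rough sketch of those three kinds of URL, here is how they could look in current Swift, where the counterpart of NSURL is called URL (the lectures use Swift 2, so the spelling there differs). The addresses, the file path and the helper kind(of:) are made-up placeholders of mine, not something taken from the course.

```swift
import Foundation

// All three addresses below are made-up placeholders, not from the lectures.

// 1. A resource on a remote server. URL(string:) returns an optional
//    because the string may not be a valid URL.
let remoteURL = URL(string: "https://example.com/index.html")

// 2. The path of a local file on disk.
let fileURL = URL(fileURLWithPath: "/tmp/notes.txt")

// 3. An arbitrary piece of encoded data ("Hello" in Base64).
let dataURL = URL(string: "data:text/plain;base64,SGVsbG8=")

// Hypothetical helper: the scheme tells the three kinds apart.
func kind(of url: URL) -> String {
    if url.isFileURL { return "local file" }
    switch url.scheme ?? "" {
    case "http", "https": return "remote resource"
    case "data": return "encoded data"
    default: return "other"
    }
}

// Print each URL that could be built, together with its kind.
for url in [remoteURL, fileURL, dataURL].compactMap({ $0 }) {
    print(kind(of: url), url.absoluteString)
}
```

Checking the optional instead of force-unwrapping it with "!" avoids a crash when the string turns out not to be a valid URL.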

To see the website in the app, a Web View element needs to be dragged from the Object Library into the storyboard and linked as an outlet to ViewController.swift. The method loadRequest, with an NSURLRequest as argument, will then load the URL directly into the app.

webView.loadRequest(NSURLRequest(URL: url)) // webView is the variable's name for the outlet

The second method is more complicated, and Rob's tutorial (lecture 59) does a much better job of explaining it than I would. It is worth mentioning, however, that a plain http request (but not an https one) gets blocked as unsafe; this is the work of App Transport Security, introduced with iOS 9. To override this limitation, a file called "Info.plist" needs to be configured. It is found in the left panel near the end:


It contains a series of configuration settings for the app. I wouldn't advise modifying them unless you're sure of what you're doing. To allow an http request, an entry named "App Transport Security Settings" needs to be added, with "Allow Arbitrary Loads" set to YES. It is shown in the blue rectangle in the above shot. Once this modification is done, the website's content should be visible, but we are in for a surprise:

encoded data

It is not really what you would expect: a series of letters and numbers. It appears this way because the URL's content arrives encoded as raw data. It is therefore necessary to convert it into an encoding that humans can read, which for a website is the standard UTF-8. It will then return the correct response, this time showing the web content and not the raw URL content:

decoded data
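To make the difference between the two printouts concrete, here is a minimal sketch in current Swift (not the Swift 2 of the lectures). The sample HTML stands in for what a download would return, and the helpers hexDump and decodeUTF8 are names I made up for the illustration.

```swift
import Foundation

// Made-up stand-in for the bytes a download would hand back.
let sampleHTML = "<html><body><h1>Hello</h1></body></html>"
let rawData = Data(sampleHTML.utf8)

// Hypothetical helper: shows the raw bytes as hexadecimal,
// which is roughly the "letters and numbers" view.
func hexDump(_ data: Data) -> String {
    return data.map { String(format: "%02x", $0) }.joined()
}

// Hypothetical helper: decodes the bytes as UTF-8.
// Returns nil when the bytes are not valid UTF-8.
func decodeUTF8(_ data: Data) -> String? {
    return String(data: data, encoding: .utf8)
}

print(hexDump(rawData))                          // unreadable bytes
print(decodeUTF8(rawData) ?? "not valid UTF-8")  // readable HTML
```

The decoding step returns an optional, so a page that is not really UTF-8 ends up as nil rather than as garbage text.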

What is printed is actually the HTML code making up the website. To display it in the app, a Web View, the same as before, and the method loadHTMLString are needed to render this HTML code in a more readable layout. What is shown is simply a regular webpage without any CSS:

readable content

This was the introduction to the final app of the content-rich chapter 5: What's The Weather?. It will allow a user to enter a city and find the weather forecast for that particular location. To build this app, Rob is using a specific website, weather-forecast, for two reasons: when a city's name is typed in, it generates a URL containing that name, and it also offers a short summary of the weather conditions.
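Turning a typed city name into such a URL could be sketched as follows, in current Swift. The host example.com, the path layout and the function name forecastURL(forCity:) are placeholders I made up; the real pattern has to be read off the website itself. Replacing spaces with hyphens is likewise an assumption.

```swift
import Foundation

// Hypothetical helper: builds a forecast URL from what the user typed.
// The host and path layout are placeholders, not the real site's pattern.
func forecastURL(forCity city: String) -> URL? {
    // Trim stray spaces and join multi-word names with hyphens (assumed convention).
    let slug = city
        .trimmingCharacters(in: .whitespacesAndNewlines)
        .replacingOccurrences(of: " ", with: "-")

    // Reject empty input and escape characters that are not allowed in a path.
    guard !slug.isEmpty,
          let encoded = slug.addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)
    else { return nil }

    return URL(string: "https://example.com/locations/\(encoded)/forecast")
}

if let url = forecastURL(forCity: "San Francisco") {
    print(url.absoluteString)
}
```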

I don't think I have mentioned it yet, but every tutorial comes with a discussion forum where students can ask questions. If I cannot solve a problem by myself, I look through it quickly to see whether there is already a solution to my question. In this specific lecture (63), a student whose name is Michael expresses his opinion about the weather app in these words: "Rob, I really...really don't like the 'Current Conditions' app project! Scraping a website is completely frowned upon in the iOS community for numerous reasons: websites are dynamic so the content could change, and once it does, the app will likely break. There's legal issues concerning what could be stripped and put into an app, and finally, learning how to use a simple open-source framework is so much better in terms of abstraction, composition, future-proofing, readability, and not to mention use of knowledge." It may seem excessive, but it raises a fundamental question: to what extent is it ethical to download data from another source? Arguing that it is permissible because it is only an exercise is without merit, because the exercise may be taken as a pretext for other, less open operations.

Whatever the case for scraping data is, the same question remains: is it ethical? Is it right, under the cover of anonymity and of scientific study, to use personal data? Is it acceptable that Google or especially Facebook use our data without our knowledge for business or other less respectable motives? Rob evades this considerable hurdle by responding that an HTML scraper is not the best way to get data from another source. Better methods (namely APIs) are covered later, but this is an exercise that puts a number of individually useful concepts together into a working project. Indeed, but as far as the problem at stake is concerned, no real response is given, probably because it goes beyond the simple technical aspect expected in a Swift 2 tutorial! I don't personally condone the use of data without the express authorization of their owner, but I have a more moderate view when those data can be considered to be in the public domain. Weather forecast data are just an aggregation of different pieces of information put together to make sense of weather patterns or to predict them. Using them doesn't seem harmful to anybody, but I may also be wrong. In any case, and whatever your take on this issue, downloading data from another source has to be considered very seriously, and case by case.

The creation of the app layout doesn't present any particular difficulty, and is actually relatively simple: three labels, one text field and one button. Nothing really exciting for now, but the serious work will start tomorrow with the app coding. It looks to me as if the main difficulty will be to find the required data, to extract them into our app, and finally to convert them to make them readable and understandable by the user. I mentioned at the beginning of this post that the class NSURL can also be used for the path to local data or even an arbitrary piece of encoded data. This means that this class is not limited to Internet URLs, but extends to any kind of path pointing to data that can be searched for and retrieved. It consequently goes far beyond scraping data from a website and reaches into the realm of database use. It's surely in this way that this app has to be understood. What indeed would modern computing be without data and without their exploitation?
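To give an idea of what finding and extracting the data might involve, here is a minimal sketch in current Swift that pulls out the text between two markers. It is my own guess and not necessarily how the lecture does it; the function extractText, the sample page and the markers are all made up.

```swift
import Foundation

// Hypothetical helper: returns the text between the first `start` marker
// and the next `end` marker, or nil when either marker is missing.
func extractText(from html: String, between start: String, and end: String) -> String? {
    guard let startRange = html.range(of: start) else { return nil }
    let rest = html[startRange.upperBound...]
    guard let endRange = rest.range(of: end) else { return nil }
    return String(rest[..<endRange.lowerBound])
}

// Made-up page and markers; a real page has its own structure.
let samplePage = "<html><body><span class=\"summary\">Light rain, 14 degrees</span></body></html>"
let summary = extractText(from: samplePage,
                          between: "<span class=\"summary\">",
                          and: "</span>")

print(summary ?? "summary not found")
```

The nil case is exactly the fragility Michael points out: as soon as the website changes its markup, the markers no longer match and nothing is extracted.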

Because the lecture is fairly long (almost 37 minutes), I stopped the video just before the coding starts. It was the right time to take a break before the real task at hand. In the meantime, here is what the app layout looks like:

weather forecast layout
