Social Media Crawling
Social media platforms provide APIs that return ‘raw’ data without the surrounding page layout, which makes processing the data much easier. However, these APIs differ significantly from the typical crawler workflow and therefore need special consideration.
In our current data model we distinguish between resources and sources.
A resource is a collection of information about an entity, such as the profile information about a user on a social media platform.
- post: A text or a reference to a multimedia object posted to the platform. A post typically has additional meta data, including the name or a reference to the user creating the post and the time when the post was created. Examples: a tweet, a Facebook status, a Pinterest pin
- profile: Information provided by a user about themselves. This excludes the list of friends/followers/… which is described by relations. Examples: name, short bio, country, date of birth, …
- relations: The list of users a user has selected on the platform. This includes lists of friends, followers and followees. If there are multiple such lists (e.g. in Twitter: followers and followees), then all lists should be retrieved and stored in a way that makes it possible to distinguish between them. This can be done either by storing the lists in separate documents or through the document format.
TODO: distinguish between in-relations and out-relations?
A source is an entry point to an API that returns a collection of resources. For example, a search query is a source that returns the posts containing the query term. Each source has an argument that further specifies the requested resources.
- query: A keyword query. The parameter is the query string. The semantics of the query should be those of the underlying platform.
- user: One specific user. The parameter is the unique user identifier as provided by the platform.
- related: The users having a relation on the platform to a reference user. The parameter is the unique user identifier of the reference user as provided by that platform.
- location (optional): A geo-query. The parameter is interpreted according to the semantics of the platform.
Interpretation of Source/Resource pairs
|source||post||profile||relations|
|location||posts made near the location||profiles of users in that location||—|
|related||posts by the users related to the reference user||profile information of users related to the reference user||“friends-of-friends” of the reference user|
|user||posts by the user||profile information of the user||relations of the user|
|query||posts matching the query||user profiles matching the query||list of relations of users matching the query|
An API request can have multiple resource and source types. For resources, the union of the resource types is requested, such that a request for post,profile,relations with a source type of user retrieves all information available about that user. In the case of sources, the intersection of the sources is requested. For example, when requesting posts of the sources user=foo and query=bar, all posts by user foo that contain the keyword bar are requested. The union of two sources can be emulated by creating one API request for each source. (The intersection of resource types does not make sense.)
API requests can additionally have parameters to restrict or modify the set of resources retrieved. We define some commonly used parameters; implementations MAY add additional parameters.
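The intersection semantics for sources can be illustrated with a small in-memory sketch. It is only an illustration under assumptions: the class `SourceIntersection`, the `Post` type and the two predicate helpers are hypothetical, and a real fetcher would normally push these conditions into the platform API call rather than filter locally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration; none of these names are part of the specification.
final class SourceIntersection {
    // Minimal stand-in for a post resource.
    static final class Post {
        final String user;
        final String text;
        Post(String user, String text) { this.user = user; this.text = text; }
    }

    // Source 'user=<userId>': posts created by that user.
    static Predicate<Post> byUser(String userId) {
        return p -> p.user.equals(userId);
    }

    // Source 'query=<keyword>': here simplified to a substring match.
    static Predicate<Post> byQuery(String keyword) {
        return p -> p.text.contains(keyword);
    }

    // Several sources in one request: a post is selected only if ALL sources match.
    static List<Post> select(List<Post> posts, List<Predicate<Post>> sources) {
        List<Post> result = new ArrayList<>();
        for (Post p : posts) {
            if (sources.stream().allMatch(s -> s.test(p))) {
                result.add(p);
            }
        }
        return result;
    }
}
```

Calling `select` with `byUser("foo")` and `byQuery("bar")` keeps only the posts by foo that contain bar. The union of the two sources would correspond to two separate calls whose results are merged.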
- limit: Takes a positive integer argument. Restrict the number of retrieved resources to at most the given number. In the absence of this parameter implementations can choose to retrieve all resources or at most a fixed number as defined by the implementation or the platform API.
- stream: Has no argument. Request to fetch data using the streaming API. Implementations may choose to silently ignore this parameter.
- since: Takes a date argument. Restrict the resources to those created or modified after the given date.
- before: Takes a date argument. Restrict the resources to those created or modified before the given date.
The date argument of since and before MUST be a date or a datetime as specified by ISO 8601, for example 2014-01-31, 2014-01-31T12:34:56+01:00.
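A minimal sketch of how a fetcher might parse the since and before arguments with `java.time`. The class name `DateParam` is hypothetical, and interpreting a plain date as midnight UTC is an assumption; the specification does not say which time zone applies to dates without an offset.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;

// Hypothetical helper; the class and method names are not part of the specification.
final class DateParam {
    private DateParam() {}

    // Accepts an ISO 8601 datetime with offset (2014-01-31T12:34:56+01:00)
    // or an ISO 8601 date (2014-01-31).
    // Assumption: a plain date means the start of that day in UTC.
    // Throws DateTimeParseException if the value matches neither form.
    static Instant parse(String value) {
        try {
            return OffsetDateTime.parse(value).toInstant();
        } catch (DateTimeParseException notADateTime) {
            return LocalDate.parse(value).atStartOfDay(ZoneOffset.UTC).toInstant();
        }
    }
}
```

Converting both forms to an `Instant` makes since and before directly comparable with creation timestamps, whatever offset the request used.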
Most APIs require an authentication token to access the data. A user-provided token will be passed to the API crawlers as the configuration parameter $platform.token (e.g. twitter.token). Later in the project we will provide hooks to allow the user to automatically generate a new token.
Document boundaries and identification
An API request can return collections of documents, where the individual documents can be as interesting as the collection, or even more so. For example, the result of a keyword search in Twitter is a collection of tweets. Here the individual tweets are more useful in isolation than in the context of the result list. For this reason the sub-documents should be extracted and stored as individual documents.
Each document needs to have a URI. The URI SHOULD have the scheme $platform: for centralized platforms (e.g. for Twitter twitter:) or $platform:// for distributed platforms (e.g. for RSS rss://). Where possible, the rest of the URI should follow the URL scheme of the website, for example a Tweet found at https://twitter.com/arcomem/status/397383573931450368 should be identified as twitter:arcomem/status/397383573931450368.
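The mapping from a website URL to a document URI for a centralized platform could look like the following sketch. The class `DocumentUris` and its method are hypothetical. Dropping the query string and fragment of the web URL is an assumption, and distributed platforms (the `$platform://` form) are not handled here.

```java
import java.net.URI;

// Hypothetical helper; only the resulting URI form is taken from the text.
final class DocumentUris {
    private DocumentUris() {}

    // Builds "$platform:<path>" from the web URL of a document on a centralized platform.
    // Assumption: only the path of the web URL is kept; query string and fragment are dropped.
    static String fromWebUrl(String platform, String webUrl) {
        String path = URI.create(webUrl).getRawPath();
        if (path == null) {
            throw new IllegalArgumentException("not a hierarchical URL: " + webUrl);
        }
        if (path.startsWith("/")) {
            path = path.substring(1); // the scheme separator ':' replaces the leading '/'
        }
        return platform + ":" + path;
    }
}
```

For example, `DocumentUris.fromWebUrl("twitter", "https://twitter.com/arcomem/status/397383573931450368")` yields `twitter:arcomem/status/397383573931450368`.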
Some social media platforms provide a “streaming” API that provides updates in realtime. In the context of crawling this gives us the additional value of discovering new URLs very quickly.
API fetchers for streaming APIs are started in a separate long-running process. They should continue running until they are requested to stop through a call to their stop() method. In the case of a temporary connection loss the implementation SHOULD try to re-connect to the API as soon as possible. If there is a permanent problem (e.g. the API has blacklisted the crawler’s IP address), the fetcher MAY throw an IllegalStateException.
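The lifecycle described above can be sketched as a reconnect loop. Only `stop()` and the use of `IllegalStateException` come from the text. The interfaces `StreamConnection` and `StreamConnector`, the failure threshold and the retry delay are assumptions made for illustration, and treating several consecutive failed connection attempts as a permanent problem is just one possible heuristic.

```java
import java.io.IOException;
import java.util.function.Consumer;

// Hypothetical stand-ins for a platform's streaming client.
interface StreamConnection {
    String next() throws IOException; // blocks until the next item; IOException on connection loss
}

interface StreamConnector {
    StreamConnection connect() throws IOException;
}

final class StreamingFetcher implements Runnable {
    private final StreamConnector connector;
    private final Consumer<String> sink;
    private final int maxConsecutiveFailures;
    private final long retryDelayMillis;
    private volatile boolean stopped = false;

    StreamingFetcher(StreamConnector connector, Consumer<String> sink,
                     int maxConsecutiveFailures, long retryDelayMillis) {
        this.connector = connector;
        this.sink = sink;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.retryDelayMillis = retryDelayMillis;
    }

    // Requests termination; takes effect before the next item is read.
    public void stop() {
        stopped = true;
    }

    @Override
    public void run() {
        int failures = 0;
        while (!stopped) {
            try {
                StreamConnection connection = connector.connect();
                failures = 0; // a successful connect resets the failure counter
                while (!stopped) {
                    sink.accept(connection.next());
                }
            } catch (IOException e) {
                failures++;
                if (failures > maxConsecutiveFailures) {
                    // Assumption: repeated failures are treated as a permanent problem.
                    throw new IllegalStateException("giving up after repeated connection failures", e);
                }
                try {
                    Thread.sleep(retryDelayMillis); // short pause, then reconnect
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}
```

In this sketch a call to `stop()` does not interrupt a blocked `next()`; a real implementation would probably also close the underlying connection.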
For a standard resource request the fetcher should retrieve the requested resources and extract documents as described above. When the crawler requests the fetch, it also passes a Context object to the fetcher. The extracted documents should be passed back to the crawler using the methods of that object, which ensures that the documents are associated with the correct crawl.
The API fetchers MUST ensure that they conform to the rules of the API (especially the maximum number of allowed requests in any given time period) and to normal crawler politeness rules. The crawler aids them by making sure that there is always at most one instance of the fetcher running per host.
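One simple way for a fetcher to stay within a request quota is a fixed-window counter, sketched below. The class `RequestBudget` is hypothetical and the window model is an assumption: actual quota rules differ per platform, and a fixed window can allow short bursts at window boundaries. The caller passes the current time in, which keeps the sketch deterministic.

```java
// Hypothetical fixed-window request counter; not part of the specification.
final class RequestBudget {
    private final int maxRequests;
    private final long windowMillis;
    private long windowStart;
    private int used = 0;

    RequestBudget(int maxRequests, long windowMillis, long nowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
        this.windowStart = nowMillis;
    }

    // Returns 0 if a request may be sent now (and counts it),
    // otherwise the number of milliseconds to wait until the next window starts.
    synchronized long tryAcquire(long nowMillis) {
        if (nowMillis - windowStart >= windowMillis) {
            windowStart = nowMillis; // a new window begins
            used = 0;
        }
        if (used < maxRequests) {
            used++;
            return 0;
        }
        return windowStart + windowMillis - nowMillis;
    }
}
```

A fetcher would call `tryAcquire(System.currentTimeMillis())` before each API request and sleep for the returned duration when it is non-zero.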
Scheduling of new URIs for later Crawling
If the API fetcher encounters (in the content or the metadata) a link to an external website or to further resources on the platform outside the scope of the request, it SHOULD pass that link to the crawler using the writeOutlink method of the Context object. In the case of internal links the fetcher SHOULD try to provide the Web URL and also rewrite the URI of the current request to match the linked resource.
When the resources contain URLs that will immediately redirect to another URL (e.g. shortened URLs such as bit.ly, t.co, …) and the target URL is also provided, the fetcher should not call writeOutlink for either of the URLs, but call the method writeRedirect instead.
URIs returned in this way go through the usual iCrawl processing, i.e. they are prioritized according to their relevance for the crawl. API fetchers should therefore report links this way instead of following them recursively.
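The outlink and redirect rules can be sketched as follows. Only the method names writeOutlink and writeRedirect come from the text. The interface `LinkContext` is a hypothetical stand-in for the Context object, and the parameter lists shown here are assumptions, as is the map of short URLs to their targets.

```java
import java.util.List;
import java.util.Map;

// Assumed shape of the relevant Context callbacks; the signatures are not taken from the specification.
interface LinkContext {
    void writeOutlink(String uri);
    void writeRedirect(String shortUrl, String targetUrl);
}

final class LinkReporter {
    private LinkReporter() {}

    // links: URLs found in the content or metadata of a resource.
    // expanded: short URL -> target URL, for those links where the API also provides the target.
    static void report(List<String> links, Map<String, String> expanded, LinkContext context) {
        for (String link : links) {
            String target = expanded.get(link);
            if (target != null) {
                // Known redirect: report the pair once instead of two outlinks.
                context.writeRedirect(link, target);
            } else if (expanded.containsValue(link)) {
                // Target of a reported redirect: no separate outlink.
                continue;
            } else {
                context.writeOutlink(link);
            }
        }
    }
}
```

The fetcher never follows any of these links itself; it only reports them, so that the crawler can prioritize them.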
All requests for a given API are specified using special URIs. Each API has a unique URI scheme. For centralized platforms the scheme should be the lowercase name of the platform such that e.g. all requests for the Twitter API start with twitter:. For decentralized platforms the host name of a specific node needs to be specified, for example requests for the Diaspora node Geraspora would start with diaspora://pod.geraspora.de/.
The request URIs have the following form:
uri = scheme 'q/' resources '/' sources ( '?' params )?
scheme = platform ':' | platform '://' host '/'   # e.g. 'twitter:', 'diaspora://pod.geraspora.de/'
resource = 'post' | 'profile' | 'relations'
resources = resource ( ',' resource )*   # e.g. 'post', 'profile,relations'
source = 'query' '=' $query | 'user' '=' $userId | 'relations' '=' $userId | 'location' '=' $locationId
sources = source ( ',' source )*   # e.g. 'user=foo', 'user=foo,relations=foo'
param = $key ( '=' $value )?
params = param ( ',' param )*   # e.g. 'limit=100', 'since=2012-12-01,stream'
- twitter:q/post/user=foo?limit=200 retrieves the last 200 tweets by user ‘foo’
- pinterest:q/relations/user=foo,relations=foo retrieves the friends and the friends of friends of Pinterest user ‘foo’
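A parser for request URIs of this form might look like the sketch below. The class `ApiRequest` and its fields are hypothetical. The sketch assumes that values contain no unescaped ',', '/' or '?' characters, because the grammar does not define an escaping rule. It treats the parameter part as optional, validates only the resource types, and does not check source keys.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parsed form of a request URI; not part of the specification.
final class ApiRequest {
    private static final List<String> RESOURCE_TYPES = Arrays.asList("post", "profile", "relations");

    final String platform;
    final String host;                 // null for centralized platforms
    final List<String> resources;
    final Map<String, String> sources;
    final Map<String, String> params;  // parameters without a value (e.g. 'stream') map to ""

    private ApiRequest(String platform, String host, List<String> resources,
                       Map<String, String> sources, Map<String, String> params) {
        this.platform = platform;
        this.host = host;
        this.resources = resources;
        this.sources = sources;
        this.params = params;
    }

    static ApiRequest parse(String uri) {
        int colon = uri.indexOf(':');
        if (colon <= 0) throw new IllegalArgumentException("missing scheme: " + uri);
        String platform = uri.substring(0, colon);
        String rest = uri.substring(colon + 1);
        String host = null;
        if (rest.startsWith("//")) {   // distributed platform: '//' host '/'
            int slash = rest.indexOf('/', 2);
            if (slash < 0) throw new IllegalArgumentException("missing '/' after host: " + uri);
            host = rest.substring(2, slash);
            rest = rest.substring(slash + 1);
        }
        if (!rest.startsWith("q/")) throw new IllegalArgumentException("missing 'q/': " + uri);
        rest = rest.substring(2);
        int questionMark = rest.indexOf('?');
        String path = questionMark < 0 ? rest : rest.substring(0, questionMark);
        String paramPart = questionMark < 0 ? "" : rest.substring(questionMark + 1);
        int slash = path.indexOf('/');
        if (slash <= 0) throw new IllegalArgumentException("expected resources '/' sources: " + uri);
        List<String> resources = Arrays.asList(path.substring(0, slash).split(","));
        for (String resource : resources) {
            if (!RESOURCE_TYPES.contains(resource)) {
                throw new IllegalArgumentException("unknown resource type: " + resource);
            }
        }
        return new ApiRequest(platform, host, resources,
                keyValues(path.substring(slash + 1)), keyValues(paramPart));
    }

    // Splits 'a=1,b=2,flag' into an ordered map; items without '=' get an empty value.
    private static Map<String, String> keyValues(String list) {
        Map<String, String> map = new LinkedHashMap<>();
        if (list.isEmpty()) return map;
        for (String item : list.split(",")) {
            int eq = item.indexOf('=');
            if (eq < 0) map.put(item, "");
            else map.put(item.substring(0, eq), item.substring(eq + 1));
        }
        return map;
    }
}
```

With this sketch, `ApiRequest.parse("twitter:q/post/user=foo?limit=200")` gives the platform twitter, the resource post, the source user=foo and the parameter limit=200.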