\chapter{Attribution License Violations Validator}
\label{validator}

%The goal of this tool is to check whether a particular web page given in any site has embedded Flickr images which violates the original license terms. 

\section{Checking for Attribution License Violations}

When someone aggregates content from many different sources, some attribution details will inevitably be forgotten by accident. The Attribution License Violations Validator checks whether the user has properly cited each source by giving due attribution to the original content creator. In other words, it is essentially a tool to help an honest person remain honest when reusing content on the Web. 

%An important consideration on the design of this tool was not to focus on the more restrictive digital rights management (DRM) approach. It was also not meant to be entirely focused on Creative Commons (CC) either. But in the current implementation it is based on CC licenses with possible extensions to scenarios modeled in Policy Languages such as ``Accountability In RDF'' (AIR) \cite{air}. 

According to the CC user survey \cite{nc_user_study}, among content types such as photos, text, blogs, online journals, videos, songs, games, mash-ups and podcasts, photos appear to be the most common type of work created on the Web. In terms of redistribution and reuse, sending content via email, posting it on a social networking site, and posting it on a blog or Web site run by someone else appear to be very popular. Therefore, as a proof of concept, we implemented this tool to pinpoint attribution license violations involving Flickr photos used in composite works published as HTML pages on the Web. To make sure that no CC license terms are violated, the author of a composite work can run the Attribution License Violations Validator and see whether any sources have been left out or misattributed. 

Once the user gives the URI where the composite work can be found, the site crawler will search for all the links embedded in the given site and filter out any embedded Flickr photos. From each of these Flickr photo URIs, it is possible to glean the Flickr photo id. Using this photo id, all the information related to the photo is obtained by calling several methods in the Flickr API. This information includes the original creator's Flickr user account id, name and CC license information pertaining to the photo. 

If a Flickr photo has a CC license attached, it should be given proper attribution regardless of the purpose for which it is used, as Flickr is still using the older CC 2.0 recommendation (as of April 2009). Therefore, when the tool determines that a Flickr photo on a particular page has a CC license, it checks for attribution information, which may be the \emph{attributionName}, the \emph{attributionURL}, the source URI, or any combination of these, within a reasonable scope around the DOM element in which the image is embedded. `Reasonable scope' is defined as somewhere within the parent or the sibling nodes in the DOM. If such information is missing, the user is presented with the original content creator's name, the URI, and the license the photo is under, which enables the user to compose the XHTML required to attribute the sources properly. 
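
As an illustration of the XHTML a user might compose from the reported details, the following Python sketch builds an attribution snippet. The function name \texttt{attribution\_xhtml}, the RDFa markup pattern (built around the \emph{attributionName} and \emph{attributionURL} properties) and the license URI passed in by the caller are assumptions made for illustration; they are not taken from the validator's implementation.

\begin{verbatim}
from xml.sax.saxutils import escape, quoteattr


def attribution_xhtml(owner_name, attribution_url, license_url):
    """Return an XHTML fragment crediting the photo's creator.

    Hypothetical helper: the markup is one plausible form of
    attribution, not the output of the validator itself.
    """
    return (
        '<span xmlns:cc="http://creativecommons.org/ns#">Photo by '
        '<a rel="cc:attributionURL" property="cc:attributionName" '
        'href=%s>%s</a> / <a rel="license" href=%s>license</a></span>'
        % (quoteattr(attribution_url),   # quoted, escaped attribute
           escape(owner_name),           # escaped element text
           quoteattr(license_url))
    )
\end{verbatim}

Escaping the creator's name and the URIs keeps the fragment well-formed even when they contain characters such as an ampersand.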

\section{Design and Implementation of the System}

This tool has four major components as shown in Figure \ref{fig-validator-design}.

\begin{figure}[!h]
  \centerline{\epsfig{file=images/validator.png, height=4in}}
  \caption{The Design of the Validator}
  \label{fig-validator-design}
\end{figure}

\subsection{Site Crawler} 

The Site Crawler searches for all the links embedded in the given Web site, starting from the document at the URI input by the user. It uses a breadth-first search and determines whether any Flickr photos are embedded. For both safety and efficiency, it does not stray outside the Web site; it first examines the given page for embedded Flickr images and then follows links to other pages within the same site. To follow these links, the crawler parses the HTML and identifies links to other resources, adds them to a `to-visit' queue, and repeats the process with the first item from that queue. As each link is checked, any new links found are appended to the same queue. An `already-viewed' queue is also maintained so that the crawler never revisits a link it has already seen. This results in a breadth-first traversal. The crawler avoids moving to another site by not following non-local links.
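
The traversal described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the crawler's actual code: the names \texttt{crawl\_for\_flickr\_images}, \texttt{fetch\_html} and \texttt{is\_flickr\_image} are hypothetical, the \texttt{max\_pages} limit is our addition, `local' is taken to mean `same host name', and the `already-viewed' collection is kept as a set rather than a queue.

\begin{verbatim}
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkAndImageParser(HTMLParser):
    """Collects <a href> and <img src> values from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def fetch_html(url):
    """Default fetcher; any callable mapping a URL to HTML works."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def is_flickr_image(url):
    """True for hosts of the form farm{farm-id}.static.flickr.com."""
    return urlparse(url).netloc.endswith(".static.flickr.com")


def crawl_for_flickr_images(start_url, fetch=fetch_html, max_pages=100):
    """Breadth-first crawl of one site.

    Returns a dict mapping each visited page URL to the list of
    Flickr image URLs embedded in that page.
    """
    site = urlparse(start_url).netloc
    to_visit = deque([start_url])       # the `to-visit' queue
    already_viewed = {start_url}        # never queue a URL twice
    found = {}
    while to_visit and len(found) < max_pages:
        page = to_visit.popleft()       # first in, first out
        parser = LinkAndImageParser()
        parser.feed(fetch(page))
        images = [urljoin(page, src) for src in parser.images]
        found[page] = [url for url in images if is_flickr_image(url)]
        for href in parser.links:
            target = urldefrag(urljoin(page, href)).url
            is_local = urlparse(target).netloc == site
            if is_local and target not in already_viewed:
                already_viewed.add(target)
                to_visit.append(target)
    return found
\end{verbatim}

Passing the fetcher as a parameter allows the traversal to be exercised against an in-memory dictionary of pages without any network access.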

 \subsection{Flickr Query Evaluator} 
 
 If the Site Crawler detects any embedded Flickr photos, this module will extract the photo id from the Flickr URI assuming that the URI is in one of the three formats given below:
 
 \begin{verbatim}
http://farm{farm-id}.static.flickr.com/{server-id}/{id}_{secret}.jpg
	or
http://farm{farm-id}.static.flickr.com/{server-id}/{id}_{secret}_[mstb].jpg
	or
http://farm{farm-id}.static.flickr.com/{server-id}/{id}_{o-secret}_o.(jpg|gif|png)
 \end{verbatim}
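
The three formats above can be matched with a single regular expression. The sketch below is a minimal Python illustration; the names \texttt{FLICKR\_PHOTO\_URI} and \texttt{extract\_photo\_id} are hypothetical, and the assumptions that the photo id is numeric and the secret alphanumeric are ours.

\begin{verbatim}
import re

# One pattern for the three URI formats listed above.
FLICKR_PHOTO_URI = re.compile(
    r"^http://farm(?P<farm>\d+)\.static\.flickr\.com/(?P<server>\d+)/"
    r"(?P<photo_id>\d+)_(?P<secret>[0-9A-Za-z]+)"
    # Either an optional size suffix with .jpg, or the original
    # image, which may be a jpg, gif or png.
    r"(?:(?:_[mstb])?\.jpg|_o\.(?:jpg|gif|png))$"
)


def extract_photo_id(uri):
    """Return the photo id as a string, or None if no format matches."""
    match = FLICKR_PHOTO_URI.match(uri)
    return match.group("photo_id") if match else None
\end{verbatim}

Returning \texttt{None} for an unrecognized URI lets the caller skip images that are not hosted on Flickr instead of failing.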
 
Using the extracted photo id, the module obtains the information related to the photo by calling several methods in the Flickr API\footnote[1]{We used the FlickrLib Python wrapper API \cite{flickrlib} to query Flickr and obtain the license information.}. This information includes the original creator's Flickr user account id and name, and the CC license information of the photo. The response from Flickr is obtained in the JSON data format \cite{json}; parsing it yields the license attached to the photo. The license is either \emph{All Rights Reserved} or a \emph{CC license}, which may combine the Attribution, Non-Commercial, No-Derivative and Share-Alike terms. The module also queries the original photo owner's Flickr id and constructs the Flickr user profile URI used to check for attribution. This URI usually takes the following format:
\begin{verbatim}
 http://www.flickr.com/photos/{Flickr User ID}
 \end{verbatim}

This module relies heavily on the URI structure used for Flickr photos and Flickr user profiles, as it performs several string operations on these URIs to obtain the photo id and to construct the \emph{attributionURL} used to check for attribution.
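
To make the parsing step concrete, the following Python sketch reads a JSON response and derives the license and the profile URI. It is an illustration under assumptions, not the module's code: the function name \texttt{summarize\_photo} is hypothetical, the response layout (a \texttt{photo} object holding \texttt{id}, \texttt{license} and an \texttt{owner} with \texttt{nsid} and \texttt{username}) is assumed rather than taken from the source, and the table of license ids should be checked against the values that the Flickr API itself reports.

\begin{verbatim}
import json

# Assumed mapping from Flickr license ids to license names.
FLICKR_LICENSES = {
    "0": "All Rights Reserved",
    "1": "Attribution-NonCommercial-ShareAlike",
    "2": "Attribution-NonCommercial",
    "3": "Attribution-NonCommercial-NoDerivs",
    "4": "Attribution",
    "5": "Attribution-ShareAlike",
    "6": "Attribution-NoDerivs",
}


def summarize_photo(json_text):
    """Extract owner and license details from an assumed JSON layout."""
    photo = json.loads(json_text)["photo"]
    license_id = str(photo["license"])
    owner = photo["owner"]
    return {
        "photo_id": photo["id"],
        "owner_id": owner["nsid"],
        "owner_name": owner.get("realname") or owner["username"],
        "license": FLICKR_LICENSES.get(license_id, "Unknown"),
        # Every CC license in the table includes the Attribution term.
        "requires_attribution": license_id in FLICKR_LICENSES
                                and license_id != "0",
        # Profile URI in the format shown above.
        "attribution_url": "http://www.flickr.com/photos/%s"
                           % owner["nsid"],
    }
\end{verbatim}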

\begin{figure}
  \centerline{\epsfig{file=images/validator_screenshot.png, width=1\linewidth}}
  \caption{Output from the Validator}
  \label{fig-validator-endpoint}
\end{figure}
  
%\begin{figure}[h]
%  \centerline{\epsfig{file=images/uri_format.png, height=1.5in}}
%  \caption{Format of Flickr Image URIs}
%  \label{fig-flickr}
%\end{figure}

 \subsection{License Checker} 
 
If a photo has a CC license attached to it, then according to the older CC 2.0 recommendation that Flickr photos are still under (as of April 2009), the photo should be given proper attribution regardless of the purpose for which it is used. Therefore, if the Flickr Query Evaluator determines that a Flickr photo on a particular page has a CC license, this module looks for the Flickr user URI or the Flickr user name within the parent DOM node and the sibling nodes of the image (the same criteria used in the experiment described in Chapter \ref{assesment}). A page-level attribution check is not used because, when two or more Flickr images are embedded in a page and only one of them is properly attributed, the attribution found for that one image would be counted for the others as well, resulting in incorrect license violation detection.
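
The scoped check can be sketched in Python as follows. This is a minimal illustration, not the License Checker's code: the function name \texttt{is\_attributed} is hypothetical, the input is assumed to be a well-formed XHTML fragment without a default namespace, and the scope is approximated by the subtree of the image's parent element, which covers the parent's own text and the image's siblings.

\begin{verbatim}
import xml.etree.ElementTree as ET


def is_attributed(xhtml, image_src, owner_name, attribution_url):
    """Check for attribution near the image, not anywhere on the page."""
    root = ET.fromstring(xhtml)
    # ElementTree keeps no parent pointers, so build a lookup table.
    parent_of = {child: parent
                 for parent in root.iter() for child in parent}
    for img in root.iter("img"):
        if img.get("src") != image_src:
            continue
        scope = parent_of.get(img, root)
        text = "".join(scope.itertext())
        hrefs = [a.get("href") for a in scope.iter("a")]
        if owner_name in text or attribution_url in hrefs:
            return True
    return False
\end{verbatim}

With two images in separate paragraphs and a credit only in the first paragraph, the function returns \texttt{True} for the first image and \texttt{False} for the second, whereas a page-level search for the same name would accept both.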
    
 \subsection{Notification System} 

In the current implementation of the validator, the ``Notification System'' is a Web interface that accepts, via a REST interface, the URI of a Web site to validate. If the validator finds any CC license violations, it reports them as shown in Figure \ref{fig-validator-endpoint}: the problematic image, the original owner of the image, and the license it is under. The user is then expected to go back to her compilation and correctly attribute the image in question using the information shown in the user interface. 
This module is named the ``Notification System'' because, when integrated with a closed-world system such as QDOS \cite{qdos} that incorporates Flickr user account information, it can be used to send notifications to the original content creator about the license violation.


%Moved to the conclusion to make it consistent

%\section{Issues}

%As was already mentioned, Flickr is still using the older CC 2.5 recommendation. This recommendation does not allow users to express \emph{CC 0} - which means no rights are reserved with the image or to specify additional use restrictions via the \emph{cc:morePermissions}. 

%Also, Flickr does not allow enough granularity when specifying the licenses, i.e. the users cannot control their rights on the individual photos or on the photo sets. The license they specify will be applied to their entire collection of photos. 

%Since the Flickr API does not support these additional use restrictions, the tool described in here is not capable of exploring the extra license information to determine whether attribution should be verified using either the \emph{cc:attributionName} or the \emph{cc:attributionURL} or both or some other license or whether attribution should be checked at all.

%In addition, CC has not yet advocated the exact scoping as to where the attribution RDFa or the text indicating attribution name has to be put relative to the content that is being reused. Therefore, in the tool we have developed, we make a reasonable assumption and check for attribution details within the parent node, and the sibling nodes of the containing element of the content in question. 
