\chapter{Summary}
\label{summary}

\section{Contributions}

We have provided an assessment of the number of license violations on the Web, and thus highlighted the need to make Web users more aware of the licensing options available to them. The tools we have implemented provide policy awareness using existing Semantic Web technologies. These tools also provide a platform for using the data exposed on the Semantic Web.

The tools are:

\begin{enumerate}

\item \emph{Creative Commons License Violations Validator for Flickr Images}: 

This validator can be used to check whether a Web site has reused Flickr images inappropriately and thus violated someone else's license terms. If there are any violations, the validator provides the information that must be included to make the site license-compliant.

\item \emph{Semantic Clipboard}:

This detects reusable images while a user is browsing and enables them to seamlessly integrate those images along with their metadata. Such addition of license metadata enables the transparent transfer of content between applications, and makes people more aware of the policies associated with content reuse. 

\end{enumerate}

\section{Challenges}

There are several limitations in the software we have developed for this project. Some of them are due to the lack of clear-cut definitions of license terms that can be directly transcoded to programmable verification methods. Some are due to portability problems. Some are due to user practices, while others are due to how widely certain technologies have been adopted. These challenges, and possible future directions for addressing them, are given in the following sections.

\subsection{Challenges for the Attributions License Violations Validator}

\subsubsection{Tracking Provenance}

The validator sufficiently addresses the problem of inappropriate content reuse, specifically Flickr image reuse, on the Web. However, we are unable to guarantee that it will handle all cases. A malicious user could very easily change the image URI by uploading the same image to a different server. License violation detection can only work if the image URI actually points to the Flickr site.

Another interesting scenario is that any uploader can assign CC licenses to images on Flickr regardless of whether that user has the actual rights to do so. In other words, if someone uploads a copyrighted photo from \emph{Getty Images} and assigns a CC license on Flickr, and an innocent user downloads and uses this photo, then that user will be violating copyright law without knowing it.

Therefore, we need to have some capability to track the provenance of image data, and to identify whether a particular image has been used elsewhere in a manner that violates the original license terms. There are two hurdles in achieving that. \emph{First}, image transformations are very easy to perform with the image manipulation programs available today. Hence, even if there were a giant database of images (perhaps implemented by indexing the fingerprints of all the images on the Web), anyone who chooses to cheat could easily do so by making an unnoticeable change to a single pixel, which changes the fingerprint because the fingerprint is not invariant under image transformations. Therefore, we are assuming that this tool will be used in a non-adversarial manner by honest people who wish to validate their documents against attributions license violations. \emph{Second}, provenance might mean different things to different people. For example, one might think of provenance preservation as linking to the original URI, without embedding the content in some other location. However, one might also argue that provenance would be preserved if the content is embedded in a new location, but mentions the original URI from which it came. In the validator we have implemented, we assume that provenance is preserved by linking to the original image URI from Flickr.
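
The first hurdle can be illustrated with a minimal sketch. It assumes, purely for illustration, that the fingerprint is a SHA-256 digest of the raw pixel bytes; the fingerprinting scheme, the function name \texttt{fingerprint} and the tiny sample image are all hypothetical and are not taken from the validator.

```python
import hashlib


def fingerprint(pixels: bytes) -> str:
    """Return a SHA-256 digest of raw pixel bytes (an assumed fingerprint scheme)."""
    return hashlib.sha256(pixels).hexdigest()


# A hypothetical 4x4 grayscale image stored as 16 raw bytes.
original = bytes([128] * 16)

# Change a single pixel by one intensity level, which a viewer would not notice.
altered_pixels = bytearray(original)
altered_pixels[5] += 1
altered = bytes(altered_pixels)

# The two digests differ, so an exact-match index would miss the altered copy.
print(fingerprint(original) == fingerprint(altered))
```

Under this assumption, any index keyed on exact digests treats the altered copy as an unrelated image, which is why the validator is limited to non-adversarial use.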

\subsubsection{Subsequent Changes in the License}

There is no law preventing a user from changing the license given to an image at any point in time. The change can even be from a CC license to full copyright protection. This makes the reuser vulnerable to copyright infringement claims by the original owner, and puts the burden of proof on the reuser. Therefore, a mechanism to record temporal changes in the licensing information would be very useful.

\subsubsection{Scope of the Human Readable Attribution Notice}

One of the major assumptions we have made in developing this tool is that the attribution must be specified within the parent node or the sibling nodes of the image element. Otherwise we classify it as an instance of misattribution. This assumption works in practice and seems to be the most logical choice. However, since there is no standard agreement as to what the correct scoping for attribution is, this assumption can give a wrong validation result. The solution to this problem is twofold. (1) CC should give a guideline as to what the correct scoping of attribution should be relative to the content that is attributed. (2) Flickr (and any other such service) should expose the license metadata as RDF, instead of providing an API to query. The latter method is preferred as it enables data interoperability and relieves tool authors from having to write data wrappers for each service.
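
The scoping assumption can be sketched as follows. This is a hedged illustration and not the validator's actual code: the function name \texttt{has\_scoped\_attribution} is hypothetical, the input is assumed to be well-formed XHTML without a namespace declaration, and the attribution is assumed to be an \texttt{a} element whose \texttt{href} equals the expected attribution URI. Since the parent's subtree contains the siblings, one search over that subtree covers both cases.

```python
import xml.etree.ElementTree as ET


def has_scoped_attribution(xhtml: str, image_uri: str, attribution_uri: str) -> bool:
    """Check for an attribution link in the parent node or sibling nodes of an image."""
    root = ET.fromstring(xhtml)
    # ElementTree keeps no parent pointers, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for img in root.iter("img"):
        if img.get("src") != image_uri:
            continue
        parent = parent_of.get(img)
        if parent is None:
            continue
        # iter("a") visits the parent itself and everything below it,
        # which includes the siblings of the image and their descendants.
        for link in parent.iter("a"):
            if link.get("href") == attribution_uri:
                return True
    return False


in_scope = (
    '<div><p><img src="http://example.org/photo.jpg"/>'
    '<a href="http://example.org/owner">Photo by owner</a></p></div>'
)
out_of_scope = (
    '<div><p><img src="http://example.org/photo.jpg"/></p>'
    '<p><a href="http://example.org/owner">Photo by owner</a></p></div>'
)
print(has_scoped_attribution(in_scope, "http://example.org/photo.jpg", "http://example.org/owner"))
print(has_scoped_attribution(out_of_scope, "http://example.org/photo.jpg", "http://example.org/owner"))
```

In the second sample the attribution sits in a neighbouring paragraph, so the sketch reports misattribution even though a human reader might accept it. This is exactly the kind of wrong validation result that an agreed scoping guideline would avoid.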

\subsubsection{Limited Flickr Support for License Expression}

Flickr is still using the older CC 2.0 recommendation, which specifies that the license metadata be included in HTML comments. This makes it impossible to process the metadata programmatically. This recommendation has been superseded by ccREL \cite{hal08cc}, which was published in March 2008. But Flickr has not yet implemented the newer recommendation, and as a result, users are not able to express \emph{CC-Zero} (which means no rights are reserved with the image), or specify additional use restrictions via \emph{cc:morePermissions}, or use any of the other additions in the new recommendation that support richer semantics, including \emph{cc:attributionName} and \emph{cc:attributionURL}.

% The following paragraph is disputable - as I have seen evidence otherwise
%Also, Flickr does not allow enough granularity when specifying the licenses. In other words, users cannot control their rights on the individual photos or on the photo sets. The license they specify will be applied to their entire collection of photos. It would be useful if users can specify the licenses selectively. 

\subsection{Challenges for the Semantic Clipboard}

\subsubsection{License Granularity in the HTML DOM}

Content creators have the freedom to assign whatever license they see fit to their works. However, there is no standard agreement as to the granularity at which the license is applied. For example, ccREL does not specifically state whether the license applies to the entire document in which the content is found or to the individual content item. The Semantic Clipboard assumes that a license is applied to each image, and extracts the RDF metadata for that image in order to ascertain what the license for that particular image is. This assumption holds if the original content creator publishes her work with a tool that automatically embeds the license metadata for each content item rather than giving an all-encompassing license at the page level.

\subsubsection{Browser Dependence}

One of the major drawbacks of the Semantic Clipboard is that it depends on the Firefox browser. Firefox has only 22.48\% of the browser market share according to \cite{browser-market}. Therefore, it seems that this application will not be widely adopted unless it is made available in the other major browsers. Developing an Opera Widget, a Chrome Extension, a Safari Plugin or an Internet Explorer Content Extension, or making this tool completely browser independent, seem to be viable future directions for making the tool available to the masses.

\subsubsection{Usability vs. Operating System Independence}

It would be convenient to copy the attribution XHTML as \emph{rich text} into a text editor which accepts hyperlinked text and displays it without the source text. Including this feature would be useful from a usability perspective, but it would raise concerns about the portability of the Semantic Clipboard across different operating systems. The Semantic Clipboard would need to know which data formats the target applications accept, and this is highly dependent on the operating system. The \emph{data flavor} used when copying an image in the current implementation is `ASCII text'. Thus the application is portable to any operating system. We have only tested the application on Mac OS X 10.5.6 and Ubuntu 9.04, but we assume that it should work on other platforms.

\section{Future Work}

\subsection{Check for Other Types of License Violations}

Detecting whether an image has been used commercially would be of much interest to content creators, especially if the second use of the image decreases the monetary value of the original image. But as was discussed in Chapter \ref{background}, from the point of view of content consumers, it is hard to give a precise definition of what constitutes a \emph{commercial use}. For example, if the images are used in a Web site that has subscribed to a dynamic advertisement generation service, it can easily be argued that the advertisements that appear on the page and the image content that is embedded in the page have no direct correlation. However, as discovered by the CC survey \cite{nc_user_study}, a large number of content creators actually generate revenue indirectly through advertisements. Therefore, if the definition of \emph{non-commercial use} becomes clearer and much more objective, the validator can be used to check for such violations as well.
%Since certain Web sites can be white-listed as non-profit and non-commercial, perhaps based on the specific top level domain name such as `.org' associated with the Web site, the tool can make the inference that embedding an image in such a Web site would not lead to a non-commercial license violation. 

It would also be interesting to check for share-alike license violations. These violations happen when a conflicting license is given when the content is reused. The solution, therefore, is to check the RDFa in both the original page and the page where the image was embedded, to see whether the license declared in the latter is the same as the original CC license.
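
A minimal sketch of such a share-alike check is given below. It rests on several simplifying assumptions that are ours and not part of the validator: both pages are well-formed XHTML fragments without namespace declarations, the license is declared with a \texttt{rel="license"} link, licenses are compared by exact URI (ignoring a trailing slash), and a share-alike license is recognised by the \texttt{sa} element in its Creative Commons license code. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def license_uris(xhtml: str) -> set:
    """Collect the targets of all rel="license" links in an XHTML fragment."""
    root = ET.fromstring(xhtml)
    uris = set()
    for node in root.iter():
        rel_values = (node.get("rel") or "").split()
        if "license" in rel_values and node.get("href"):
            uris.add(node.get("href").rstrip("/"))
    return uris


def is_share_alike(uri: str) -> bool:
    """Return True for CC license URIs whose code contains "sa", e.g. by-sa."""
    parts = uri.split("/licenses/")
    if len(parts) != 2:
        return False
    code = parts[1].split("/")[0]
    return "sa" in code.split("-")


def violates_share_alike(original_xhtml: str, reuse_xhtml: str) -> bool:
    """Flag a reuse page that does not repeat the original share-alike license."""
    original = license_uris(original_xhtml)
    reused = license_uris(reuse_xhtml)
    required = {uri for uri in original if is_share_alike(uri)}
    return any(uri not in reused for uri in required)


original_page = (
    '<div><img src="photo.jpg"/>'
    '<a rel="license" href="http://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA</a></div>'
)
conflicting_page = (
    '<div><img src="photo.jpg"/>'
    '<a rel="license" href="http://creativecommons.org/licenses/by-nc/2.0/">CC BY-NC</a></div>'
)
print(violates_share_alike(original_page, original_page))
print(violates_share_alike(original_page, conflicting_page))
```

The sketch treats a page-level license link as if it applied to the image, so a real implementation would first have to resolve which license statement refers to which resource.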

\subsection{Give Credit to the Original Content Creator as Requested}

%In the world of digital data, licenses are utilized to grant certain rights to the user by actually reducing the rights held by the user. For example, data created by user X are protected for X's use completely and fully by the copyright law. However, X can grant a license to use the data for Y's work provided that Y use it for non-commercial purposes. If that person start using X's data for commercial purpose, X can revoke the license by asking Y to stop using the data. X can further obligate the Y, to do something when using X's data; for example, X can ask Y to pay X some money, or give X the credit the way X wants it. 

The requirement for attribution is a form of social compensation for reusing one's work. While mentioning the creator's name or URI in an attribution directs attention to that individual, other forms of \emph{attention mechanisms} can also be implemented. For example, a content creator can obligate the users of her works to give monetary compensation, or require that they include certain ad links in the attribution XHTML, or ask for attribution in an entirely arbitrary manner. These extra license conditions can be specified using \emph{cc:morePermissions}. Tools could be built to interpret these conditions and give credit to the original creator as requested.

\subsection{Extend to Other Media Types}

We have only explored one domain of content, specifically image reuse on the Web. However, there are billions of videos uploaded on YouTube, and a potentially countless number of documents on the Web, which have various types of licenses applied. While the Motion Picture Association of America (MPAA), the Recording Industry Association of America (RIAA) and other such big organizations are working towards preserving the rights of their artists on YouTube, other video and audio sharing sites and peer-to-peer file sharing networks, there are no viable alternatives for ordinary users who intend to protect their rights using CC. Thus a solution of this nature, which detects CC license violations based on the metadata of free-floating content, would be very useful.

\subsection{License Granularity}

The tool we have developed only works if every image found on the page has its own license. A possible extension of this tool would be to determine the license of an image when it does not have a license of its own, but is contained within a page that has a license or is a member of a set of images (e.g. a photo album) that has a license. Protocol for Web Description Resources (POWDER) \cite{powder}, a mechanism that allows the provision of descriptions for groups of online resources, seems like a viable method for making the license descriptions about the resources explicit. Tool builders can then rely on the POWDER descriptions to help users make appropriate content reuse decisions.
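
One way such a fallback could work is sketched below. The precedence order (image first, then the containing set, then the page) is our assumption, since the text does not prescribe one, and the function name \texttt{effective\_license} is hypothetical. The sketch takes license URIs that have already been extracted and does not parse POWDER descriptions.

```python
from typing import Optional


def effective_license(
    image_license: Optional[str],
    set_license: Optional[str],
    page_license: Optional[str],
) -> Optional[str]:
    """Return the most specific license available: image, then set, then page."""
    for candidate in (image_license, set_license, page_license):
        if candidate:
            return candidate
    # No license statement was found at any level.
    return None


# An image without its own license inherits the license of its photo album.
album_license = "http://creativecommons.org/licenses/by/2.0/"
print(effective_license(None, album_license, None))
```

When no level carries a license the function returns \texttt{None}, which a tool should treat as "no reuse permission known" and not as permission to reuse.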

 %From a practical standpoint, it can be assumed that only one license is applied to the entire page or a set of images. 

\subsection{Persistent Data Storage}

Currently, images that are copied with their metadata to the Semantic Clipboard are overwritten when new content is copied to the clipboard. In other words, the tool only supports copying one image at a time. It would be useful to have persistent data storage in which to register content along with its license metadata, index it, keep it across browser sessions and retrieve it whenever the user needs it.

\subsection{User Study}

It would be interesting to measure how user behavior changes with the introduction of tools such as the License Violations Validator and the Semantic Clipboard. A measurement of the increased (or decreased or unchanging) level of license awareness would be an important metric in determining the success of these tools. Therefore, we would like to perform a controlled user study to determine how many violations a set of users make before and after the introduction of these tools during a limited period of time.

\subsection{Widening Social Criteria}

The central idea presented in this thesis is not limited to Creative Commons licenses. One could also combine it with a reasoning engine that reasons over security clearances to produce policy-aware tools that can calculate the appropriate level of data classification. Similarly, one could combine it with a tool that is aware of assertions regarding data reuse and create a tool that checks for violations of citizens' privacy \cite{danny_info_account}. Therefore, we hope the work in this thesis will spur research on how best to preserve data provenance in other domains.


\section{Conclusion}

As the license violations experiment indicated, there is a strong lack of awareness of licensing terms among content reusers. This raises the question of whether machine readable licenses are actually working. Perhaps more effort is needed to bring these technologies to the masses, and more tools are needed to bridge the gap between the license-aware and the license-unaware. An important research question that stems from this work is how to preserve the provenance of content on the Web. We have simply assumed URIs as the provenance preservation mechanism when developing the tools described in this thesis. However, it would be an interesting challenge to track provenance based on the content itself, without having to rely on a unique identifier. This would enable us to find license violators easily, in addition to validating one's own work for any violations. Also, determining programmatically whether a particular reuse of material is allowable remains subjective, especially since some of the laws and standards are quite ambiguous in defining these terms.

%Another question which is at a much more philosophical level is whether every resource on the web be licensed and that fact be made aware of? We have demonstrated that it is possible to do so with tangible media such as images and text. We can easily extend to other types such as video and audio as well. However what about intangible intellectual property such as ideas expressed on the Web? 

In general, social constraints are functions of any part of the blossoming Social Web we are experiencing today. As we are living in an era of increasing user generated content, these constraints can be used to communicate the acceptable uses of such content. We need tools, techniques and standards that strike an appropriate balance between the rights of the originator and the power of reuse. The rights of the originator can be preserved by expressing what constitutes appropriate uses and restrictions using a rights expression language. These rights will be both machine and human readable. Reuse can be simplified by providing tools that leverage these machine readable rights to make users more aware of the license options available and to ensure that the user is license or policy compliant. Such techniques can be incorporated in existing content publishing platforms, validators or even Web servers to make the process seamless. This thesis has demonstrated several tools that lay the foundation for such policy aware systems. We hope these will stimulate research in this area in the future.