\chapter{Background}
\label{background}

\section{Content Reuse Defined}

Content reuse, or in other words ``mash-ups" have existed for as long as content has existed. Musicians routinely use other songs and tunes in their compositions. Collage art is considered to be creative, and even original although it is composed from many different sources. Scientists routinely utilize data from different sources to conduct their own experiments. However, mash-ups, as we know them now, are a peculiarly digital phenomenon of the Web age. They are entirely a product made possible by the portable, mixable and immediate nature of digital technology. 

Reuse detection is important in domains such as plagiarism detection and even in biological sequence mining. Significant research has been carried out to detect reuse of text. This includes information retrieval techniques as mentioned in \cite{qsign, scam}, where the document is treated as a sequence of symbols and substring based fingerprints are extracted from the document to determine repetitive patterns. 

A potential legal problem arises when more than one legally encumbered content or data streams are bound together in the form of a mash-up. The users of the original content should remain within the bounds of the permitted use of the components comprising the mash-up. They can choose to ignore these permissions, or follow them. Either way, this creates a burden on them. Ignoring the license terms puts them at the peril of breaking the law, and following them slows the creative process.

\section{Policies for Rights Enforcement on the Web}

Policies governing the reuse of digital content on the Web can take several forms. It can be the \emph{Digital Rights Management} (DRM) approach exercised for example, by some commercial systems when distributing their media content. It could also be the \emph{Copyrights} alternative, that requires anybody reusing the content, either to have explicit permission from the original content creator, or express the original content creators rights as specified by the CC. 

\subsection{Digital Rights Management (DRM)}

Distribution and usage of copyrighted content is often controlled by up-front policy enforcement mechanisms such as DRM. These systems usually restrict access to the content, or prevent the content from being used within certain applications. The core concept in DRM is the use of digital licenses, which grant certain rights to the user. These rights are mainly usage rules which are defined by a range of criteria, such as frequency of access, expiration date, restriction to transfer to another playback device, etc. There are many applications that enforce DRM, such as Apple iTunes or Microsoft Windows Media Rights Manager. An example of a DRM enforcement would be a DRM software enabled playback device not playing a DRM controlled media transferred from another playback device, or not playing the media after the rental period for that media has ended. 

The use of DRM to express and enforce rights on content on the Web raises several concerns. First, the consumer privacy and anonymity are compromised. The authentication process in the DRM system usually requires the user to reveal her identity to access the protected content. This could lead to profiling of user preferences, and monitoring of user activity at large \cite{DBLP:conf/ccs/FeigenbaumFSS01}. 
%The other concern is the inability to use content under the ÒFair UseÓ doctrine \cite{fair_use}. 
Another huge criticism of DRM is the usability of the content, where the user is limited to using proprietary applications to view or play the digital content requiring vendor lock-in.

\subsection{Copyright through Licensing}

Copyright is essentially the right to make copies. Prior to the Berne Convention, a multilateral treaty to which most countries now adhere \cite{berne}, content creators were required to explicitly give the \emph{copyright} (\copyright) notice to indicate that all rights are reserved for their works. Otherwise, their creations will fall under the public domain. However, Article 5(2) of the convention provides that ``the enjoyment and the exercise of [copyrights] shall not be subject to any formality" which means that any content that does not have any explicit license  will be protected by copyright law, whether the author is aware of it or not. Many argue this to be over inclusive to inhibit creativity \cite{copyright_debate}. For example, when reusing content that are copyrighted, the reuser faces many obstacles such as locating the copyright holder, obtaining permission, and usually having to pay royalty.

In the context of digital content, a `license' is describes the conditions of usage of copyrighted material. 
%Licenses are covered by federal copyright laws. 
A license on a digital file can exist whether or not there are any corresponding users for it. A user should abide by the license that covers the usage, and if any of the conditions of usage described in that license are violated, then the user should cease using that content. 

\subsubsection{Choosing a License}

Choosing a license can be very confusing. In the open source regime alone there are as many as 90 different licenses \cite{license_babel}. If choosing a license is tough for the creator, imagine what understanding the license is like for the end user.  

Creative Commons, a non-profit organization has been striving to provide a simple, uniform, and understandable set of licenses that content creators can use to issue their content under. These licenses provide a solution to the problem of copyright on the Web, while ensuring that the culture of reusing existing works to foster creativity is not hindered. Often, Web authors post their content with the understanding that it will be quoted, copied, and reused. Further, they may wish that their work only be used with  attribution, used only for non-commercial use, distributed with a similar license and will be allowed in other free culture media. To allow these use restrictions CC has composed four distinct license types: \emph{BY} (attribution), \emph{NC} (non-commercial), \emph{ND} (no-derivatives) and \emph{SA} (share-alike) that can be used in combinations that best reflects the content creator's rights.   

In order to generate the license XHTML easily, CC offers a license chooser that is hosted at \emph{http://creativecommons.org/license}. There, the users are given options to select for their works, and it will generate a snippet of XHTML that contains the RDFa \cite{rdfa} they need to embed when they are publishing the content on the Web.

\subsubsection{Creative Commons Rights Expression Language (ccREL)}

\emph{ccREL} \cite{hal08cc} is the standard recommended by the CC for machine readable expression of the meaning of a particular license. Content creators have the flexibility to express
their licensing requirements using this rights expression language and are not forced into choosing a pre-defined license for their works. Also, they are free to extend licenses to meet their own requirements. ccREL allows a publisher of a work to give additional permissions beyond those specified in the CC license with the use of the \emph{cc:morePermissions} property to reference commercial licensing brokers or any other license deed, and a \emph{dc:source} to reference parent works. Therefore, unlike in older CC recommendations, it is also possible to have content under a CC license that does not require attribution.

\subsubsection{Anatomy of a CC License in RDFa}

%Unclear - but correct
%\begin{figure}[!h]
%  \centerline{\epsfig{file=images/scenario_modified.jpg, width=1\linewidth}}
%  \caption{Illustration of Usage of CC Licensed Content}
%  \label{fig-cc-scenario-2}
%\end{figure}

%Clear Image
\begin{figure}[!h]
  \centerline{\epsfig{file=images/scenario.jpg, width=1\linewidth}}
  \caption{Illustration of Usage of CC Licensed Content}
  \label{fig-cc-scenario-2}
\end{figure}

Figure \ref{fig-cc-scenario-2} illustrates a simple CC license applied to an image. Suppose Alice is an avid Flickr user and she uploads her photos regularly. In her Flickr account settings she has applied \emph{CC-BY-2.0} to all her photos. This means she allows anybody to reuse her photos as long as they properly attribute her as given in her CC license terms.  Flickr does not yet markup this license information in RDFa, and is still under the older CC 2.0 specification. However, the CC license badge hyperlinked to the CC \emph{Deed Page} \footnote[1]{ A Deed Page is a human readable page that describes the license type.} of the corresponding license will be displayed in all of her photo album pages.

Supposing that Alice uses a service which marks up the license metadata in RDFa, the code snippet shown in Figure \ref{fig-cc-rdfa-license} shows how she would express her rights for one of her photos identified by the following URI:
http://flickr.com/photos/alice/somephoto.jpg. 
The RDF data are given in red.

\begin{figure}[!h]
\begin{alltt}
\singlespace{
<div about="\textcolor{red}{http://flickr.com/photos/alice/somephoto.jpg}">
<a rel="\textcolor{red}{license}" 
   href="\textcolor{red}{http://creativecommons.org/licenses/by-nc-nd/2.0/}"> 
   Creative Commons license 
</a>. 
If you use this photo within the terms of the license or make special 
arrangements to use the photo, please list the photo credit as 
<span property="\textcolor{red}{cc:attributionName}">\textcolor{red}{Alice}</span> 
and link the credit to <a rel="\textcolor{red}{cc:attributionURL}"
   href="\textcolor{red}{http://flickr.com/photos/alice}"> 
   http://flickr.com/photos/alice 
</a></div>}\end{alltt}
  \caption{License Expressed in RDFa}
  \label{fig-cc-rdfa-license}
\end{figure}

The important things to note in this code snippet is that Alice has given the subject (the URI of the photo) the following properties:
\begin{itemize}
\item \emph{attributionName}: `Alice'
\item \emph{attributionURL}: `http://flickr.com/photos/alice'.
\end{itemize}

If we assume that Bob uses Alice's photo in his blog or some other derivative work without embedding the XHTML that references Alice via the \emph{attributionURL} or the \emph{attributionName}, or does not at least mention the source of the photo - it is a clear violation of Alice's CC license terms.

\section{Data Purpose Algebra}

One of the hazards of combining multiple data sources is that incompatible licenses can get mixed up creating a license that basically freezes the creative process. Take for example a Non-Commercial (NC) license that gets mixed with a Share-Alike (SA) license. A SA license requires that the resulting product be shared under exactly the same conditions as the component product under SA. The resulting license in our scenario becomes NC-SA. But while the result satisfies the first license by also being NC, it fails the second license by not being only SA. So, if the component product under SA was unencumbered by the NC clause, adding it to an NC restricted component violates the SA requirement. Also we cannot simply ignore the NC clause and give the resulting work only the SA license. This is because somebody else might use the resulting derivative work which does not have the NC clause in some commercial use violating the rights of the original creator who composed the NC component. Many of these license combination conflicts are given in Figure \ref{fig-license-matrix}.

It has been shown that it is possible to model data usage policies programmatically by what is known as the \emph{Data Purpose Algebra} \cite{DBLP:conf/policy/HansonBKSW07}, by describing each content item $i$ in a data set, a source or agent that processes the data $Q_{\sc d}(i)$, the category of data $K_{\sc d}(i)$, and its purpose $P_{\sc d}(i)$. When another agent combines two or more data sets, a new data set is created whose content, category and purpose are some function of the agent, content, category and purpose of each of the component data sets. Specifically, if the agent or the source of the new data item is $a'$, the new category becomes the function ${\cal K}(K_{\sc d}(i))$ of the given category, and the allowed purposes of the new data item will be the more complex function ${\cal P}(P_{\sc d}(i), A_{\sc d}(i), a', K_{\sc d}(i))$ that may depend on the original purposes, the agents, and the category of the original data.

We believe that when reusing content on the Web,  the same principle could be applied. For the content item $i$ that is reused, the source function $Q_{\sc d}(i)$ would be represented by the URI of the content. The category of data $K_{\sc d}(i)$ will be represented by the content type (i.e. image, text, video, audio, etc.) that is being reused. The purpose $P_{\sc d}(i)$ will be determined by the CC license associated with it that specifies the allowed uses, restrictions and conditions. Then the function which composes the new set of purposes ${\cal P}(P_{\sc d}(i), A_{\sc d}(i), a', K_{\sc d}(i))$ should generate the XHTML that is required to embed the content with proper attribution.

\section{Inline Provenance using Metadata}

To be useful, metadata need to have three important characteristics: they have to be easy to produce, be embedded within the data they describe, and be easily readable. The easiest way to produce metadata is to have them be produced automatically. Any metadata that has to be produced manually by the user usually doesn't get produced at all. The easiest way to ensure that the link between metadata and the data they describe is not broken is by embedding the former inside the latter. This way, the two travel together inseparably  as a package. 
Finally, metadata have to be accessible easily, readable both manually as well as programmatically. At best, the metadata should be readable by crawlers of various search engines. Since metadata and data are traveling together, if popular search engines such as Google and Yahoo can read the metadata, by default the data become available to anyone who searches for it. RDF \cite{rdf} is the best known metadata format which satisfy all these criteria. It has lot of community support, adopted widely and is a W3C  recommendation.

Extensible Metadata Platform (XMP) \cite{xmp} is a technology that allows one to transfer metadata along with the content by embedding the metadata in machine readable RDF. This technology is widely deployed in embedding licenses in free-floating multimedia content such as images, audio and video on the Web.

Another format which is nearly universal when it comes to images is the Exchangeable Image File format (EXIF) \cite{exif}. International Press Telecommunications Council (IPTC) photo metadata standard \cite{iptc} is also another well known standard. The metadata tags defined in these standards cover a broad spectrum including date \& time information, camera settings, thumbnail for previews and more importantly, the description of the photos including the copyright information. However, these latter two formats do not store these metadata in RDF. They both use key-value pairs.

One major drawback of inline metadata formats such as XMP and EXIF is that, it is embedded in a binary file, completely opaque to nearly all users, whereas metadata expressed in RDFa will require colocation of metadata with human visible HTML. In addition to that, these metadata formats can only handle arbitrary number of properties using RDF predicates and these formats do not seem to take advantage of the rich expressivity offered by RDF.

\begin{landscape}
%\centering     % optional, probably makes it look better to have it centered on the page
\begin{figure}[!h]
  \centerline{\epsfig{file=images/license_matrix_new.jpg, height=4in}}
  \caption{License Composition Matrix for Detecting Share Alike Violations}
  \label{fig-license-matrix}
\end{figure}
The rows represent the resultant license of the composite work and the columns represent the license of a component content item. The \textcolor{green}{\ding{52}} symbol is used when the corresponding resultant license \textbf{\emph{can}} be given to the composite work if one of the components is under the license given by the column. However, if at least one of the cells is given the \textcolor{red}{\ding{56}} symbol, then the resultant license has a conflict with the corresponding component content item, and thus leads to a license violation. 
\end{landscape}

