\chapter{License Violations Assessment on the WWW}
\label{assesment}

\section{Experiment Setup}

\begin{figure}
  \centerline{\epsfig{file=images/experiment_results.jpg, height=3.5in}}
  \caption{Results from the Experiment}
  \label{fig-experiment}
\end{figure}

The goal of the experiment is to estimate the level of CC attribution license violations on the Web. Since Flickr hosts over 100 million Creative Commons licensed images (as of April 2009), detecting attribution license violations involving Flickr images seems to be a good way of obtaining an approximate measurement of the level of license violations on the Web. Specifically, our task in the experiment is to gather quantitative evidence of attribution license violations for several samples of sites that embed Flickr images.

\subsection{Ensuring a Fair Sample}

The Technorati blog indexer \cite{technorati} crawls and indexes weblog-style Web sites and gathers a wide range of information about them. It keeps track of the articles on a Web site, what links to it, what it links to, how popular it is, how popular the Web sites that link to it are, how popular the people that read it are, and so on. Most importantly, all Technorati data are time dependent, which means that the Technorati \emph{authority rank}\footnote[1]{Authority Rank is a measurement that determines the top `n' number of results from any query to the Technorati API.} is based on the most recent activity on a particular Web site.

Web sites used in this experiment were obtained through the \emph{Technorati Cosmos} method. The Cosmos method retrieves the Web sites that link to a given base URI.
Therefore, to obtain samples for the experiment, several Flickr server farm URIs of the following general format \cite{flickr_url} were used as base URIs.
\begin{verbatim}
http://farm<farm-id>.static.flickr.com/<server-id>/<id>_<secret>.
(jpg|gif|png)
\end{verbatim}

Since Flickr has several server farms, the base URIs were randomly generated by altering the Flickr server farm ids each time the experiment was run, in order to obtain a fair sample. In addition, the randomness of the samples was further ensured by running the experiment at spaced intervals (e.g.\ a week or two apart). This is because the \emph{authority rank} given to a Web site by Technorati, and hence the results returned from the Cosmos method, change dynamically as new content gets created. The links in the Technorati Cosmos are only valid for 180 days, and if there are no fresh links coming in to a site regularly, its rank goes down, changing the result set returned. This factor was therefore also used in generating a random sample of Web sites to check for attribution license violations.

\subsection{Checking for Attribution}
\label{check-attr}

After a sample was collected, attribution for each of the images embedded in these sites was checked using a few heuristics. Since Flickr still uses the older CC 2.0 recommendation, Flickr users do not have much flexibility in specifying their own \emph{attributionURL} or \emph{attributionName} values to state how they would like attribution to be given to them. However, it is considered general practice to give attribution by linking to the Flickr user profile or by giving the Flickr user name (which could be interpreted as the \emph{attributionURL} and the \emph{attributionName} respectively) or, at the least, by pointing to the original source of the image \cite{credit-on-flickr}. Therefore, the criteria for checking attribution consist of looking for the \emph{attributionURL}, the \emph{attributionName}, or any \emph{source citations} within a reasonable level of scoping from where the image is embedded in the Document Object Model (DOM).
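
The criteria above can be stated compactly as a predicate. The following is only an illustrative formalization and not necessarily the exact rule implemented in the experiment. The notation is introduced here for convenience: $S(i)$ is the set of DOM elements within the chosen scope of image $i$ (for example its parent and sibling elements), and $n_i$, $u_i$ and $s_i$ are the Flickr user name, the Flickr user profile URI and the original source URI of image $i$.
\[
\mbox{Attributed}(i) \iff \exists\, e \in S(i) :\; \mbox{mentions}(e, n_i) \;\vee\; \mbox{links}(e, u_i) \;\vee\; \mbox{links}(e, s_i)
\]
Under this reading, an image is counted as misattributed or non-attributed when no element in $S(i)$ satisfies any of the three conditions. How wide $S(i)$ is chosen directly affects how many images are flagged.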

%The algorithm used for checking attribution is as follows:

%\begin{algorithmic}[1]
%\STATE Collect a random sample of $WebSites$ which link to Flickr farm URIs
%\FORALL {$WebSites$}
%\STATE Find the $Images$ in the DOM linking to a Flickr image URI
%\FORALL {$Images$}
%\STATE $ID \leftarrow \mbox{Extract image id from the Flickr image URI}$
%\STATE Use $ID$ to query Flickr
%\STATE $License \leftarrow \mbox{License Deed Information}$
%\STATE $AttrbutionName \leftarrow \mbox{Flickr User Name}$
%\STATE $AttributionURL \leftarrow \mbox{Flickr User Profile URI}$
%\IF {Parent or the Sibling DOM elements of the Current Image links to $License$ and mentions $AttributionName$ or $AttributionURL$}
%\STATE  $Attribution \leftarrow \mbox{Given}$
%\ELSE
%\STATE $Attribution \leftarrow \mbox{Not Given}$
%\ENDIF
%\ENDFOR
%\ENDFOR
%\end{algorithmic}

\section {Results}

The results from three samples of Web sites, gathered at two-week intervals, are shown in Table 3.1.
These results have misattribution rates ranging from 78\% to 94\%. Figure \ref{fig-experiment} illustrates the results from a few sample runs of the experiment. The overall summary for each of the samples includes:
\begin{itemize}
\item Total number of Web sites tested
\item Total number of images in all of the Web sites
\item Total number of properly attributed images in all of the Web sites
\item Total number of misattributed images
\item Total number of instances that led to an error (due to bad HTML, parsing errors, Flickr errors, etc.)
\end{itemize}

Using these values, the misattribution percentage for each sample is calculated. In addition, for each offending site, the interface lists each non-attributed or misattributed image, the original owner of the image, and the license it is under.
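
As a sketch of this calculation, let $N_{a}$ and $N_{m}$ denote the number of properly attributed and misattributed images in a sample. These symbols are introduced here only for illustration, and the sketch assumes that instances which led to an error are excluded from the denominator, since the exact treatment of errors is not specified above.
\[
\mbox{Misattribution (\%)} = \frac{N_{m}}{N_{a} + N_{m}} \times 100
\]
If errors were instead counted in the total number of images, the denominator would be that total, and the resulting percentage would be correspondingly lower whenever errors occurred.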

These results indicate a strong need for awareness among reusers of content to check and honor the licenses associated with that content. However, although the misattribution rates in these samples are very high, it should be kept in mind that the samples consist only of Web sites that embed Flickr images through a Flickr farm URI, and are not samples of general Web sites.

\begin{table}[!h]
\begin{center}
    \begin{tabular}{ | l | l | l |}
    \hline
    \# of Web sites & Total \# of Images & Misattribution   \\ \hline
    67  & 426  & 78\% \\ \hline
    70 & 241 & 80\% \\ \hline
    70 & 466 & 94\% \\ \hline
    \end{tabular}
\end{center}
\caption{Attribution License Violation Rates of the Experiment Samples}
\label{tbl-results}
\end{table}

\subsection{Issues in this Experiment}
\label{experiment-issues}

\subsubsection{No Self-Attribution}

The results from the experiment include cases where users have not attributed themselves, i.e.\ a user uploads her photos to Flickr and uses them in her own blog or Web site. Since those are the user's own photos, she assumes that there is no need to attribute herself. This assumption is not entirely valid, as the CC BY license deed \cite{by-legal-code} specifies: \emph{``If You Distribute you must keep intact all copyright notices for the Work and provide (i) the name of the Original Author (or pseudonym, if applicable) ...''}. This means that, if there is a license attached to the original content, the original user becomes the reuser, and therefore has to honor the license even though she imposed it herself. This might seem absurd, since it should not matter to her if she violates her own license terms. However, if the user gives attribution to herself, it guides other people who want to reuse the content from that secondary work. Therefore, by not attributing herself, the user may be violating her own rights in the long run.

A solution to this issue is hard to realize, as it is difficult to infer the Web site owner from the data presented on the Web site. Even if that were possible, it is hard to correlate the Flickr photo owner with the Web site owner. For example, in Figure \ref{fig-experiment_sample2}, the first attribution violation result shows photos from the Flickr users `Tambako the Jaguar' and `Arne List'. People often assume pseudonyms on the Web, so these two users might in fact be the same person, and the Web site where these particular photos are embedded may belong to that same person. Since these connections are not explicitly stated in a machine-readable format such as RDF, it becomes very hard to determine the real owners of the image and the Web site programmatically.

\begin{figure}[!h]
  \centerline{\epsfig{file=images/experiment.jpg, height=3.5in}}
  \caption{Example Result from the Experiment}
  \label{fig-experiment_sample2}
\end{figure}

\subsubsection{Location of the Attribution}

The majority of the Web sites crawled and examined in this experiment did not use ccREL to mark up attribution. Therefore, we used a heuristic to check for the existence of attribution. This heuristic looks for the \emph{attributionName} (constructed from the Flickr user name) or the \emph{attributionURL} (constructed from the Flickr user profile URI or the original source document's URI). Visually, this corresponds to including the attribution information immediately after the content that is being attributed, as shown in Attribution Example 1 of Figure \ref{fig-attribution_anamoly}. However, since there is no strict definition from CC as to how attribution should be scoped, someone could also attribute as shown in Attribution Example 2 of Figure \ref{fig-attribution_anamoly}, or the attribution could even be buried within the text of the document. This experiment only considers attributions of the first kind. The rationale behind this choice is that a user may have intended to include more than one work from the same original content creator and, by mistake, attributed some of them but not others. However, we discovered that different people use different levels of attribution scoping, making it hard to detect proper attribution in the HTML.

\begin{figure}[!h]
  \centerline{\epsfig{file=images/attribution_anamoly_example.jpg, height=2.5in}}
  \caption{Different Ways of Attribution}
  \label{fig-attribution_anamoly}
\end{figure}

\subsubsection{Blog Aggregators Ignoring Attribution Information}

Tumble-logs such as tumblr.com cut down the text and favor short-form, mixed-media posts over long editorial posts. The use of such blog aggregators is another related problem in obtaining an accurate assessment of attribution license violations. For example, in a blog post where a photo was reused, the original owner of the photograph may have been duly attributed. But when the tumble-log pulls in the feed for that post from the original Web site and presents the aggregated content, the attribution details may be left out. This problem is difficult to circumvent because there is no standard for how license and attribution details should be carried through aggregation. Hence, detecting such cases is also difficult.

\subsection{Refining the Results}

In order to validate the results against the issues mentioned in Section \ref{experiment-issues}, the samples of Web sites used in the experiment were inspected manually to determine how many images were incorrectly identified as misattributed or non-attributed. The precision of the result from each sample was then calculated using the following formula:
\begin{equation}
\mbox{Precision} = \frac{|\{\mbox{Correctly Identified Misattributed Images}\} \cap \{\mbox{Total Images Retrieved}\}|}{|\{\mbox{Total Images Retrieved}\}|}
\end{equation}

The results from the three samples are given in Table 3.2. They indicate relatively low precision values, which means that there are a lot of false positives. These are mainly due to the fact that people do not attribute themselves when reusing their own content. Note that calculating \emph{recall} values for this experiment was not possible because of the manner in which the random samples were generated. The samples are obtained from the Web, where there is no reliable estimate of how many real attribution license violations exist.
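
To make the last two points concrete, precision and recall can be restated in the standard information retrieval terms. The symbols are introduced here for illustration only: $TP$ is the number of images correctly flagged as misattributed, $FP$ is the number of images flagged although manual inspection judged them not to be real violations (such as self-reuse), and $FN$ is the number of real violations that were not flagged.
\[
\mbox{Precision} = \frac{TP}{TP + FP}, \qquad \mbox{Recall} = \frac{TP}{TP + FN}
\]
Manual inspection of the flagged images yields $TP$ and $FP$, so precision can be computed. $FN$, in contrast, would require knowing the real violations that the check did not flag, for which no reliable estimate exists, and this is why recall is left unreported.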

\begin{table}[!h]
\begin{center}
    \begin{tabular}{ | l | l | l |}
    \hline
    Sample & Correctly Identified Images & Precision   \\ \hline
    1  & 183  & 55\% \\ \hline
    2 & 113 & 42\% \\ \hline
    3 & 268 & 39\% \\ \hline
    \end{tabular}
\end{center}
\caption{Precision Values of the Experiment Samples}
\label{tbl-precision-recall}
\end{table}


\section {Extensions to this Experiment}
\label{expr-extensions}

\begin{figure}[!h]
  \centerline{\epsfig{file=images/chart.jpg, height=3.5in}}
  \caption{Attribution Violations and Precision}
  \label{fig-chart}
\end{figure}

This experiment can be extended to check for Non Commercial use violations and Share Alike use violations, and to detect license conflicts in composite works.

In terms of Non Commercial (NC) use, the CC deed specifies that a work under a license that includes the NC term may be used by anyone for any purpose that is not \emph{``primarily intended for or directed towards commercial advantage or private monetary compensation''}. However, this definition can be vague in certain circumstances. Take, for example, the case where someone uses a CC-BY-NC licensed image in her personal blog, properly attributing the original content creator. The blog is presumably for non-commercial use, and since she has given proper attribution, it appears that no license violation has occurred. However, there might be advertisements on the page that are generated as a direct result of the embedded image. Our user might or might not actually generate revenue from these advertisements. But if she does, it could be interpreted as `private monetary compensation'. Hence, we believe that the perception of what constitutes commercial use is subjective. CC recently conducted an online user survey to gather general opinions on what people perceive commercial use to be \cite{nc_user_study}. An important finding from this survey is that 37\% of the creators who make money from their works do so indirectly through advertising. However, there is not yet a clear-cut definition of non-commercial use against which violations could be detected and experimental results gathered.


Share Alike license violations occur when some content is reused under a conflicting license or remixed with content that has conflicting licenses. The CC survey \cite{nc_user_study} states that the most popular CC license among the respondents is BY-NC-SA. Therefore, it would be interesting to have an estimate of SA license violations as well. Figure \ref{fig-license-matrix} illustrates possible cases in which a license conflict might occur. For example, consider the case where the resultant license given to some composite work is BY-NC-SA, and the two components it is composed of are licensed under BY-NC and BY-ND-SA respectively. The BY-NC component could be used in the work, as the resultant license also has the BY and NC terms. However, the other component, which is under BY-ND-SA, specifies that if it is used in some other work, the same license has to be given to the composite work. BY-ND-SA is not the same as BY-NC-SA; therefore, this leads to a license violation.
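
The Share Alike condition in this example can be sketched as a simple check. The notation is introduced here for illustration and is not part of any CC specification: $L$ is the license of the composite work, $L_c$ is the license of a component $c$ in the set of components $C$, and $\mbox{terms}(L_c)$ is the set of terms of $L_c$. The sketch also assumes, as in the example, that SA requires exactly the same license rather than merely a compatible one.
\[
\mbox{conflict}(L, C) \iff \exists\, c \in C :\; \mbox{SA} \in \mbox{terms}(L_c) \;\wedge\; L_c \neq L
\]
For the example above, $L = \mbox{BY-NC-SA}$ and the second component has $L_c = \mbox{BY-ND-SA}$. Since $\mbox{SA} \in \mbox{terms}(L_c)$ and $L_c \neq L$, the check reports a conflict, whereas the BY-NC component does not trigger it because its license carries no SA term.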