<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Publishing Frontier &#187; Datamining</title>
	<atom:link href="http://pubfrontier.com/category/datamining/feed/" rel="self" type="application/rss+xml" />
	<link>http://pubfrontier.com</link>
	<description>A raucous public discussion of the publishing revolution.</description>
	<lastBuildDate>Tue, 06 Jul 2010 15:39:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>A reader&#8217;s delight</title>
		<link>http://pubfrontier.com/2008/02/15/a-readers-delight/</link>
		<comments>http://pubfrontier.com/2008/02/15/a-readers-delight/#comments</comments>
		<pubDate>Fri, 15 Feb 2008 21:37:34 +0000</pubDate>
		<dc:creator>Michael Jensen</dc:creator>
				<category><![CDATA[Datamining]]></category>
		<category><![CDATA[Digital Libraries]]></category>
		<category><![CDATA[Publishing]]></category>
		<category><![CDATA[Repositories]]></category>
		<category><![CDATA[books online]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[open content]]></category>
		<category><![CDATA[serendipity]]></category>
		<category><![CDATA[sergey brin]]></category>

		<guid isPermaLink="false">http://pubfrontier.com/2008/02/15/a-readers-delight/</guid>
		<description><![CDATA[I was Googling for something completely different today, using four terms and a &#8220;quoted phrase,&#8221; and had pared down the jillions to only 38 results. At the bottom of the first page of results was an oddity: My Favorite Books. I happened to notice the url: http://infolab.stanford.edu/~sergey/booklist.html And thought: Stanford, Sergey&#8230;. and clicked on it. [...]]]></description>
			<content:encoded><![CDATA[<p>I was Googling for something completely different today, using four terms and a &#8220;quoted phrase,&#8221; and had pared down the jillions to only 38 results. At the bottom of the first page of results was an oddity: My Favorite Books. I happened to notice the url:</p>
<p><a href="http://infolab.stanford.edu/~sergey/booklist.html" target="_blank">http://infolab.stanford.edu/~sergey/booklist.html</a></p>
<p>And thought:  Stanford, Sergey&#8230;. and clicked on it. And yes, it&#8217;s a 1998 looong list of &#8220;Sergey Brin&#8217;s favorite books,&#8221; from his Stanford days. His 1998 Web page is accessible from that page, where it becomes clear this long list is something he used for &#8220;Extracting Patterns and Relations from the World Wide Web,&#8221; given at the WebDB Workshop at &#8220;EDBT &#8217;98.&#8221; His home page is a charming little thing, fresh with the newness of the Web.</p>
<p>And so now, I link to it here. Not like Sergey needs links &#8212; but it is an example of the &#8220;search net&#8221; phenomenon. Because I was using <em>four</em> terms <em>and</em> a phrase, my specificity enabled serendipitous discovery of a substantive chunk of content.<br />
Publishers should think about this, because increasingly, searchers/researchers are using strategies like mine to make sense of the dense underbrush of the abundant Web. If that&#8217;s the case, and if encouraging &#8220;stumbling upon&#8221; our books is a good thing, then it behooves us to make our content indexable, one way or another.</p>
<p>At the National Academies Press site, we include, on the first page of every chapter (the books are presented page by page), the full unformatted text of the first 10 and last 10 pages of that chapter. We also include key phrases extracted from the chapter. By doing this, we provide a huge, juicy target for search engines to slurp up.</p>
<p>Consequently, if someone&#8217;s putting three, four, or five terms into Google or MSN or wherever, <em>and those terms happen to be in our chapter</em>, then we&#8217;ll show up in the search results, and get that wee bit of traffic. And a wee bit of opportunity to sell that book to someone who might be interested in it (note: only 0.24% of visitors currently buy anything from our site).</p>
<p>But those terms would almost certainly <em>not all</em> be in the book&#8217;s metadata, or in the publisher&#8217;s catalog blurb, or in the table of contents. It&#8217;s only openly indexable content that will provide a big enough pool of possibilities to match ever-more-esoteric and -specific search strategies: &#8216;net casting of a paragraph or a document, selectable groups of terms, phrase-pair searches, etc.</p>
<p>I&#8217;m still convinced that for small-market publications in particular &#8212; the kinds of books for which significant promotion is generally hard to justify &#8212; openly indexable content is a precondition for survival, in terms of long-tail backlist success in the scholarly environment. People find something, link to it, and thus promote it for free, <em>for</em> us, in the venues that care about that publication.</p>
<p>This theme pertains a bit to my comment on Joe&#8217;s <a href="http://pubfrontier.com/2008/01/15/the-baby-and-the-bath-water/">Baby and Bathwater</a> post, on the University of Pittsburgh Press&#8217;s digital library experiment, though alas, I don&#8217;t think that UPP&#8217;s library provides any indexable content. Even rough OCR would help, and I hope it&#8217;s part of their plan, eventually.</p>
]]></content:encoded>
			<wfw:commentRss>http://pubfrontier.com/2008/02/15/a-readers-delight/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Anonymous is not always</title>
		<link>http://pubfrontier.com/2007/12/23/anonymous-is-not-always/</link>
		<comments>http://pubfrontier.com/2007/12/23/anonymous-is-not-always/#comments</comments>
		<pubDate>Sun, 23 Dec 2007 22:29:56 +0000</pubDate>
		<dc:creator>Peter Brantley</dc:creator>
				<category><![CDATA[Datamining]]></category>
		<category><![CDATA[Privacy]]></category>
		<category><![CDATA[eBooks]]></category>

		<guid isPermaLink="false">http://pubfrontier.com/2007/12/23/anonymous-is-not-always/</guid>
		<description><![CDATA[Recently, Netflix released some anonymized usage data in order to seed a technical challenge (on recommending algorithms). Bruce Schneier, a well known security expert, reports that a team of University of Texas researchers successfully de-anonymized a subset of the data through correlation with public IMDb (internet movie database) entries. Bruce extends this by analogy to [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, Netflix released some anonymized usage data in order to seed a technical challenge (on recommending algorithms).</p>
<p>Bruce Schneier, a well-known security expert, <a href="http://www.schneier.com/blog/archives/2007/12/anonymity_and_t_2.html" title="Schneier on de-anonymizing data">reports that a team</a> of University of Texas researchers successfully de-anonymized a subset of the data through correlation with public IMDb (Internet Movie Database) entries.</p>
<p>Bruce extends this by analogy to point out how straightforward these forms of datamining are, and he notes the obvious parallel to book purchasing habits:</p>
<blockquote><p>[A]s opportunities for this kind of analysis pop up more frequently, lots of anonymous data could end up at risk.</p>
<p>Someone with access to an anonymous dataset of telephone records, for example, might partially de-anonymize it by correlating it with a catalog merchants&#8217; telephone order database. Or Amazon&#8217;s online book reviews could be the key to partially de-anonymizing a public database of credit card purchases, or a larger database of anonymous book reviews.</p>
<p>Google, with its database of users&#8217; internet searches, could easily de-anonymize a public database of internet purchases, or zero in on searches of medical terms to de-anonymize a public health database. Merchants who maintain detailed customer and purchase information could use their data to partially de-anonymize any large search engine&#8217;s data, if it were released in an anonymized form. A data broker holding databases of several companies might be able to de-anonymize most of the records in those databases.</p>
<p>What the University of Texas researchers demonstrate is that this process isn&#8217;t hard, and doesn&#8217;t require a lot of data. It turns out that if you eliminate the top 100 movies everyone watches, our movie-watching habits are all pretty individual. This would certainly hold true for our book reading habits, our internet shopping habits, our telephone habits and our web searching habits.</p>
<p>. . .</p>
<p>With only eight movie ratings (of which two may be completely wrong), and dates that may be up to two weeks in error, they can uniquely identify 99 percent of the records in the dataset. After that, all they need is a little bit of identifiable data: from the IMDb, from your blog, from anywhere. The moral is that it takes only a small named database for someone to pry the anonymity off a much larger anonymous database.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://pubfrontier.com/2007/12/23/anonymous-is-not-always/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
