Reality Check Ahead: Data Mining and the Implications for Real Estate Professionals

MLS is a 100-year-old institution that expertly aggregates and houses most, if not all, of real estate’s most critical data. Today, our data is being leveraged, sourced, scraped, licensed and syndicated by a grand assortment of players, partners and members. It’s being utilized in ways never imagined just a decade ago. Or, for that matter, six months ago.

The result: a plethora of competitive, strategic, financial and security-based issues have surfaced that challenge every MLS, as well as every single one of our members/customers.

I think about this all the time. During a recent visit with my son KB, a college junior, he told me how Google recently came to his campus offering everyone free email, voice mail, Docs (to replace MS Office) and data storage – an impressive list of free services for all.

I asked him why this publicly traded company would give away its products for free. Despite his soaring IQ and studies in information systems technology, he couldn’t come up with an answer.

Searching Google on my laptop, I presented KB with the following Google customer email (September 2009) that read: “We wanted to let you know about some important changes … in a few weeks, documents, spreadsheets and presentations that have been explicitly published outside your organization and are linked to or from a public website will be crawled and indexed, which means they can appear in search results you see on Google.com and other search engines.” Note: once data is available in Google searches, Google’s business model calls for selling advertising around that search result.

Bear in mind this refers to published docs and not those labeled as private – a setting within Google Docs of which not all users are aware.

I also presented him with the specific EULA (End-User License Agreement) language that states how a user grants a “perpetual, irrevocable, royalty free license to the content for certain purposes (republication, publication, adaptation, distribution), extending to the provision of syndicated services and to use such content in provision of those services.”

 

I recounted for KB how back in March of 2010, we learned in the national news that: “A confidential, seven-page Google Inc. “vision statement” shows the information-age giant is in a deep round of soul-searching over a basic question: How far should it go in profiting from its crown jewels—the vast trove of data it possesses about people’s activities?”

Source: Wall Street Journal, August 10, 2010

The chart accompanying that Wall Street Journal report shows that nearly 85% of respondents are concerned about advertisers tracking their online behavior.

Then the Wall Street Journal published an article in its “What They Know” series that discusses how companies are developing ‘digital fingerprint’ technology to track our use of individual computers, mobile devices and TV set-top boxes so they can sell the data to advertisers. It appears that each device broadcasts a unique identification number that computer servers can recognize, store in a database and later analyze for monetization. The accompanying 3-minute video is a must-see!
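To make the mechanics concrete, here is a minimal sketch in Python of how a tracking server might derive a stable “barcode” from attributes a device routinely exposes. This is my own illustration, not any vendor’s actual code; the attribute names, values and hashing scheme are assumptions for demonstration only.

    import hashlib

    def device_fingerprint(attributes: dict) -> str:
        """Combine routinely exposed device attributes into one stable identifier."""
        # Sort the keys so the same device always produces the same digest.
        canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    # Illustrative attributes only; real trackers reportedly combine many more signals.
    visit = {
        "user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64)",
        "screen": "1920x1080x24",
        "timezone": "UTC-05:00",
        "fonts": "Arial,Calibri,Georgia,Verdana",
    }
    # The same device yields the same ID on every visit, even if cookies are cleared,
    # so the ID can be stored in a database and matched against later activity.
    print(device_fingerprint(visit))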

By the way, they call this practice “Human Barcoding.” KB began to squirm. As we all should.

 

Data. Security. And real estate

So what do the “innovative” data mining and monetization methods now in use by Google and others mean for real estate – specifically, for the data aggregated by an MLS and then shared around the globe?

We all must first grasp what happens to listing data when it’s collected and syndicated into “the cloud”, as well as the human and transactional interactions that follow from start to finish (and beyond, actually).

Second, we need to understand how business intelligence and analytics are being applied to the data generated by real estate transactions today. If the data is being monetized without the knowledge and permission of its rightful owner, then agreements potentially need to be negotiated (or renegotiated) and modified to get in step with today’s (and tomorrow’s) inevitable ways of doing business. I’m not in any way opposed to data mining per se; the issue at hand is fair compensation for the data on which it is based.

Here’s why the latest developments regarding Google (and others) are vitally important:

 

  • The world of leveraging digital information is changing very rapidly. As businesses push harder and deeper in their quest to monetize data, information, bits/bytes and mouse clicks, we must establish clear and informed consent on who exactly owns the data, who should control it and how it should be monetized. Protecting OUR “crown jewels”, if you will.
  • What do you know about “Human Barcoding”? It’s time for industry leaders to research this new phenomenon and begin to establish the basis for an industry position as it pertains to residential real estate.
  • How do we, as an industry, determine the real value of data beyond the property-centric context? As true business intelligence and data mining progress in our industry, we need “comps” to build upon to derive a valuation model.
  • What exactly is the MLS’s role? Are we the “stewards” of the data (on behalf of our customers) that emanates from the property record and the subsequent transaction and electronic interactions between all the parties connected to it?  How should the MLS industry confront the challenge?

We all certainly remember when the national consumer portals planted their flag(s) on this industry and, by association, MLS territory. Their rationale then was that they would help drive “eyeballs” and traffic to the inventory. Indeed they have. But, looking back, it all came with a pretty steep price tag.

For example, referral fees were subsequently replaced with advertising revenues that more often than not started chipping away at the edges of the broker’s affiliated business models (mortgage, insurance, etc.). Now, as a result, the margins of the business are perilously thin from a broker’s perspective.

The MLS has its roots in a business created to facilitate the fair distribution of commissions and compensation among brokers. It’s safe to say, dear Toto, that we are not in Kansas anymore. In a digital landscape where value can be derived in so many unique ways, and where the motives of others seeking to increase the value of the asset are potentially suspect, it’s critical that we convene right now to assert an intellectual lead on what is happening here, or at least make the conscious decision to step aside.

I’m sure there are many other questions and reasons why this is “mission critical” to us. But what I’ve offered, with the help of several really smart folks in the industry, provides a good starting point. We welcome all industry commentators on this topic. Thanks in advance for sharing ….

John L. Heithaus Chief Marketing Officer, MRIS (john.heithaus@mris.net)

P.S. – a “tip of the hat” to Greg Robertson of Vendor Alley for starting us on this path with his excellent post “Inside Trulia’s Boiler Room”*. I also benefited mightily from the comments of David Charron of MRIS, Marilyn Wilson of the WAV Group and Marc Davison of 1000watt Consulting, and I extend my appreciation to them for sharing their perspectives.

* After this story ran, the YouTube video interview with a Trulia staffer was made “private” and is now inaccessible. Vendor Alley’s analysis of the video provides an excellent overview of the situation.

 

Journalism in the Age of Data

Journalists are coping with the rising information flood by borrowing data visualization techniques from computer scientists, researchers and artists. Some newsrooms are already beginning to retool their staffs and systems to prepare for a future in which data becomes a medium. But how do we communicate with data, and how can traditional narratives be fused with sophisticated, interactive information displays?

Watch the full version with annotations and links at datajournalism.stanford.edu.

Produced during a 2009-2010 John S. Knight Journalism Fellowship at Stanford University.

Qwiki Alpha Launch - Tigho

I was just invited to the Qwiki Alpha, the interactive “information experience” platform that creates multimedia-rich wikis algorithmically out of data sets, instead of by user input and peer review!

Beyond simply being fascinating and amazingly cool, Qwiki has profound implications for the future of search and data organization.

Please note that according to Qwiki:

1. This experience was not generated by humans. It was generated by machines.

2. This experience is completely curated.

3. The experience is completely interactive.

I had written about how excited I was about the possibilities of Qwiki a few weeks ago, but apparently they have now launched the Alpha version to select users. On Friday, Qwiki also announced a round of funding from tech celebs Eduardo Saverin (Facebook co-founder) and Jawed Karim (YouTube co-founder).

I cannot wait to test this out more, but see below for an entirely computer-generated entry about the word “Wiki.”

For All Its Flaws, Wikipedia is the Way Information Works Now

Wikipedia, which turns 10 years old this weekend, has taken a lot of heat over the years. There has been repeated criticism of the site’s accuracy, of the so-called “cabal” of editors who decide which changes are accepted and which are not, and of founder Jimmy Wales and various aspects of his personal life and how he manages the non-profit service. But as a Pew Research report released today confirms, Wikipedia has become a crucial aspect of our online lives, and in many ways it has shown us — for better or worse — what all information online is in the process of becoming: social, distributed, interactive and (at times) chaotic.

 

According to Pew’s research, 53 percent of American Internet users said they regularly look for information on Wikipedia, up from 36 percent of the same group the first time the research center asked the question in February of 2007. Usage by those under the age of 30 is even higher — more than 60 percent of that age group uses the site regularly, compared with just 33 percent of users 65 and older. Based on Pew’s other research, using Wikipedia is more popular than sending instant messages (which less than half of Internet users do), and is only a little less popular than using social networking services, which 61 percent of users do regularly.

The term “wiki” — just like the word “blog,” or the name “Google” for that matter — is one of those words that sounds so ridiculous it was hard to imagine anyone using it with a straight face when Wikipedia first emerged in the early 2000s. But despite a weird name and a confusing interface (which the site has been trying to improve to make it easier to edit things), Wikipedia took off and has become a powerhouse of “crowdsourcing,” before most people had even heard that word. In fact, the idea of a wiki has become so powerful that document-leaking organization WikiLeaks adopted the term even though (as many critics like to point out) it doesn’t really function as a wiki at all.

Most people will never edit a Wikipedia page — like most social media or interactive services, it follows the 90-9-1 rule, which states that 90 percent of users will simply consume the content, 9 percent or so will contribute regularly, and only about 1 percent will ever become dedicated contributors. But even with those kinds of numbers, the site has still seen more than 4 billion individual edits in its lifetime, and has more than 127,000 active users. Those include people like Simon Pulsifer, once known as “the king of Wikipedia” because he edited over 100,000 articles. Why? Because that was his idea of fun, as he explained to me at a web conference.
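As a quick back-of-the-envelope illustration of what the 90-9-1 rule implies, here is a tiny Python sketch; the audience figure is a made-up round number, not Pew’s or Wikipedia’s.

    def split_90_9_1(audience: int) -> dict:
        """Apply the 90-9-1 rule of thumb to a hypothetical audience size."""
        return {
            "readers only (90%)": int(audience * 0.90),
            "occasional contributors (9%)": int(audience * 0.09),
            "dedicated contributors (1%)": int(audience * 0.01),
        }

    # Hypothetical audience of 100 million visitors.
    for group, count in split_90_9_1(100_000_000).items():
        print(f"{group}: {count:,}")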

Yes, there will always be people who decide to edit the Natalie Portman page so that it says she is going to marry them, or create fictional pages about people they dislike. But the surprising thing isn’t that this happens — it’s how rarely it happens, and how quickly those errors are found and corrected.

With Twitter, we are starting to see how a Wikipedia-like approach to information scales even further. As events like the Giffords shooting take hold of the national consciousness, Twitter becomes a real-time news service that anyone can contribute to, and it gradually builds a picture of what has happened and what it means. Along the way, there are errors and all kinds of other noise — but over time, it produces a very real and human view of the news. Is it going to replace newspapers and television and other media? No, just as Wikipedia hasn’t replaced encyclopedias (although it has made them less relevant).

That is the way information works now, and for all their flaws, Wikipedia and Jimmy Wales were among the first to recognize that.

-via gigaom.com

And the Smartest Site on the Internet Is...

Mims's Bits

Google now lets you filter sites by "reading level".

The internet used to be full of highbrow reading material, until broadband penetration exploded and everyone with a credit card managed to find his or her way onto the web. Finding your way back to the rarefied air that used to suffuse the 'net can be a slog, so Google has a new way to help you out: You can now sort sites by reading level.

(For those of you following along at home, under Google's 'advanced' search, simply switch on this option by hitting the dropdown next to "Reading level.")

The results are fascinating. Searching for any term, no matter how mundane, and then hitting the "advanced" link at the top strips away all the spam, random blogs and all the rest of the claptrap from the advertisers, hucksters and mouthbreathers.

This is only one of the varieties of elitism enabled by the new feature, which was created by statistically analyzing papers from Google Scholar and school teacher-rated webpages that are then compared to all the other sites in Google's index.
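Google hasn’t published the details of its classifier, so purely as an illustration of the general idea of scoring a page’s reading level, here is a Python sketch using the classic Flesch-Kincaid grade-level formula. This is not Google’s method, which, per the article, is a statistical model trained on Google Scholar papers and teacher-rated webpages.

    import re

    def flesch_kincaid_grade(text: str) -> float:
        """Rough Flesch-Kincaid grade level; a stand-in for a trained classifier."""
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        if not words:
            return 0.0
        # Crude syllable count: runs of vowels per word, at least one per word.
        syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
        return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

    basic = "The cat sat on the mat. It was warm."
    advanced = ("Epistemological considerations notwithstanding, the heteroscedasticity "
                "of the residuals undermines the asymptotic consistency of the estimator.")
    print(round(flesch_kincaid_grade(basic), 1))     # low score -> "basic"
    print(round(flesch_kincaid_grade(advanced), 1))  # high score -> "advanced"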

As pioneered by Adrian Chen of Gawker, by far the most interesting application of the tool is its ability to rate the overall level of material on any given site, simply by dropping site:[domain.com] into the search box.

By this measure, the hallowed halls of the publication you're reading now fare pretty well:

Not quite as well as some sites that share our audience:

But certainly better than certain other, decidedly middlebrow, publications:

It's when you turn to the scientific journals that the competition really heats up:

And the battle between traditional and open access publishing models takes on a new dimension:

(Just for reference, here’s how MIT itself performs.)

And, much as I’m loath to admit it, the smartest site on the Internet is...

Meanwhile, excluding sites aimed at children, here's the dumbest:

-via technology review

FTC Considers Do-Not-Track List 07/28/2010

The Federal Trade Commission is considering proposing a do-not-track mechanism that would allow consumers to easily opt out of all behavioral targeting, chairman Jon Leibowitz told lawmakers on Tuesday.

Testifying at a hearing about online privacy, Leibowitz said the FTC is exploring the feasibility of a browser plug-in that would store users' targeting preferences. He added that either the FTC or a private group could run the system.

Leibowitz said that while Web users on a no-tracking list would still receive online ads, those ads wouldn't be targeted based on sites that users had visited in the past.
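Neither the testimony nor the article specifies how such a plug-in would store or signal the preference. Purely as a sketch under those assumptions, here is what a browser-level opt-out could look like in Python: a preference stored outside the cookie jar and attached to outgoing requests via the “DNT” header that browser makers later experimented with. The file location is a placeholder, and honoring the signal is entirely up to the server.

    import json
    from pathlib import Path

    # Hypothetical sketch of a browser plug-in storing a do-not-track preference
    # outside the cookie jar, then attaching it to outgoing requests as a header.
    PREFS_FILE = Path("tracking_prefs.json")   # placeholder location

    def set_do_not_track(enabled: bool) -> None:
        PREFS_FILE.write_text(json.dumps({"do_not_track": enabled}))

    def outgoing_headers() -> dict:
        prefs = json.loads(PREFS_FILE.read_text()) if PREFS_FILE.exists() else {}
        # "DNT: 1" means "do not track me"; whether an ad server honors it
        # is entirely up to the server.
        return {"DNT": "1"} if prefs.get("do_not_track") else {}

    set_do_not_track(True)
    print(outgoing_headers())   # {'DNT': '1'}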

Three years ago, a coalition of privacy groups including the World Privacy Forum, Center for Digital Democracy and Center for Democracy & Technology proposed that the FTC create a do-not-track registry, similar to the do-not-call registry. At the time, the online ad industry strongly opposed the idea of a government-run no-tracking list.

Currently, many people who want to opt out do so through cookies, either on a company-by-company basis or through the Network Advertising Initiative's opt-out cookie (which allows users to opt out of targeting from many of the largest companies). But those opt-outs aren't stable because they're tied to cookies, which often get deleted.
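To see why cookie-based opt-outs are fragile, here is a minimal Python sketch; the cookie name, value and domain are illustrative, not the NAI’s actual implementation.

    from http.cookies import SimpleCookie

    # An ad network records the opt-out preference as an ordinary cookie.
    jar = SimpleCookie()
    jar["optout"] = "1"                                # illustrative name/value
    jar["optout"]["domain"] = ".adnetwork.example"     # placeholder domain
    jar["optout"]["max-age"] = 5 * 365 * 24 * 3600     # nominally five years
    print(jar.output())   # the Set-Cookie header the opt-out page would return

    # But the preference lives in the same cookie jar users routinely clear.
    # Once the cookies are deleted, the opt-out silently disappears.
    jar.clear()
    print("still opted out?", "optout" in jar)         # False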

The Network Advertising Initiative recently rolled out a browser plug-in that enables consumers to opt out of targeted ads by NAI members.

Leibowitz also told lawmakers that he personally favored opt-in consent to behavioral targeting, or receiving ads based on sites visited. "I think opt-in generally protects consumers' privacy better than opt-out, under most circumstances," he said. "I don't think it undermines a company's ability to get the information it needs to advertise back to consumers."

Online ad companies say that behavioral targeting is "anonymous" because they don't collect users' names or other so-called personally identifiable information, but Leibowitz said that it might be possible to piece together users' names from clickstream data. He told lawmakers about AOL's "Data Valdez," in which AOL released three months of "anonymized" search queries for 650,000 users. Even though the company didn't directly tie the queries to users' names, some users were identified based solely on the patterns in their search queries.

Several lawmakers expressed concerns about behavioral advertising during Tuesday's hearing. Sen. Claire McCaskill (D-Mo.) said she was "a little spooked out" about online tracking and ad targeting.

McCaskill said that after reading online about foreign SUVs, she noticed that she was receiving ads for such cars. "That's creepy," she said, likening it to someone following her with a camera and recording her moves.

She added that if an "average American" were to learn that someone was trailing him around stores with a camera, "there would be a hue and cry in this country that would be unprecedented."

Sen. Jay Rockefeller (D-W. Va.) and Sen. John Kerry (D-Mass.) both expressed concern that privacy policies weren't giving Web users enough useful information about online ad practices.

Rockefeller suggested that some companies were burying too much information in lengthy documents that consumers don't read. "Some would say the fine print is there and it's not our fault you didn't read it," he said, adding, "I say, that's a 19th-century mentality."

Kerry added that he wasn't sure consumers understood how companies use data. "I'm not sure that there's knowledge in the caveat emptor component of this," he said.