About eBio SURF

eBio Summer Undergraduate Research Fellowships 2009

TOPIC AREAS & PROPOSED RESEARCH

Information Technology

#1. What’s Your Address? In 1790? With GPS? This project entails building a master list of every valid postal address in the United States with its GPS coordinates such that the distance between any two addresses can be determined. Sounds simple? It’s not.

The research fellow will first explore the obvious option, buying the list, and will likely conclude that it is not possible, but perhaps we will be surprised. We have devised an algorithm for creating the list by cross-matching multiple databases and feeding the results to GPS query software. This project requires automation skills – somewhat akin to screen scraping, but on a vast scale.
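Once every address carries coordinates, the distance requirement itself is the easy part. A minimal sketch, assuming coordinates are stored as decimal degrees and that a spherical-Earth approximation (the haversine formula) is accurate enough; the function name and the radius constant are illustrative choices, not part of the eBio algorithm:

```python
import math

# Mean Earth radius in kilometres (an assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

A spherical model ignores the Earth's flattening, so an ellipsoidal method may be preferable if sub-kilometre accuracy matters over long distances.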

We left out one piece: we would like that list for 1790 and 2009, including the GPS coordinates for the 1790 addresses. Essentially, we are looking for all current addresses and their genealogy. A large parcel of land in 1790 may have become four farms by 1810 and a subdivision with a thousand homes by 2009. We are looking for the best method to establish the current list and trace the addresses' genealogy as far as possible.
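One way to picture an address genealogy is as a tree in which each parcel records the parcels it was later divided into. The sketch below is only an illustration of that idea; the class name, its fields, and the sample labels are hypothetical, and real records would also need to handle mergers and renamings, which a simple tree does not capture.

```python
from dataclasses import dataclass, field


@dataclass
class Parcel:
    label: str                      # e.g. a tract name or a street address
    year: int                       # year in which this parcel first appears
    children: list = field(default_factory=list)

    def split_into(self, year, labels):
        """Record that this parcel was divided into new parcels in `year`."""
        new_parcels = [Parcel(label, year) for label in labels]
        self.children.extend(new_parcels)
        return new_parcels

    def leaves(self):
        """Parcels that were never subdivided further (the current list)."""
        if not self.children:
            return [self]
        result = []
        for child in self.children:
            result.extend(child.leaves())
        return result

    def lineage(self, label):
        """Labels from this parcel down to `label`, or None if not found."""
        if self.label == label:
            return [self.label]
        for child in self.children:
            path = child.lineage(label)
            if path is not None:
                return [self.label] + path
        return None
```

For example, a 1790 tract split into two farms in 1810, one of which became two house lots by 2009, yields the house lots and the remaining farm from `leaves()`, and `lineage()` on a house lot returns the chain back to the 1790 tract.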

#2. OCR – One step forward. The project objective is to advance the state of the art in Optical Character Recognition. You’ve likely heard the statistic: it’s 99.98% accurate. That may hold with the best scan and an optimal text layout, but in the real world accuracy is lower.

eBio requires 100% accuracy and we have developed a methodology that achieves it for the type of material that is of the greatest interest in our applications. The research fellow will be given the methodology and asked to determine the best method to automate it, so that it can be used routinely. The researcher will also attempt to prove out the methodology in a variety of test scenarios.

The necessary skills are an understanding of, or a willingness to learn about, the various commercial OCR packages, their limitations, and their optimal settings. The ability to manipulate different XML formats into a standard format is also required, as is the ability to design an intuitive interface. For the project, the research fellow will need to sign an additional non-disclosure agreement – specifically relating to disclosure of the methodology involved.
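To give a sense of the XML work, here is a minimal sketch that reads two made-up OCR output formats and writes both out in one common layout. The element and attribute names ("word", "w", "left", "top" and so on) are invented for illustration and do not correspond to any particular commercial package; only the standard-library xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET


def words_from_format_a(xml_text):
    """Hypothetical format A: <word x=".." y=".." text=".."/> elements."""
    root = ET.fromstring(xml_text)
    return [{"text": w.get("text"), "x": int(w.get("x")), "y": int(w.get("y"))}
            for w in root.iter("word")]


def words_from_format_b(xml_text):
    """Hypothetical format B: <w left=".." top="..">text</w> elements."""
    root = ET.fromstring(xml_text)
    return [{"text": (w.text or "").strip(),
             "x": int(w.get("left")), "y": int(w.get("top"))}
            for w in root.iter("w")]


def to_standard_xml(words):
    """Serialize the common word records into one assumed standard layout."""
    page = ET.Element("page")
    for word in words:
        element = ET.SubElement(page, "word", x=str(word["x"]), y=str(word["y"]))
        element.text = word["text"]
    return ET.tostring(page, encoding="unicode")
```

Converting every input into the same intermediate records first means that each additional OCR package needs only one new reader function, while the output side stays unchanged.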

#3. Digitizing Microfilm with Improved Metadata. Much of the best archival material in the world is stored exclusively on microfilm. In a preservationist world, it is the format with the longest lifespan. Ask anyone with WordStar files from 1980. Digital data is more perishable than one might think. That said, microfilm doesn’t exactly facilitate research, so it is sometimes digitized so that OCR can be performed and entire archives can be searched for an occurrence of a word, rather than a researcher spending months at a microfilm reader going through issue after issue of a daily newspaper. But the state of the art is not very good.

Our pilot project entails the digitization of a few now-defunct small-town newspapers that exist only on microfilm in state archives. We need more than OCR. We need to create metadata that gives the query results context. At a minimum, we need article recognition – the ability to index a paper such that a link will go to a specific pixel in a PDF. To understand the issue, the research fellow might go to the New York Times web site and query the obituary archives for a common name. The query will often return results labeled “Deaths”, or with no label at all. To find the instance of the name for which you are searching, you need to painstakingly search through small fonts that are not recognized by standard search functions.
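The kind of index described above can be pictured as a lookup from each recognized word to the page and pixel position where it occurs. The sketch below is one assumed shape for such an index, not the eBio design; the tuple layout (word, page, x, y) and the function names are hypothetical, and true article recognition would additionally need to group words into articles.

```python
from collections import defaultdict


def build_index(ocr_words):
    """Map each lower-cased word to the (page, x, y) positions where it occurs.

    `ocr_words` is an iterable of (word, page, x, y) tuples, an assumed
    shape for OCR output that includes pixel coordinates.
    """
    index = defaultdict(list)
    for word, page, x, y in ocr_words:
        index[word.lower()].append((page, x, y))
    return index


def find(index, word):
    """Return every recorded position of `word`, ignoring case."""
    return index.get(word.lower(), [])
```

With positions stored this way, a search result can point at the exact spot on the page instead of at a whole issue labeled only “Deaths”.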

Digitization of the microfilms themselves is often tricky. We seldom have the luxury of using a master. More likely, we are working with second or third generation copies. The research fellow will explore ways to get optimal results given what we have and what we need, and analyze the various business models for getting it done. All necessary logistical support will be provided.

#4. Integrated email archive tool. One of our current needs is to build an Outlook plug-in that will convert emails to PDFs and automatically direct them to an unusual archive, without requiring any other third-party software that must be licensed on an individual basis. This project requires an understanding of PDF file structures and their creation, the design of plug-ins, user interface best practices, as well as Internet file transfer protocols and security.

#5. Deep Web Mining. Much of our work involves deep web mining of public domain databases. It requires creativity, sleuthing to understand site structures and limitations, and respect for the legal limitations that we impose on ourselves. All copyrights and/or fair usage rules are respected. We don’t open closed doors, or even doors that are partially closed. Our list of sites that need to be mined is endless and perpetual. That is the challenge for the research fellow. Mining a site once is relatively easy. Designing systems that detect changes or re-mine on a periodic basis in a scalable fashion is a challenge. We are open to suggestions and experimentation.

For example, we might want to mine a database of public property records…in Paris…to match as closely as possible against 15,000 Parisian addresses from 1942. Those are the unusual challenges we face every day.
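Detecting changes between periodic mining runs, as mentioned in topic #5, can be approached by storing a fingerprint of each page and comparing it on the next visit. This is a minimal sketch of that one idea, assuming page bodies have already been fetched as bytes; the function names and the dictionary layout are hypothetical, and a production system would also need scheduling, politeness limits, and handling for pages that change trivially on every request.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """SHA-256 digest of a page body, used as a compact change marker."""
    return hashlib.sha256(content).hexdigest()


def changed_pages(previous, current_pages):
    """Compare freshly fetched pages against stored fingerprints.

    `previous` maps page key -> fingerprint from the last run.
    `current_pages` maps page key -> raw bytes fetched in this run.
    Returns (keys that are new or changed, updated fingerprint map).
    """
    changed = []
    updated = {}
    for key, body in current_pages.items():
        digest = fingerprint(body)
        updated[key] = digest
        if previous.get(key) != digest:
            changed.append(key)
    return changed, updated
```

On the first run every page is reported as changed; on later runs only pages whose bytes differ are reported, so only those need to be re-parsed.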

#6. Your own Topic. Pick a topic of your own interest that illustrates a way to resolve some of the technical issues described throughout these research topics, or just pick one that improves the state of the art of some facet of Information Technology that might interest us or benefit the public good. We encourage explorers. Someone once said “Basic Science is when I don’t know what I am Researching”. We are open to that kind of open exploration.
