New Library 2.0 Gang Podcast

I posted about the new Library 2.0 Gang Podcast a little while ago only to find that I had jumped the gun. Now it is really really available – so check it out – subscribe – and listen often :)

You can listen to it via Library Journal or the new Library 2.0 Gang page hosted by Talis.

In this issue, we spoke with Aaron Swartz about the Open Library and other Code4Lib conference topics. You can check out my blog post summarizing what Aaron spoke about at the conference if you want more information.


Making Digital Archives Accessible

Someone sent me a neat article from the New York Times today. The article talks about how Sports Illustrated is opening up its archives for anyone to search. The article implies that this is the way all print publications will probably go to keep their audiences. I’d rather it be that they’re doing it to provide everyone with free access to information – but I guess I’ll take it any way we can get it.

Publications are rediscovering their archives, like a person learning that a hand-me-down coffee table is a valuable antique. For magazines and newspapers with long histories, especially, old material can be reborn on the Web as an inexpensive way to attract readers, advertisers and money.

Sports Illustrated, which faces fierce daily, even hourly, competition with ESPN, Yahoo Sports and others, has something its main rivals do not: a 53-year trove of articles and photos, most of it from an era when the magazine dominated the field of long-form sports writing and color sports photography.

On Thursday, the magazine will introduce the Vault, a free site within that contains all the words Sports Illustrated has ever published and many of the images, along with video and other material, in a searchable database.

Rip this book?

Just the title makes me cringe! In this case, the word “rip” is used in the same way we use it when we refer to copying CDs:

Could the publishing industry get Napsterized? That was my first thought when I saw the marketing materials for the Atiz BookSnap, the first consumer device that enables you to “release the content” of your books by transforming the printed words on the page into digital files that can be read on computers and handheld e-readers. “It’s not a scanner,” proclaims a banner on the Atiz Web site. “It’s a book ripper.” Though ripping (which means transferring content from an external medium to your computer) does not necessarily imply an act of piracy, I couldn’t help but wonder whether this was a sign of impending apocalypse on Publishers’ Row, a scenario that could end up with people file-sharing John Grisham’s latest the way they do now with the newest Vampire Weekend tunes.

Steven Levy writes about how this machine is still way too cumbersome and pricey for the average book owner, but worries that its existence is a sign of what’s to come.

While I’d love to be able to digitally search my book collection … I think I’ll wait for the API that merges data from LibraryThing (which has my entire collection cataloged) and the Open Library (which aims to have scans of every book).

Code4Lib 2008: The Internet Archive

What a great way to open a conference like Code4Lib. The first keynote was presented by Brewster Kahle of the Internet Archive.

Brewster started by reminding us that the reason he was there talking to us, and the reason he is working on the Internet Archive, is that the library metaphor translates easily to the Internet – as librarians, we’re paid to give stuff away! We work in a $12-billion-a-year industry which supports the publishing infrastructure. With the Internet Archive, Brewster is not suggesting that we spend less money – but that we spend it better.

He started with a slide of the Boston Public Library, which has “Free to All” carved in stone. Brewster says that what people carve in stone is taken seriously – and so this is a great example of what libraries stand for. Our opportunity now is to go digital: provide free digital content in addition to the traditional content we have been providing. I loved that he then said that this is not just a time for us to be friendly together as librarians – but to work together as a community and build something that can be offered freely to all!

He went on to say that what happens to libraries is that they burn – they tend to get burned by governments who don’t want them around. The Library of Alexandria is probably best known for not being here anymore. This is why lots of copies keeps stuff safe. Along those lines, the Internet Archive makes sure to store their data in mirror locations – and by providing information to the archive, we’re ensuring that our data is also kept safe and available. This idea of large-scale swap agreements (us sharing with the Internet Archive, us sharing with other libraries, etc.) across different geographical regions gives us some level of preservation.

How it started

The Internet Archive started by collecting the world wide web – taking a snapshot of the web every 2 months. Brewster showed Yahoo! 10 years ago – ironically, a bit of data that even Yahoo! didn’t have – so for their 10-year anniversary they had to ask the Internet Archive for a copy of what their site looked like! He showed us the first version of Code4Lib’s site and exclaimed “Gosh is that geeky!” because it was a simple black-text-on-white-background page.

While it may have seemed a bit ambitious to archive the web, the Wayback Machine gets about 500 hits a second. And it turns out that the out-of-print materials on the web are often just as valuable as the in-print information. People are looking for the way things were, for historical or cultural research reasons, and this tool makes that possible.


Audio

The Grateful Dead started a tradition in the ’60s of allowing people to record their concerts and share them with others – this tradition of tape trading caught on, and lots of bands were doing it. Following in this tradition, the Internet Archive decided to offer unlimited storage and unlimited bandwidth, free, to any band that wanted to provide recordings of their concerts to the archive. It’s a bit different from tape trading, but an amazing idea! They are getting 1 or 2 new bands a day – around 30,000 concerts now – and it’s working! Overall the community is building the best metadata Brewster’s ever seen – beautiful work supported by a community – just what I love to hear!!

This shows that librarians can provide a role other than providing information – they can provide back end storage for information. By giving people like these bands a place to store their music for free, the Internet Archive made it so that concerts are now available online for those in search of them!

Moving Images

1,000 movies that are out of copyright are available via the Internet Archive. Interestingly, the popular things are movies you can’t get any other way – movies you wouldn’t expect people to be interested in at all: government films, social-behavior films like the ones you saw in high school when you had a substitute teacher – they’re fantastically popular. Brewster theorizes, and I tend to agree, that people are using these videos as research tools to see what things were like culturally at different times in history.

Brewster is a follower of the “it’s easier to apologize than ask permission” philosophy, and it has worked very well for him and the organization. You probably have a closet of video tapes that are just waiting to go online – so put them online, and if people ask you to take something down, take it down. One example most of us have probably seen is the Lego movies. Brewster found this genre of movies fascinating – but he mentioned that if it weren’t for the free storage on the archive (pre-YouTube), these movies may never have spread so widely. He described this as the library supporting a community that had no home before. We’re here to put things on shelves and give things away – so why not put things online and give them away?


Television

The Internet Archive only has one week of TV available so far – 9/11–9/18/2001. This shows a full picture of what people were watching during that horrible week. (Update: I may have misunderstood – as I view the archive site I see more than just this…)

Apparently there is someone out there in North Carolina recording TV non-stop on 20 channels in DVD quality. It costs him about $15 per video hour to digitize, and he has over 50,000 videos in his archive. You can’t get the full picture from just one point of view (you need multiple channels) – news may say it’s fair and balanced, but it’s not – you don’t want just Jon Stewart as your archive of news :)


Software

Not much has been archived, because of licensing issues – it’s doable, just not legal yet.


Text

This is where Brewster sees the biggest opportunity for traditional libraries to participate. We have in our charge the responsibility to distribute print/books.

We, as librarians, have to work very hard on text. Look at what we did with journals – we handed them to corporations and now we have to rent them back :( If we had never let that happen in the first place, we wouldn’t be wondering how to digitize our journals now. The same thing is going on with monographs – we’re handing them over to corporations when we should be doing this ourselves, and the Internet Archive wants to help.

There are 26 million books in the Library of Congress – at about 1MB per book, that’s 26TB for the whole Library of Congress. For $60,000 you could have the entire Library of Congress digitized.
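Brewster’s back-of-the-envelope math checks out – here’s a quick sketch of the arithmetic (the function name is mine, and 1MB/book is his figure for compressed text, not page images):

```python
def collection_size_tb(books, mb_per_book=1.0):
    """Rough storage estimate for a digitized text collection.

    Assumes ~1MB of compressed text per book (Brewster's figure);
    page images would be orders of magnitude larger.
    """
    return books * mb_per_book / 1_000_000  # MB -> TB, decimal units


# 26 million books in the Library of Congress at ~1MB each
print(collection_size_tb(26_000_000))  # 26.0 (TB)
```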

Brewster’s goal sounds like a simple one – “one webpage for every book ever published.” What would it take to do this?

First off, we’d have to scan a whole heck of a lot of books – and get the catalog data.

The archive has experimented with a few methods. First they worked with the Million Book Project – they shipped their books to India, and they learned not to ship their books to India. Brewster recommends letting the scanning centers in India scan the books they like – but keep your own books at home. Instead, the archive found that for 10 cents a page they could scan items in house. They came up with their own scanner, with a person turning the pages of the book – they tried robots, but they weren’t great (they may be better now). At the University of Toronto this method produces a million pages a month.

So, for the cost of copying a page at Kinkos you can digitize it, add MARC records, and share it with the world. Most importantly, it’s being done by librarians – out of the corporate sphere. We need to demand the right to give our books away – not have our books owned by corporations who will rent the content back to us with restrictions attached.

Some quotes from Brewster: “Please help support these scanning centers while they’re up and running … take collections that you’ve got and have them digitized and start building services around them.” If we’re going to build one web page for every book, we’re going to have to scan a lot of books. One service you could add is a scan-on-demand link in your catalog. Have patrons click this link to have a book scanned – it’s the same cost as ILL – so you might as well scan it and put it on the web for anyone to use.

Then you can provide your digital copies via ILL, Brewster states: “I don’t know what loan means in the digital world – but let’s figure it out!” Why wait for someone else to tell us?

Next, let’s scan all the microfilm. Someone came up to Brewster after one of his talks and said, “we’ve done this before – it’s called microfilm.” So why not digitize our microfilm as well? For less than 10 cents a page they can do all microfilm. The Internet Archive is actually doing a large-scale microfilm scanning project right now using the Carnegie model. Apparently Carnegie would build your library for you if you promised to stock it with books and materials. So the Kahle/Austin Foundation will donate a microfilm scanner to your organization for X years if the library will keep it up and running for X hours a week. This costs only labor and time – no money has to change hands. In the end, we’ve digitized all of our microfilm and made it more accessible.

This made me think of a question – if years ago people said you should microfilm everything and now everyone’s saying you should digitize it – what’s to say that in another 50 years there won’t be another format? This sounds to me like a never ending loop – but at the same time it sounds like such an obvious progression given the technology we have and the types of users we’re dealing with.

Next, we need better selection – right now we’re just digitizing whatever we’re handed, which means we don’t have full collections. Because of this, the Internet Archive now has 90 sponsor collections. “We need help!” – Brewster asks that we each pick an area of cataloged material and share it digitally – think outside of your own library. For some reason librarians seem to think they’re only responsible for digital copies of materials they hold in their own library – but why only have digital copies of items you have in print? Keep digital copies of things from other libraries too; you want a full collection in your library’s area of study. This was something I was working on at the Seminary: finding digital copies of materials I thought would interest our students and importing those OCLC records into our catalog. Just another way to provide access to data.

The next step according to Brewster is to build the catalog, and “we finally need to do this FRBR thing – come on guys, it’s not that hard!!!” Even if the digital copy of a book isn’t available yet, it makes sense to provide a page for the book with catalog data that pulls information from sites like Amazon and other book-information sites.

[Photo: Code4Lib – Day 1, originally uploaded by nengard]

When the books are available, we need to work on our displays. Many of our displays are lacking. We need better search functions and open APIs to allow people to re-purpose our data in ways that make sense for them. We also need book viewers with pages that flip, the ability to zoom in, and printable output. In fact, the Internet Archive offers a service where people can print books out in real paperback-looking formats.

[Photo: Code4Lib – Day 1, originally uploaded by nengard]

Another option is to use the One Laptop per Child as an ebook reader. The Kindle handles ASCII formats okay – but not the types of images that we’re creating for our digital collections.


We have to work together on building this! We can’t just check back in a year and see what’s happening – instead of waiting for others to do the work – why not contribute? We want to be able to build some great services that will allow people to bulk download these materials and re-purpose them if they want.

One way is to join the Open Content Alliance – there are over 80 libraries now. It’s free to join, you just have to contribute.

The next step is to get service layers in place – this is where the code4libers come in. We have the skills to make the Internet Archive even more accessible and valuable.

Questions & Answers

Dan Chudnov asked what he called “tough questions” – now that some companies like Reed Elsevier are trying to change their business models from journal sales to other routes, is there an opportunity to go and buy up their journal services so we get our data back?

Brewster’s answer: there is a way to do this – some people are trying – but until it comes to the point where they aren’t making money any more, we’re going to have to keep scanning ourselves.

Dan’s other question – is power an issue?

Brewster – power is costly, but not running out any time soon.

Another question: the data is only good as long as the disks are still spinning – how do you make it last for years?

Brewster: the question is a good one – the real way to have long-term preservation is access – access drives preservation. Dark archives lead to data being lost. We have to replace our machines every few years to keep up. Tapes suck! Have you ever tried to read them back??? If there are at least 5 copies in 5 organizations, then I can sleep.

Real Conclusion

“if you’re frustrated enough – please come and help!” — Brewster

What an amazing way to stop! What an amazing way to start the conference! So many people were completely inspired, I can’t wait to see what comes of this talk – I hope some amazing APIs start popping up!

[update] Video online [/update]


Using DOIs in Blogs

When I was at the Seminary we were looking for a persistent identifier for our digital collections. We ended up choosing to use DOIs. So, when I saw this press release I thought – cool – we made the right choice:


Lynnfield, MA. February 12, 2008. — CrossRef, the association behind the well-known publisher linking network, announced today that it had launched the beta version of a new plug-in that allows bloggers to look up and insert DOI®-enabled citations in the course of authoring a blog.

The plug-in, which is available for download at:, allows the blogger to use a widget-based interface to search CrossRef metadata using citations or partial citations. The results of the search, with multiple hits, are displayed and the author can then either click on a hit to follow the DOI to the publisher’s site, or click on an icon next to the hit to insert the citation into their blog entry (as either a full citation or as a short “op. cit.”).

Thanks John for pointing it out.
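The plug-in link didn’t survive this excerpt, but the underlying lookup is easy to picture. Here’s a rough sketch against CrossRef’s public REST API (the api.crossref.org endpoint postdates this 2008 plug-in, and the function names are mine – treat this as an illustration, not the plug-in’s actual code):

```python
import json
import urllib.parse
import urllib.request

API = "https://api.crossref.org/works"  # CrossRef's public REST API


def build_query_url(citation, rows=5):
    """Turn a (partial) citation string into a CrossRef search URL."""
    params = urllib.parse.urlencode(
        {"query.bibliographic": citation, "rows": rows}
    )
    return f"{API}?{params}"


def format_citation(item):
    """Render one CrossRef 'work' record as a short text citation."""
    authors = ", ".join(
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in item.get("author", [])
    )
    title = (item.get("title") or [""])[0]
    year = item.get("issued", {}).get("date-parts", [[None]])[0][0]
    doi = item.get("DOI", "")
    return f"{authors} ({year}). {title}. https://doi.org/{doi}"


def lookup(citation):
    """Search CrossRef and format the top hits (requires network)."""
    with urllib.request.urlopen(build_query_url(citation)) as resp:
        items = json.load(resp)["message"]["items"]
    return [format_citation(i) for i in items]
```

A blog widget like the one described would just run something like `lookup("Kahle Universal access to all knowledge")` and let the author click a hit to insert the formatted citation, DOI link and all.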


LOC & Flickr

I taught a class on Thursday on the 2.0 Office. At the end I had some extra time, so I showed some fun social tools that you can find professional uses for. One of these tools was Flickr. Well, it turns out (thanks, David, for pointing it out) that the Library of Congress has come up with a pretty awesome way to use Flickr.

The project is beginning somewhat modestly, but we hope to learn a lot from it. Out of some 14 million prints, photographs and other visual materials at the Library of Congress, more than 3,000 photos from two of our most popular collections are being made available on our new Flickr page, to include only images for which no copyright restrictions are known to exist.

This is a great idea!! I love it!


Stop Making Sense

Last night I attended a talk at Princeton titled Stop Making Sense: On Collecting, Sorting and Presenting Data, presented by Rudolf Frieling, Curator of Media Arts at SFMOMA, San Francisco. I have to start by saying that the artsy parts lost me! Frieling would show an art piece and say “of course you’ve seen this” or “you know this” – and I’d be thinking “huh? should I?”

Other than that, this was an interesting talk about how we organize our data, and how technology is changing so fast and so much that our current delivery and storage methods are not going to be the delivery and storage methods of the future – so how does one successfully archive media materials? When Frieling was introduced, the professor mentioned a few stories that were a bit funny – but also very sad if you think about them. The first was that when presenting in a newly built theater, Frieling found that he could not play his VHS tape because the people who designed the theater had decided that VHS was no longer a valid storage format. The second was about a store here in town that sold its entire collection of VHS tapes to an artist so that he could make a sculpture out of them – the store no longer sells VHS tapes. The final story was about the university library no longer storing VHS tapes. He had approached them to ask for space in the high-density storage unit for his tapes, and the library said they were no longer keeping tapes – anyone who had contributed to the library’s VHS collection could come pick up their items, or they would be given away first come, first served.

Along those lines, my husband and I donated all of our VHS tapes to the local public library a couple of years ago – the plan being to replace them with DVDs – a media type that takes up less space on our shelves and that we found ourselves using more than VHS.

Frieling provided some keywords for his talk (I didn’t catch them all): collecting, linking, presenting – all in terms of data. The fact of the matter is (and we librarians know this already) that not everything is available online, and when it is, it may not be accessible because of hardware, software, or firewall reasons. He spoke of a tool that he and others had developed for CD-ROM that no longer works on current systems due to hardware and software changes. He spoke of websites developed at the beginning of the web that no longer work as intended because they were built with the system limitations of the time in mind. The long and short of it is that systems change – so as archivists and curators, how are we going to preserve information for future generations?

Frieling mentioned a TV show collector by the name of (excuse any misspellings – the font was small and I was in the back of the room) Pentti Pajukallio. This man has spent most of his life recording TV shows and collecting the VHS tapes. He only stopped to have open-heart surgery, and even then his wife recorded what she could for him. The question is: what value does this collection have to anyone but Pentti? And if it does have value for others, how will we access it?

One of the best slides (for me) was the one of a pile of 3×5 index cards that Frieling had put together as his first database. These cards contained bibliographic references that were of use to him. He keeps this “database” today because it has nostalgic value for him – but most of the references are probably inaccessible, unavailable, or even outdated. The collection only has value to him or to those studying him. Another great point he brought up in reference to his note cards: information, like technology, is always changing, so databases like this are not always going to be valuable – so are they worth archiving and making accessible? I don’t know – that was the question of the night.

One great quote came when Frieling mentioned that now that we have search engines and the world wide web, it’s even harder to find the “pearl among the rubbish” when we’re browsing through collections. Books are a strong model for providing content. They can be browsed, you can jump back and forth, or you can read cover to cover. This 2D model (sounds a bit like Weinberger’s first order of order) lets the user read the text as-is or randomly, but it’s physical – it’s the pearl, and it’s easy (in theory) to find because it’s not (in theory) surrounded by rubbish.

When it comes to webpages, we may think of the “home” page as the entry point into our site, but in reality people are entering our sites from every which way, because search engines are indexing all (once again – in theory) of our pages and providing them piecemeal to searchers. Frieling described this as users coming at our sites diagonally instead of straight on, like they do with books. This means they only get parts of the information we’re providing and don’t get the whole picture.

One way to look at information or media is that each item has two stories. One story is that of the artist or the collector, and is usually personal in nature. The other story is that of the viewer; it gives us the perspective of the outsider. That’s the perspective we give in our catalogs – the perspective of the cataloger viewing the item – so why not let the other “viewers” (our patrons) add their perspectives as well? This isn’t something Frieling said exactly – just something I thought of when he started talking about the two stories. What he did show us was Steve (the social tagging project for art museums) and how allowing others to add tags to art gave the pieces a whole new perspective and a whole new value.

He ended by showing us the Way Maker (if you have a link, please share it with me). You download this program to your phone, attach the phone to your body, and record your life through your own eyes. Does this hold value for anyone but you? Maybe not – but it lets you see your life from another perspective. It shows you things you maybe weren’t paying attention to throughout the day – and maybe even makes you more aware of your surroundings. Would a series of videos like this be worth archiving? Who knows – maybe it would be educational for future generations or other cultures to see what a day in the life of Nicole is like. Would I do it? Nope! I don’t need to go to that level of sharing my life – I have this blog and my personal networks – that’s enough for me :)

It was a great talk; while the art aspects were over my head, I’m glad I attended. I just wish more links had been provided or that the slides were available, as I’d like to link you to more information and I don’t have the time just now to research Pentti or the Way Maker.

Metadata Tools

I just read a few quotes from the report of the RLG Programs metadata practice survey on Lorcan Dempsey’s blog (I haven’t read the whole report yet) and wanted to add to his comments. The report says:

… RLG Programs surveyed 18 Partner institutions in July and August 2007 to obtain a baseline understanding of their current descriptive metadata practices. Although we saw some expected variations in practice across libraries, archives and museums, we were struck by the high levels of customization and local tool development, the limited extent to which tools and practices are, or can be, shared (both within and across institutions), the lack of confidence institutions have in the effectiveness of their tools, and the disconnect between their interest in creating metadata to serve their primary audiences and the inability to serve that audience within the most commonly used discovery systems (such as Google, Yahoo, etc.).

I have heard this many times. At our library we use a combination of metadata standards and the MarkLogic XML Content Server to deliver the information to our patrons.

That said – while our delivery system is awesome, creating a METS document is one of the most cumbersome things I’ve ever had to do! The standard is amazing – it has such power – and I can’t think how to make creating documents less stressful, but it just seems like someone created this standard to torture librarians. This is probably why so many librarians are unsure of their tools and their metadata.

I also find that there are many choices – perhaps too many choices – for how we can format our data. There is Dublin Core, MODS, MARCXML, etc. As a cataloger I say we need to use MARCXML – it holds the most data and stays in line with our print collections. As a programmer I say MODS is the easiest to read and retrieve data from. And as a lazy person (yes, I too can be lazy) I say Dublin Core, because I only need to enter minimal information.
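To make that trade-off concrete, here’s a toy sketch of the same book described in minimal simple Dublin Core and in MODS. The element names follow the published schemas, but these skeletal records are my illustration, not any library’s actual practice:

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # simple Dublin Core namespace
MODS = "http://www.loc.gov/mods/v3"       # MODS namespace


def dublin_core(title, creator, date):
    """Minimal simple-DC record: flat and quick, but little granularity."""
    rec = ET.Element("record")
    for tag, text in (("title", title), ("creator", creator), ("date", date)):
        ET.SubElement(rec, f"{{{DC}}}{tag}").text = text
    return rec


def mods(title, creator, date):
    """Minimal MODS record: nested, so each field keeps its role."""
    rec = ET.Element(f"{{{MODS}}}mods")
    title_info = ET.SubElement(rec, f"{{{MODS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS}}}title").text = title
    name = ET.SubElement(rec, f"{{{MODS}}}name")
    ET.SubElement(name, f"{{{MODS}}}namePart").text = creator
    origin = ET.SubElement(rec, f"{{{MODS}}}originInfo")
    ET.SubElement(origin, f"{{{MODS}}}dateIssued").text = date
    return rec


dc_rec = dublin_core("Adventures of Huckleberry Finn", "Twain, Mark", "1884")
mods_rec = mods("Adventures of Huckleberry Finn", "Twain, Mark", "1884")
```

Even in this toy form you can see the difference: the Dublin Core record is flat and fast to produce, while MODS wraps each value in structure that preserves its role – which is exactly the granularity-versus-effort trade-off I keep running into.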

But how do you make these decisions? And have I gotten totally off track? I don’t have any hard and fast answers for you – all I know is that I sympathize with librarians who are unsure and think I should go and read the entire report before adding anything else.

Open Source MODS-generating software

Via Metadatalibrarians:

The University of Tennessee Digital Library Center is proud to announce the release of the DLC-MODS Workbook, version 1.2 under the GNU General Public License version 3.

The DLC-MODS Workbook provides a series of web pages that enable users to easily generate complex, valid MODS metadata records that meet the 1-4 levels of specification outlined in the Digital Library Federation Implementation Guidelines for Shareable MODS Records, (DLF Aquifer Guidelines November 2006).

Developed by programmer Christine Haygood Deane under the direction of metadata librarian Melanie Feltner-Reichert, this open-source client-side software provides control of date formats and other problematic fields at the point of creation, while shielding creators from the need to work in XML. Records can be partially created, saved to the desktop, and reloaded and completed at a later date.

Final versions can be downloaded or cut-and-pasted into text editors for use elsewhere.

Developed in support for our state-wide digitization project, Volunteer Voices, we hope this system will assist others in their efforts to create valuable digital libraries also. The software can be viewed here and downloaded here.

Please address comments and questions to Melanie Feltner-Reichert ( ) and Cricket Deane ( ).


New Mark Twain Digital Collection

I just got this via a few of my mailing lists and thought I should share with you all.

I'm happy to announce that today the University of California launched the beta version of Mark Twain Project Online, a digital critical edition of the writings of Mark Twain, providing access to more than twenty-three hundred letters written between 1853 and 1880, including nearly 100 facsimiles of originals. The site is driven by metadata captured in METS records, the content was encoded in TEI P4, and the search, browse and display functionality was built using the XTF (the eXtensible Text Framework).

Read the full press release here.
