This HTML page is an extract of the deliverable “Report on Backlist Data and gap analysis”, available to download as PDF (1.6 MB) or EPUB (800 KB).
Without precise data about the composition of ebook backlists in Europe, it is not possible to evaluate the best workflows and the cost of remediation of these ebooks. Without precise data about new releases, it is difficult to evaluate the potential evolution of the number of accessible ebooks before and after the European Accessibility Act comes into force, and therefore the success of the EAA in the ebook publishing industry.
The data presented here were necessary to effectively set up the next steps of the project and to define new research possibilities on the topic; they also provided a basis for future documents and reports that will be published in the context of the ABE Lab project.
This report presents the results of the analysis of the size and composition of ebook backlists and their segmentation, and provides trends and insights relevant to the scope of our research. It first presents our approach, including the difficulties we faced, and then provides the overview and segmentation we achieved. The Insights section recapitulates the main findings and the Outcomes section details how this work is used in our project.
Approach
For this study we define ebook backlists in Europe as the collection of all ebooks available on the EU market. Since there is no central source at the European level that provides this information, many publication registration offices, distributors, aggregators, resellers and national libraries were contacted. Given that each of these organisations handles only a partial data set and that these data sets overlap, we need to be careful when drawing conclusions. Ebooks made available on the EU market via e-commerce platforms, distributors and retailers operating on the international market also make up a large proportion of the backlist, especially ebooks from the UK, the USA and Canada.
We faced two main difficulties in the data collection and analysis activities:
catalogue overlaps, which we resolved by going back to registration offices for help with filtering, so that the figures reflect the actual data;
book categorization schemas1 that are not harmonised across EU countries. We based our work on Thema, a recognized international standard used in many countries, but national classifications like CLIL in France or NUR in the Netherlands still exist, and mapping them to Thema has been challenging even with the mappings provided.
More details about these difficulties and how we managed them are given in the following subsections.
Counting titles
To get information about the number of ebooks that have been published in EU member states, we started by contacting all EU ISBN agencies. Summing the national production figures should give a good first estimate of the total. However, there are some caveats:
some titles are published in more than one digital format (like EPUB, PDF, KF8), which results in counting manifestations2, not unique works;
some ebook titles get a new ISBN with a new version, so again we count not unique works, but different manifestations of the same ebook;
some ISBN agencies do not assign ISBNs on a per title basis, instead they give publishers a large range of numbers to use. As a result, these ISBN agencies do not have a record of how many titles are published;
not all ISBN agencies are informed when ebooks are no longer available on the market.
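The difference between counting manifestations and counting unique works can be illustrated with a minimal Python sketch. The records, the ISBN placeholders and the `work_id` field are hypothetical: ISBN registries do not necessarily expose a work identifier, which is one reason why ISBN-based totals count manifestations.

```python
# Hypothetical registry records: (isbn, work_id, file_format).
# The work_id field is an assumption made for illustration only.
records = [
    ("isbn-0001", "work-A", "EPUB"),
    ("isbn-0002", "work-A", "PDF"),   # same work, second format
    ("isbn-0003", "work-A", "EPUB"),  # same work, new version with a new ISBN
    ("isbn-0004", "work-B", "EPUB"),
]


def count_manifestations(rows):
    """Number of distinct ISBNs, i.e. what an ISBN-based total measures."""
    return len({isbn for isbn, _, _ in rows})


def count_works(rows):
    """Number of distinct works, which requires a reliable work identifier."""
    return len({work_id for _, work_id, _ in rows})


manifestations = count_manifestations(records)
works = count_works(records)
print(f"{manifestations} manifestations for {works} unique works")
```

In this toy data set an ISBN-based count reports four titles where only two works exist.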
To build a more complete estimate of what is on the EU market, it is not sufficient to know what is published locally; we also have to take into account what is imported from abroad. Especially for certain categories of non-fiction ebooks (like computer books), we see that local production is rather limited and imported ebooks in English are dominant. Some of these titles are distributed via local distributors and sold by local retailers, but we also see online platforms that operate worldwide, such as Amazon, Kobo, Google and Apple. To take this scenario into account when calculating the size of the backlists, we contacted distributors3, aggregators4 and resellers5. There are caveats here as well:
some distributors and aggregators handle only a part of the market, for example only trade books or scientific books, or only books in certain languages. Relying on data provided by a single distributor or aggregator is therefore not sufficient to obtain a complete overview, and we must extend the collection to multiple parties. In addition, when we receive aggregated data rather than detailed information down to the individual ISBN level, it is impossible to know whether different collections overlap and how large this overlap is, so we need to be careful when adding them together;
some distributors and aggregators supply both titles from non-EU publishers and titles published in an EU country. Without detailed data it is impossible to distinguish between titles that are published in the country itself and titles produced by foreign publishers, which might also overlap with the titles we count in other EU countries;
retailers sometimes collect titles from multiple distributors and aggregators, so it is also not possible to rely purely on the numbers resellers provide;
platforms that operate worldwide hardly provide detailed information about their collections. And since they operate in a different way, their collections are not even directly comparable to those of the traditional ‘publisher-distributor-retailer’ chains. For example, Amazon does not only provide ebooks with an ISBN, that is, produced by publishers, but also a lot of self-published titles, often without an ISBN.
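The effect of missing ISBN-level detail can be sketched as follows. With aggregated totals only, the number of distinct titles across catalogues can merely be bounded; with ISBN lists it can be computed exactly. The catalogue contents below are hypothetical, and the lower bound assumes that no catalogue contains internal duplicates.

```python
def union_bounds(counts):
    """Bounds on the number of distinct titles when only totals are known.

    Lower bound: the largest catalogue (full overlap with all the others).
    Upper bound: the plain sum (no overlap at all).
    """
    return max(counts), sum(counts)


def exact_union(catalogues):
    """Exact number of distinct titles when ISBN-level lists are available."""
    distinct = set()
    for isbns in catalogues:
        distinct |= set(isbns)
    return len(distinct)


# Hypothetical catalogues from two distributors.
catalogue_a = ["isbn-1", "isbn-2", "isbn-3"]
catalogue_b = ["isbn-3", "isbn-4"]

low, high = union_bounds([len(catalogue_a), len(catalogue_b)])
exact = exact_union([catalogue_a, catalogue_b])
print(f"between {low} and {high} titles from totals, exactly {exact} from ISBN lists")
```

Simply adding the totals always yields the upper bound, which is why aggregated figures from overlapping sources have to be combined with care.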
National libraries are another possible source of information about ebooks published in a country, and sometimes they also play a role in providing ISBNs to publishers. In fact, in some cases, when we requested data from the official ISBN agency, it was the national library of the country that answered. Sometimes we also contacted national libraries directly to get information on the availability of ebooks. Especially in countries where publishers have a legal obligation to deposit a copy of their publications at the national library, the library may have an overview of what is published in the country and therefore represents a valuable source of information.
Classifying titles
Categories
Book categories are a way to isolate and regroup books with common characteristics. Those characteristics may explain production choices and the technologies used.
To group titles for the purposes of our project, we chose to refer to the Thema subject category scheme6 established and maintained by EDItEUR. Thema aims to be the scheme for a global book trade and is currently the most commonly used classification methodology. Even though the use of the Thema classification has progressed a lot in the last few years, it must be noted that it is not the only classification in use and, since it is still relatively new, not all ebooks have been assigned Thema codes yet.
To resolve this inconsistency, EDItEUR provides a series of documents mapping codes from different book schemas to Thema codes.7 Because standard classifications have different logics, those mappings may not be one-to-one and sometimes need interpretation. One example is the CLIL8 to Thema mapping, where the Young Adults category can be mapped to two main Thema categories: Fiction (Thema code F) and Children, Teenage and Educational (Thema code Y).
As a consequence, we had to spend time on understanding differences in classification methodologies and which choices to apply to mappings.
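A one-to-many mapping of this kind can be sketched in Python as follows. The mapping table is a hypothetical extract: its keys are illustrative labels rather than real CLIL codes, and the override shown in the usage lines is only an example, not the choice made in this study.

```python
# Hypothetical extract of a source-scheme-to-Thema mapping.
# Keys are illustrative labels, not real CLIL codes.
MAPPING = {
    "Young Adults": ["F", "Y"],  # ambiguous, as in the example given above
    "Novels": ["F"],             # hypothetical one-to-one entry
}


def map_to_thema(source_category, overrides=None):
    """Return one Thema top-level code, or None when a decision is needed."""
    overrides = overrides or {}
    if source_category in overrides:
        return overrides[source_category]
    candidates = MAPPING.get(source_category, [])
    if len(candidates) == 1:
        return candidates[0]
    return None  # unmapped or ambiguous: needs human interpretation


print(map_to_thema("Novels"))
print(map_to_thema("Young Adults"))
print(map_to_thema("Young Adults", overrides={"Young Adults": "Y"}))
```

Returning `None` for ambiguous entries keeps the interpretation step explicit instead of silently picking the first candidate.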
Formats
Ebooks can come in different formats. For the purposes of our project, we focus on the two mainstream formats widely adopted worldwide: PDF and EPUB.
PDF (Portable Document Format) is a document format developed by Adobe for document interchange, often used for digital documents with a complex layout or when the structure and graphics of a paper document or book must be reproduced. Initially developed for print reliability and digital conservation, the format is based on PostScript, a computer language that positions characters and graphic elements absolutely on the page (similar to giving each element x and y coordinates). As PDF came to be used for on-screen reading, the format evolved to serve this use as well as possible, principally through the addition of a semantic descriptive layer composed of XML9 language tags. The variety of tags that can be added is currently limited to 28 elements10. The PDF/UA ISO11 standard provides definitive terms and requirements for accessibility in PDF documents and applications.
EPUB (Electronic PUBlication) is an open file format for electronic publications based on Web standards (HTML, CSS, JavaScript). The first version, named OEBPS 1.0 (Open eBook Publication Structure), was approved in 1999 by the Open eBook Forum, which later became the International Digital Publishing Forum12 (IDPF). EPUB 2.0 was released in 2007 and revised as EPUB 2.0.1 in 2010; EPUB 3.0 followed in 2011 and was revised as EPUB 3.0.1 in June 2014. Version 3 was the first whose specifications included accessibility features derived from the DAISY specialized format.13 Just after the release of EPUB 3.1 in January 2017, the IDPF merged into the World Wide Web Consortium (W3C), which has maintained EPUB since then. The latest version, EPUB 3.3, was published as a W3C Recommendation (standard) on May 25, 202314. EPUB is a natively semantic format that allows markup from several standardised languages such as HTML, ARIA, MathML, SVG and others.
Years
Another way to classify titles is by year of production. We realised that book categories and formats are not enough and that we also had to make a per-year segmentation. As already mentioned, formats have evolved over the years in terms of the accessibility features they support. The level of accessibility also depends on the accessibility guidelines available at the time the ebook was produced. For example, EPUB Accessibility 1.0, the first version of the guidelines for creating accessible EPUBs, was published in 2017. We can therefore assume that EPUBs created before this date will have a very low or zero level of accessibility, as publishers lacked clear reference specifications at the time of production.
The increasing focus on the accessibility of digital content, including ebooks, has also led to improvements in ebook production tools, which in recent years have progressively introduced support for accessibility features, allowing publishers to produce ebooks that are increasingly accessible and compliant with international guidelines. In parallel, production workflows have also evolved to take accessibility into account.
All these aspects - the format and its version, the availability of accessibility guidelines, the support of accessibility by production tools and the adaptation of workflows - are reflected in the way ebook files have been created over the years. For building a representative sample set for our research, we have to take this development into account.
Overview of the EU ebook backlist
We have collected basic data directly from 18 countries and detailed data from 5 countries. The collected data provide precise information about the number of titles, and the detailed data sets also contain a breakdown by category and format. We combined these collections with the annual data published by the Federation of European Publishers (FEP) to get an idea of the markets.
A considerable number of the ebooks currently on the EU market are provided by e-commerce platforms and resellers operating on the international market. Many of these titles are from countries like the USA, the UK and Canada. Since these operators are very unlikely to make their data available, ebooks on the EU market that come from countries outside the EU cannot be investigated. However, given the market knowledge we already have, we can assume that the trends identified in the data we have available also apply to titles from outside the EU.
Since important differences exist between EU countries when it comes to their ebook backlists, and these could lead to different conclusions, we first present a general overview of the data collected and of the growth in ebook production at the European level, and then more detailed views per catalogue.
Total
The last available FEP annual statistic report15 presents data from 2021. Based on declarations from the national book publishing associations, it reported a total of 13.4 million titles available in the active catalogues of European publishers. 3 million of them are ebooks, representing 22% of the titles on the market, for 12% of sales.
Based on the summation of the numbers we collected from individual EU countries, we established that, in early 2023, the backlist of ebooks available in the EU market exceeded 3.5 million.
Since ebooks can be easily traded cross-border, the actual number of titles available to consumers is much higher (see section Titles from outside EU).
Catalogue growth
The evolution of the book market as reported by the Federation of European Publishers16 is very important to the topic we are discussing. Between 2005 and 2021 we have seen a steady increase in the number of book titles in commerce (+260%). While the increase in the number of new titles has been affected by the COVID-19 pandemic, this does not apply to ebook titles in commerce.
Our data collection reflects strong differences between territories, as we identified in the detailed data provided:
in France 42.000 unique titles were added in 2013 compared to 115.000 titles in 2022;
in Italy the production is stable and steady, with 29.000 unique titles added in 2013 versus 24.000 titles in 2022, but the years 2014 to 2021 had around 35.000 titles per year;
in the Netherlands the number of ebooks added to the market has been relatively stable over the years. In 2012 some 10.000 titles were added, and in 2022 this was around 8.000.
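The differences between territories become easier to compare when expressed as relative changes. The sketch below uses only the endpoint figures quoted above; the helper name `percent_change` is ours, and the Dutch figures are the approximate values given in the text.

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100


# (titles added in the first year, titles added in 2022), from the list above.
added_titles = {
    "France": (42_000, 115_000),     # 2013 -> 2022
    "Italy": (29_000, 24_000),       # 2013 -> 2022
    "Netherlands": (10_000, 8_000),  # 2012 -> 2022, approximate figures
}

changes = {
    country: round(percent_change(start, end))
    for country, (start, end) in added_titles.items()
}
for country, change in changes.items():
    print(f"{country}: {change:+d}%")
```

Endpoint comparisons hide the intermediate years: the Italian figures for 2014 to 2021 were around 35.000 titles per year, above both endpoints.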
Number of titles in the backlist per country
The following list presents the data we have collected and consolidated. It is ordered roughly by descending number of ebooks made available per country. The detailed data we obtained came from 5 of the 6 countries with the biggest backlists.
Germany: 1.055.369 currently in distribution (source: MVB GmbH). We notice overlaps with countries with active German speakers (Austria, Switzerland).
France: 952.416 currently in distribution, for a total of 1.310.274 ebooks registered (source: Dilicom). We notice exchanges with Canada (Quebec) and overlaps with other countries with active francophones speakers (Belgium, Switzerland).
Italy: 376.097 (source: IE - Informazioni Editoriali).
Spain: 336.757 (source: Dilve).
Poland: 138.415 registered (but expected higher) at the ISBN service of Biblioteka Narodowa.
The Netherlands: 102.000 registered at ISBN.NL and 70.000 available from CB Logistics.
Czechia: 126.229 ebooks registered at Czech National ISBN agency.
Sweden: 107.561 ebooks registered at National Library of Sweden, but actual numbers expected significantly higher.
Hungary: 84.571 registered at National Széchényi Library.
Denmark: 81.324 Danish titles available from Publizon (but only 39.388 ebooks registered at DBC Digital).
Portugal: 58.000 ISBNs associated to ebooks according to APEL.
Greece: 30.059 ebooks registered at Greek ISBN agency.
Lithuania: 18.300 ebooks registered at Lithuania National Library.
Slovenia: 17.868 according to Slovenian ISBN agency.
Estonia: 11.685 according to Estonian ISBN agency.
Bulgaria: 14.129 ebooks registered at ISBN Bulgaria.
Latvia: 6.739 ISBNs assigned to ebooks, National Library of Latvia.
Ireland: 4.914 Irish titles according to Nielsen Bookdata.
Malta: 583 (source: National Book Council).
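The summation behind the overall estimate can be reproduced with a short Python sketch. The figures are copied from the list above; where two figures are given for a country (France, the Netherlands, Denmark), using the first one is an assumption of this sketch, and overlaps between countries are not removed.

```python
# National figures from the list above (first figure quoted per country).
backlist = {
    "Germany": 1_055_369,
    "France": 952_416,       # currently in distribution, not the registered total
    "Italy": 376_097,
    "Spain": 336_757,
    "Poland": 138_415,
    "Netherlands": 102_000,  # ISBN.NL figure, not the CB Logistics one
    "Czechia": 126_229,
    "Sweden": 107_561,
    "Hungary": 84_571,
    "Denmark": 81_324,       # Publizon figure, not the DBC Digital one
    "Portugal": 58_000,
    "Greece": 30_059,
    "Lithuania": 18_300,
    "Slovenia": 17_868,
    "Estonia": 11_685,
    "Bulgaria": 14_129,
    "Latvia": 6_739,
    "Ireland": 4_914,
    "Malta": 583,
}

total = sum(backlist.values())
print(f"{len(backlist)} countries, {total:,} ebooks")
```

The plain sum is slightly above 3.5 million; it ignores both the overlaps between language areas and the countries for which data are missing.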
Missing data
Austria
Belgium
Croatia
Cyprus
Finland
Luxembourg
Romania
Slovakia
Titles from outside EU
Buying ebooks from non-European retailers is easy since the deliverable is a file that does not pass through border controls. It is especially the case for ebooks in the English language, which are very popular for certain categories (like computer books or scientific publications) and often outnumber local productions in these categories.
Since the EAA targets ebooks on the European market, we focused on data on ebook sales by European retailers. It is important to note that international ebook e-commerce platforms operate on the European market too, but getting accurate data from them is not simple, as they operate outside traditional distribution channels.
We provide here a quick overview of the collections available through those platforms:
Amazon Kindle. Though Amazon does not publish exact numbers about the ebooks available for Kindle devices and the Kindle app, some sources estimate that more than 14 million titles are currently available17. However, as this number also includes a large number of titles without an ISBN, mainly self-published ebooks, we cannot simply add it to our estimate;
Apple Books. Exact numbers are not available, and availability differs from country to country due to licensing deals. Despite this, Apple is known to offer millions of ebooks and audiobooks;
Google Books. According to some sources, Google Books offers more than 40 million ebooks in 50 languages18, including 10 million ebooks for free.19 Here we have the same issue: this number cannot be compared directly to the backlist estimate as we defined it;
Kobo. Kobo claims to have over 5 million ebooks and audiobooks available for reading directly on their e-readers and apps20. In some countries, they work together with local retailers (like Bol.com in the Netherlands or FNAC in France) to provide subscription services including lots of local content (in the case of Kobo Plus offered by Bol.com, about ‘hundreds of thousands’).
Segmentation
By category
To compare the different EU member state backlists, we needed to split the complete offer of ebooks into several categories, like fiction books, biographies, children's books, books on art, etc. Since we expect that different categories of publications may have different complexity and often specific accessibility issues, we wanted to make sure we have a good representation of ebook categories and genres in this study.
Since many different categorization methods are in use, we needed to standardise as much as possible. For this research, we chose the Thema categorization scheme, since it is increasingly becoming the international standard used by publishers and retailers. However, given that Thema is relatively new and not all ebooks have been assigned Thema codes yet, some mapping from older schemas had to be applied to make proper estimates (see the Categories subsection of the Approach section for details).
We observed that the distribution of ebooks per Thema code in the 5 markets for which we had complete data shows a strong disparity and does not allow for a European-level model. For example, in Germany, non-fiction represents about 86%, whereas in the Netherlands it is considerably less: 58%. Fiction ebooks (Thema code F) are the most represented category in four of the five markets, with a share ranging from 14% (Spain, where Children’s, Teenage and Educational leads with 18%) to 43% (the Netherlands).
The following table and figures show the different shares by Thema code for the 5 countries that provided detailed data.
Thema code | France (%) | Germany (%) | Italy (%) | Spain (%) | Netherlands (%) |
---|---|---|---|---|---|
A: The Arts | 1,74 | 1,96 | 4,40 | 3,71 | 1,31 |
C: Language and Linguistics | 0,44 | 2,60 | 0,95 | 2,09 | 0,73 |
D: Biography, Literature and Literary studies | 19,93 | 5,98 | 10,55 | 8,36 | 5,76 |
F: Fiction and Related items | 20,24 | 14,81 | 37,87 | 13,98 | 42,82 |
G: Reference, Information and Interdisciplinary subjects | 0,00 | 4,87 | 0,25 | 1,02 | 0,54 |
J: Society and Social Sciences | 9,70 | 9,02 | 8,97 | 12,63 | 5,95 |
K: Economics, Finance, Business and Management | 2,44 | 8,91 | 3,72 | 4,35 | 5,86 |
L: Law | 0,77 | 4,50 | 2,57 | 5,30 | 5,06 |
M: Medicine and Nursing | 0,34 | 6,29 | 1,66 | 7,96 | 1,35 |
N: History and Archaeology | 4,51 | 3,22 | 3,31 | 4,32 | 4,15 |
P: Mathematics and Science | 0,87 | 7,08 | 1,10 | 2,43 | 0,87 |
Q: Philosophy and Religion | 2,67 | 4,75 | 5,70 | 5,94 | 5,51 |
R: Earth Sciences, Geography, Environment, Planning | 0,54 | 1,97 | 0,54 | 1,10 | 0,17 |
S: Sports and Active outdoor recreation | 0,00 | 0,58 | 0,70 | 0,77 | 1,04 |
T: Technology, Engineering, Agriculture, Industrial processes | 0,48 | 5,00 | 0,76 | 1,98 | 0,18 |
U: Computing and Information Technology | 0,25 | 3,65 | 0,67 | 0,93 | 0,90 |
V: Health, Relationships and Personal development | 0,00 | 3,26 | 6,65 | 2,42 | 2,58 |
W: Lifestyle, Hobbies and Leisure | 6,78 | 3,30 | 4,04 | 2,02 | 3,71 |
X: Graphic novels, Comic books, Manga, Cartoons | 9,74 | 0,96 | 1,47 | 0,62 | 0,08 |
Y: Children’s, Teenage and Educational | 8,19 | 3,41 | 4,13 | 18,07 | 9,37 |
Unknown | 10,37 | 3,86 | 0,00 | 0,00 | 2,08 |
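A quick way to check which Thema code leads in each market is sketched below. Only three rows of the table are copied (D, F and Y), which is enough here because in every market the largest share in the table belongs to one of these rows; the dictionary layout and function name are our own.

```python
# Shares in % copied from the table above (decimal commas written as points).
shares = {
    "D": {"France": 19.93, "Germany": 5.98, "Italy": 10.55, "Spain": 8.36, "Netherlands": 5.76},
    "F": {"France": 20.24, "Germany": 14.81, "Italy": 37.87, "Spain": 13.98, "Netherlands": 42.82},
    "Y": {"France": 8.19, "Germany": 3.41, "Italy": 4.13, "Spain": 18.07, "Netherlands": 9.37},
}
markets = list(shares["F"])


def leading_code(market):
    """Thema code with the largest share in the given market."""
    return max(shares, key=lambda code: shares[code][market])


leaders = {market: leading_code(market) for market in markets}
fiction_low = min(markets, key=lambda market: shares["F"][market])
fiction_high = max(markets, key=lambda market: shares["F"][market])
print(leaders)
print(fiction_low, fiction_high)
```

With the figures in the table, F leads in France, Germany, Italy and the Netherlands, while Y leads in Spain; the fiction share is lowest in Spain and highest in the Netherlands.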
By format
We managed to obtain detailed data on digital formats of ebooks on the market only for 5 key markets: France, Germany, Italy, the Netherlands and Spain. We retained only mainstream formats21: EPUB and PDF.
The segmentation of the market by format shows very diverse situations, where the German backlist has only 3% of EPUB3 files, but 60% of PDF files, while France and Italy reach nearly 40% of EPUB3 files, versus less than 25% of PDF files.
This disparity does not allow us to make assumptions about the European market as a whole. Consequently, we did not integrate any format-related query in our wishlist for files to be collected for the remediation tests.
The distribution of formats will have to be studied in more detail at the national and publisher level in order to refine the remediation cost estimate.
Market | % of EPUB2 | % of EPUB3 | % of PDF | % of other formats (e.g. HTML, apps) |
---|---|---|---|---|
France22 | 12 | 38 | 22 | 18 |
Germany | 34 | 3 | 60 | 3 |
Italy | 36 | 40 | 23 | 1 |
Netherlands | 75 | 10 | 15 | 0 |
Spain | 20 | 15 | 40 | 25 |
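A simple consistency check on this table is to verify that each row adds up to 100%. The sketch below copies the figures from the table; the variable names are ours.

```python
# Format shares in % per market, copied from the table above.
format_shares = {
    "France": {"EPUB2": 12, "EPUB3": 38, "PDF": 22, "other": 18},
    "Germany": {"EPUB2": 34, "EPUB3": 3, "PDF": 60, "other": 3},
    "Italy": {"EPUB2": 36, "EPUB3": 40, "PDF": 23, "other": 1},
    "Netherlands": {"EPUB2": 75, "EPUB3": 10, "PDF": 15, "other": 0},
    "Spain": {"EPUB2": 20, "EPUB3": 15, "PDF": 40, "other": 25},
}

totals = {market: sum(row.values()) for market, row in format_shares.items()}
# Percentage points not covered by the four columns, per market.
missing = {market: 100 - total for market, total in totals.items() if total != 100}
print(missing)
```

Only the French row falls short, by 10 points, which matches the footnote on bundle sales attached to that row.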
By year
We captured the evolution of distributed file formats by year since 2012 for the 5 key markets. The disparity observed is comparable to the one already described for the share of titles per distribution format in 2022. Some countries show a linear evolution that follows the evolution of formats, while others seem to produce the same ebook formats in 2022 as in 2012. This difference will affect remediation activities. In recent years EPUB3, the latest version of the EPUB format, has become the format of choice for ebook production in general and for accessible ebooks in particular. Countries where EPUB3 has not yet been adopted will face a technological debt on top of the remediation effort needed to make backlist ebooks properly accessible and therefore keep them on the market.
Market | EPUB2, change in percentage points (2012 / 2022) | EPUB3, change in percentage points (2012 / 2022) | PDF, change in percentage points (2012 / 2022) |
---|---|---|---|
France | -11 (from 23% to 12%) | +13 (from 25% to 38%) | -11 (from 33% to 22%) |
Germany | -1 (from 35% to 34%) | +3 (from 0% to 3%) | -2 (from 62% to 60%) |
Italy | -12 (from 48% to 36%) | +34 (from 6% to 40%) | -22 (from 45% to 23%) |
Netherlands | -8 (from 83% to 75%) | +9 (from 1% to 10%) | -1 (from 16% to 15%) |
Spain | Missing data | Missing data | Missing data |
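The leading figure in each cell is a difference in percentage points, not a relative change. A minimal sketch of the calculation, using the shares from the table (Spain is left out because its data are missing):

```python
# (share in 2012, share in 2022) in %, copied from the table above.
shares_2012_2022 = {
    "France": {"EPUB2": (23, 12), "EPUB3": (25, 38), "PDF": (33, 22)},
    "Germany": {"EPUB2": (35, 34), "EPUB3": (0, 3), "PDF": (62, 60)},
    "Italy": {"EPUB2": (48, 36), "EPUB3": (6, 40), "PDF": (45, 23)},
    "Netherlands": {"EPUB2": (83, 75), "EPUB3": (1, 10), "PDF": (16, 15)},
}


def point_change(pair):
    """Difference in percentage points between the 2022 and 2012 shares."""
    start, end = pair
    return end - start


changes = {
    market: {fmt: point_change(pair) for fmt, pair in formats.items()}
    for market, formats in shares_2012_2022.items()
}
for market, row in changes.items():
    print(market, row)
```

Percentage points are used because a relative change is undefined when the starting share is 0%, as for EPUB3 in Germany.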
Insights
From the data collected we learn that the European backlist is not homogeneous between countries and therefore cannot be averaged or addressed in a uniform way. Even grouping countries by the size of their national backlist is not sufficient to separate different needs.
We also learn that developments in digital publishing are slow. Newer formats (like EPUB3) are not adopted quickly, and newer possibilities (like fixed layout for EPUB) do not cause formats like PDF to be replaced quickly.
A blind spot is caused by the fact that large international platforms offer a lot of content to the European market. We do not know how much of that content is exclusive (like self-published titles), and therefore the remediation needs for those titles cannot be studied.
Outcomes
Based on the backlist overview, we decided what types of titles we needed to collect. As previously mentioned, different categories often need different types of remediation, and generalising between them would give an incorrect view. As a result, we decided that it was important to get titles of the following types:
fiction and other ‘mostly text’ publications (including non-illustrated children’s books and biographies): we expect those to be mainly reflowable EPUB files with structure issues to address. Within these, we expect older publications to pose different challenges from more recent ones;
illustrated children’s books and graphic novels: we expect those files to be fixed-layout ebooks with strong graphical accessibility challenges and therefore a need for textual alternatives;
non-fiction publications: we expect to find complex elements in those files like tables and visual resources;
complex layout publications: we expect those files to be mainly in PDF format, with a strong tie between form and content and remediation needs that include major changes.
As a per-format request was not possible, we built a per-category wish list23 divided into 4 time periods:
before 2011, when all EPUB files are EPUB2;
2011-2018, when EPUB3 was available but accessibility principles were not yet well understood;
2018-2021, when some publishers started to produce born-accessible files;
2022 and after to represent today’s state of publishing.
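Assigning a title to one of these periods can be sketched as a small Python function. Since the year 2018 appears in two periods, this sketch assumes that 2018 belongs to the later one; that boundary choice is ours and is not stated in the wish list.

```python
def period_for(year):
    """Map a production year to one of the four wish-list periods.

    Assumption of this sketch: 2018 is assigned to the 2018-2021 period.
    """
    if year < 2011:
        return "before 2011"
    if year < 2018:
        return "2011-2018"
    if year < 2022:
        return "2018-2021"
    return "2022 and after"


for year in (2009, 2011, 2018, 2021, 2023):
    print(year, period_for(year))
```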
For data collection and analysis purposes, actors of the book value chain need a way to describe and categorise publications. That’s what we call book categorization schemas.↩︎
By manifestation we mean a physical or digital embodiment of a work, as defined in bibliographic record standards (https://www.loc.gov/marc/marbi/2009/2009-01-3.html ). For example, in the digital world, an EPUB and a PDF of the same title are counted as two different manifestations.↩︎
We define a distributor as an entity that collects and distributes files to selling platforms. The distributor also establishes or collects metadata and sends them to aggregators.↩︎
We define an aggregator as an entity that collects established metadata and distributes them to selling platforms.↩︎
We define a reseller as the entity in direct contact with the customer.↩︎
Thema – the subject category scheme for a global book trade version 1.5 , EDItEUR, 2022. Available at https://ns.editeur.org/thema/en↩︎
Thema mappings , EDItEUR, 2023. Available at https://www.editeur.org/151/Thema/#Mappings↩︎
Commission de Liaison Interprofessionnelle du Livre, the French standard for book classification. Available at https://clil.centprod.com/listeActive.html↩︎
Extensible Markup Language (XML) is a markup language that provides rules to define any data. It is standardised by the W3C and can be found at https://www.w3.org/TR/xml/↩︎
A list of Standard PDF Tags is available at https://helpx.adobe.com/acrobat/using/editing-document-structure-content-tags.html↩︎
Available at https://www.iso.org/standard/64599.html↩︎
DAISY Format . Available at https://daisy.org/activities/standards/daisy/↩︎
What is an EPUB file? Available at https://www.edrlab.org/open-standards/epub/↩︎
European Book Publishing Statistics 2021 , FEP 2022. Available at https://fep-fee.eu/European-Book-Publishing-1467↩︎
European Book Market Statistics 2021-2022 . FEP, 2022. Available at https://fep-fee.eu/-Publications-↩︎
How Many Ebooks Are There In The Kindle Store On Amazon? Just Publishing Advice, 2023. Available at https://justpublishingadvice.com/how-many-kindle-ebooks-are-there/↩︎
How the Google Books team moved 90,000 books across a continent . Ari Mariani, 2023. Available at https://blog.google/products/search/google-books-library-project/↩︎
About Google Books – Free books in Google Books . Available at https://www.google.com/intl/en/googlebooks/about/free_books.html [Consulted in May 2023]↩︎
About Kobo. Available at https://www.kobo.com/us/en/p/aboutkobo↩︎
An overview of ebook formats is given in the annex Ebooks files formats.↩︎
The total for France does not add up to 100% because 10% of the titles are sold in bundles including both PDF and EPUB3 formats.↩︎
Available as annex to this document: ABELab sample collection wishlist.↩︎