Graduate School of Engineering, Tohoku University
6-6-5 Aramaki aza Aoba, Aoba-ku, Sendai, 980-8579 Japan
[email protected]

Embedding Digital Signature into CSV Files
Using Data Hiding

Akinori Ito

Abstract

Open data is an important basis for open science and evidence-based policymaking. Governments of many countries disclose government-related statistics as open data. Some of these data are provided as CSV files. However, since CSV files are plain text, we cannot ensure the integrity of a downloaded CSV file. A popular way to prove the data’s integrity is a digital signature; however, it is difficult to embed a signature into a CSV file. This paper proposes a method for embedding a digital signature into a CSV file using a data hiding technique. The proposed method exploits a redundancy of the CSV format related to the use of double quotes. The experiment showed that we could embed a 512-bit signature into actual open data CSV files.

Keywords:

Data hiding, Digital signature, CSV file, Open data

1 Introduction

Open data is defined as data that is freely accessible, usable, reusable, and redistributable by anyone, subject only to the requirement of attribution and share-alike principles. This concept is integral to the broader movement towards openness and sharing, encompassing open content, knowledge, and resources.

The primary characteristics of open data include:

• Accessibility: Data should be available to all individuals without discrimination [10].
• Reusability: Data should be provided in a convenient and modifiable form, facilitating easy reuse and repurposing [24].
• Redistribution: Individuals must be free to redistribute the original or modified versions of the data.
• Attribution: While the data can be used freely, certain open data licenses require users to credit the source of the data.

Another aspect of open data is the FAIR principle [21, 7], where FAIR stands for Findable, Accessible, Interoperable, and Reusable. Here, “interoperability” refers to the ability of data to be integrated with other data and to work with various applications or workflows for analysis, storage, and processing.

Open data is frequently associated with government data [15], as numerous governments globally have committed to making their data openly available to enhance transparency, civic engagement, and innovation. The application of open data has been investigated in various contexts, demonstrating its potential to improve data provision and policymaking [9], contribute to sustainable development [6], enhance transversal skills and global citizenship in education [3], and support open science [14].

Several studies have examined the data formats utilized in open data initiatives. Oliveira et al. [16] identified that Brazilian OGD portals predominantly use CSV, among other formats. Washington and Morar [20] discussed the impact of file formats on collaboration potential, noting that data.gov formats have limited potential but are accessible to users with diverse skills. Yi [23] emphasized the importance of data quality, particularly regarding data completeness and machine-readability, and suggested guidelines for data format selection and data completeness enhancement.

The Comma-Separated Values (CSV) format [19] has been widely used as an interoperable data format. A distinct feature of the CSV format compared to other data formats, such as JSON or XML, is its simplicity and ease of use for tabular data representation. CSV files store data in plain text, where each line corresponds to a data record, and each record consists of fields separated by commas. This straightforward structure makes CSV files highly accessible and easy to manipulate using basic text editors and simple programming scripts. For this reason, many open data files are provided as CSV files [13].

The CSV format is a simple, plain text format ideal for storing and exchanging tabular data due to its lightweight nature and broad compatibility across platforms. CSV files are limited to raw data storage without additional functionalities, unlike the more complex Excel format, which supports advanced features like formatting, formulas, and data visualization. This simplicity ensures smaller file sizes and faster processing, making CSV suitable for large datasets. In contrast, Excel’s robust data manipulation capabilities make it better suited for comprehensive data analysis and visualization tasks. However, it may suffer from larger file sizes and compatibility issues outside of Microsoft Excel.

Because CSV files are interoperable, several projects have tried to handle them for further data use. Alkarkoukly et al. [2] present a reference implementation for transforming CSV data into FHIR resources using open-source tools, addressing interoperability challenges in healthcare. Mahmud et al. [12] propose an approach to generate annotated tables from CSV files, improving semantic structure and reusability. Christodoulakis et al. [4] introduce Pytheas, a method for automatically discovering tables within CSV files, outperforming existing approaches in precision and recall. Mahmud et al. [11] developed a semantic approach to convert CSV data into RDF format and publish it as Linked Open Data on the web, following W3C recommendations. These studies collectively demonstrate the versatility of CSV files and propose solutions to enhance their interoperability, semantic richness, and integration with web technologies, addressing challenges in various domains, including healthcare, open government data, and semantic web applications.

A problem with the CSV file format is that it does not have a mechanism to ensure data security. Since the open data provided by governments are the basis of public research and surveys, distributing tampered data can considerably threaten society. The digital signature [8] is a framework to ensure authenticity and integrity, commonly used for document workflow [17]. When the data formats for Microsoft Excel (such as xls or xlsx) are used, we can sign the file using the digital signature technique [1]. However, since a CSV file is simple plain text with no container structure, the format offers no place to store a digital signature. Therefore, this paper proposes a method to embed a digital signature into a CSV file using a data hiding technique.

2 Previous Work

2.1 CSV file format

CSV is a file format widely used for tabular data. It is a plain-text format in which each line is a record and the items in a record are separated by commas. Fig. 1 shows an example of a CSV file that has one header line and two records, each with four fields. Each field, such as H1 or Hello, may be enclosed by double quotes.

H1,H2,H3,N1

"Hello","my","world",3.0

"Nice","to","meet",8.0

Figure 1: An example of a CSV file.

In actual open data, the format of CSV files is not necessarily consistent. According to Mitlöhner et al. [13], only 88% of open data CSV files use commas as field separators, and 10% use semicolons. Therefore, an attempt was made to standardize the CSV format as an RFC [19]. Fig. 2 shows the syntax description of a CSV file proposed in RFC 4180, written in ABNF [5]. According to RFC 4180, a CSV file is composed of one or more records delimited by CRLFs. The file may have a header line, but there is no way to determine from the file itself whether the first line is a header line. A record (and a header) has multiple fields, where a field is either an escaped or a non-escaped string. An escaped string is a string enclosed by double quotes, where the string inside may contain commas, CRs, LFs, and double quotes (a double quote is escaped by one more double quote). A non-escaped string can contain any character besides commas, double quotes, CRs, and LFs.

file = [header CRLF] record *(CRLF record) [CRLF]

header = name *(COMMA name)

record = field *(COMMA field)

name = field

field = (escaped / non-escaped)

escaped = DQ *(TEXTDATA / COMMA / CR / LF / DQ DQ) DQ

non-escaped = *TEXTDATA

COMMA = %x2C

CR = %x0D

DQ = %x22

LF = %x0A

CRLF = CR LF

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

Figure 2: The syntax description of the CSV file format [19].

2.2 Digital signature

A digital signature is a cryptographic mechanism that guarantees the authenticity and integrity of a digital document [18, 8]. It combines mathematical algorithms and cryptographic keys to achieve a high level of security.

The core principle behind a digital signature involves using a public-key cryptography system [18]. The signer possesses a private key, which is kept confidential, and a corresponding public key, which can be freely distributed. When a document is signed, a mathematical hash function is applied, generating a unique digital fingerprint. This fingerprint (a hash value) is then encrypted with the signer’s private key, creating the digital signature.

The recipient of the signed document can verify its authenticity using the signer’s public key. The recipient obtains the original hash value by decrypting the signature with the public key. They can then independently calculate the hash of the received document and compare it to the decrypted value. If both values match, it confirms that the document originated from the claimed signer and has not been tampered with since the signature was applied.
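The hash-then-sign procedure described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this paper: the key is a toy textbook-RSA key built from two Mersenne primes (an assumption made only so the example is self-contained; it is far too small and unpadded to be secure), and the names `digest`, `sign`, and `verify` are hypothetical.

```python
import hashlib

# Toy RSA parameters (illustrative assumption only; NOT secure).
P, Q = 2**61 - 1, 2**89 - 1            # two Mersenne primes
N = P * Q                              # public modulus
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (Python 3.8+)


def digest(document: str) -> int:
    """Hash the document with SHA-1 and reduce the value below the modulus."""
    h = hashlib.sha1(document.encode("utf-8")).digest()
    return int.from_bytes(h, "big") % N


def sign(document: str) -> int:
    """Signer side: transform the hash value with the private key."""
    return pow(digest(document), D, N)


def verify(document: str, signature: int) -> bool:
    """Recipient side: recover the hash with the public key and compare."""
    return pow(signature, E, N) == digest(document)


original = "H1,H2\r\n1,2"
sig = sign(original)
print(verify(original, sig))          # the untouched document
print(verify("H1,H2\r\n1,3", sig))    # a document altered after signing
```

A production system would rely on a vetted cryptographic library with a standard padding scheme rather than raw modular exponentiation.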

Digital signatures streamline workflows by enabling secure electronic document signing and verification, eliminating the need for physical documents and manual processing [17]. They are also used to maintain the authority and integrity of open data. Wong et al. proposed a system architecture for signing open data [22]. Their proposal includes a public key infrastructure and HTTP servers that provide the data. However, they did not discuss how the signature is attached to the data file; it may be assumed that the provided data is in a format that can carry a signature, such as xlsx.

Unfortunately, a CSV file cannot include a signature because it is a plain text file with no inner structure. Therefore, even if we calculate a signature for a CSV file, we need to deliver it separately, which may reduce the value of the signature.

3 Proposed Method

3.1 Data hiding in a CSV file

The proposed method embeds arbitrary binary data into a CSV file. To do that, we exploit the redundancy of quoting a field. As shown in Fig.2, a field can be either an escaped string or a non-escaped string when the string does not involve special characters (double quote, comma, CR, and LF).

Consider a CSV file that has $N$ rows and $M$ columns. (The data in a CSV file do not need to form a single table. We assume a rectangular table here for simplicity, but the following discussion also applies to non-rectangular data.)

X=\left[\begin{array}[]{ccc}x_{11}&\cdots&x_{1M}\\ \vdots&&\vdots\\ x_{N1}&\cdots&x_{NM}\\ \end{array}\right]

(1)

Here, $x_{ij}$ is a character string, including a null string, without enclosing double quotes. Then we regard all the fields as a sequence, $x(1),\ldots,x(K)$ , where $x_{ij}=x((i-1)M+j)$ and $K=NM$ . We consider a function to determine whether a field needs to be escaped:

\mathrm{noesc}(x)=\left\{\begin{array}[]{ll}0&\quad\mathrm{if}~{}x~{}\textrm{contains any of comma, double quote, CR, or LF}\\ 1&\quad\mathrm{otherwise}\end{array}\right.

(2)

If $\mathrm{noesc}(x)=0$ , the string $x$ must be enclosed by double quotes. Otherwise, $x$ may be written either with or without enclosing double quotes. Therefore, each field where $\mathrm{noesc}(x)=1$ can carry a one-bit payload. The total payload can be calculated as $P(K)$ , where

P(k)=\sum_{i=1}^{k}\mathrm{noesc}(x(i)).

(3)

$P(k)$ is the number of bits we can embed into $x(1),\ldots,x(k).$
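As a minimal sketch of Eqs. (2) and (3), the following Python functions compute $\mathrm{noesc}(x)$ and the payload of a field sequence. The function names and the use of a flat list for $x(1),\ldots,x(K)$ are assumptions made for illustration; the sample fields are those of the example in Fig. 3.

```python
SPECIAL = (",", '"', "\r", "\n")  # characters that force a field to be escaped


def noesc(x: str) -> int:
    """Eq. (2): 1 if the field may appear without double quotes, else 0."""
    return 0 if any(c in x for c in SPECIAL) else 1


def payload(fields) -> int:
    """Eq. (3): number of bits that can be embedded into the given fields."""
    return sum(noesc(x) for x in fields)


fields = ["H1", "H2", "N1",
          "Hello,world", "green", "3.0",
          "Nice to meet you", "apple", "8.0"]
print(payload(fields))      # 8: every field except "Hello,world" carries a bit
print(payload(fields[:4]))  # 3: P(4), the fourth field contains a comma
```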

Next, we define two functions to calculate the string representation of the data. The first is to put the double quotes in only the needed fields.

J_{\mathrm{simp}}(X)=\mathop{\mathrm{Con}_{\mathrm{\tiny CRLF}}}_{i=1}^{N}\left(\mathop{\mathrm{Con}_{,}}_{j=1}^{M}\mathrm{esc}_{0}(x_{ij})\right)

(4)

Here, $\mathrm{Con}_{\alpha}s_{i}$ means concatenation of strings $s_{1},s_{2},\ldots$ using delimiter $\alpha$ , as follows.

\mathop{\mathrm{Con}_{\alpha}}_{i=1}^{M}s_{i}=s_{1}\circ\alpha\circ s_{2}\circ\alpha\circ\cdots\circ\alpha\circ s_{M},

(5)

where $\circ$ means the string concatenation operator. $\mathrm{esc}_{0}$ is an escape function defined as follows.

\mathrm{esc}_{0}(x)=\left\{\begin{array}[]{cc}\texttt{"}\circ x\circ\texttt{"}&\textrm{if noesc}(x)=0\\ x&\textrm{if noesc}(x)=1\\ \end{array}\right.

(6)
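The canonical serialization of Eqs. (4) to (6) can be sketched in Python as below. The sketch assumes that $X$ is given as a list of rows of strings and that any double quote inside a field is already stored in its doubled form, since Eq. (6) only adds the enclosing quotes; the function names are hypothetical.

```python
def noesc(x: str) -> int:
    """Eq. (2): 1 if the field needs no double quotes, else 0."""
    return 0 if any(c in x for c in (",", '"', "\r", "\n")) else 1


def esc0(x: str) -> str:
    """Eq. (6): enclose the field in double quotes only when required."""
    return '"' + x + '"' if noesc(x) == 0 else x


def j_simp(X) -> str:
    """Eq. (4): join fields with commas and records with CRLF."""
    return "\r\n".join(",".join(esc0(x) for x in row) for row in X)


X = [["H1", "H2"], ["Hello,world", "3.0"]]
print(repr(j_simp(X)))  # 'H1,H2\r\n"Hello,world",3.0'
```

Because the output depends only on the field contents, two CSV files that differ only in optional quoting map to the same string.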

The second function is to embed data into the CSV string. First, we consider the data to be embedded as follows.

\mathbf{b}=b_{1},\ldots,b_{P(K)},\quad b_{k}\in\{0,1\}.

(7)

Then, the function is defined as follows.

J_{\mathrm{emb}}(X,\mathbf{b})=\mathop{\mathrm{Con}_{\mathrm{\tiny CRLF}}}_{i=1}^{N}\left(\mathop{\mathrm{Con}_{,}}_{j=1}^{M}\mathrm{esc}_{1}(x_{ij},b_{k})\right),\quad\mathrm{where}~{}k=P((i-1)M+j)

(8)

\mathrm{esc}_{1}(x,b)=\left\{\begin{array}[]{ll}\texttt{"}\circ x\circ\texttt{"}&\quad\mathrm{if}~{}\mathrm{noesc}(x)=0~{}\mathrm{or}~{}b=1\\ x&\quad\mathrm{otherwise}\end{array}\right.

(9)

The function $\mathrm{esc}_{1}(x,b)$ encloses $x$ with double quotes when $x$ contains special characters or when the bit to be embedded is 1. Fig. 3 shows an example of a CSV file with a hidden message. This CSV file has nine fields (three rows, three columns), but the first field of the second row ("Hello,world") cannot be used as a payload because it contains a comma.

"H1",H2,"N1"

"Hello,world","green",3.0

Nice to meet you,"apple","8.0"

Figure 3: An example of a CSV file with embedded information. The payload is 8 bits, and the message 10110011 is embedded.
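The embedding function of Eqs. (8) and (9) can be sketched in Python as follows; the example reproduces the file in Fig. 3. The function names are hypothetical, and padding a short message with zeros is an assumption of this sketch.

```python
def noesc(x: str) -> int:
    """Eq. (2): 1 if the field needs no double quotes, else 0."""
    return 0 if any(c in x for c in (",", '"', "\r", "\n")) else 1


def j_emb(X, bits) -> str:
    """Eqs. (8)-(9): quote a field if it must be escaped or if its bit is 1."""
    it = iter(bits)
    lines = []
    for row in X:
        out = []
        for x in row:
            if noesc(x) == 0:
                out.append('"' + x + '"')   # forced quoting, consumes no bit
            elif next(it, 0) == 1:          # missing bits are treated as 0
                out.append('"' + x + '"')
            else:
                out.append(x)
        lines.append(",".join(out))
    return "\r\n".join(lines)


X = [["H1", "H2", "N1"],
     ["Hello,world", "green", "3.0"],
     ["Nice to meet you", "apple", "8.0"]]
print(j_emb(X, [1, 0, 1, 1, 0, 0, 1, 1]))
# "H1",H2,"N1"
# "Hello,world","green",3.0
# Nice to meet you,"apple","8.0"
```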

3.2 Extraction of contents and messages

Consider a CSV file in which each field may or may not be enclosed by double quotes. Then, consider a matrix made by splitting the input CSV file at the CRLFs and commas that lie outside double quotes.

Y=\left[\begin{array}[]{ccc}y_{11}&\cdots&y_{1M}\\ \vdots&&\vdots\\ y_{N1}&\cdots&y_{NM}\\ \end{array}\right]

(10)

Here, $y_{ij}$ may or may not be enclosed by double quotes. For example, when the input CSV file contains

"H1",H1,"N1" CRLF "Hello,world","green",3.0,

then the comma-split data is

Y=\left[\begin{array}[]{ccc}\texttt{"H1"}&\texttt{H1}&\texttt{"N1"}\\ \texttt{"Hello,world"}&\texttt{"green"}&\texttt{3.0}\\ \end{array}\right].

(11)
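A standard CSV reader discards the enclosing double quotes, so obtaining $Y$ requires a splitter that keeps them. The following Python sketch shows one possible way to do this with a small state machine; the function name `split_csv` is hypothetical, and the sketch assumes well-formed RFC 4180 input with CRLF record delimiters.

```python
def split_csv(text: str):
    """Split at CRLFs and commas outside double quotes, keeping the quotes."""
    rows, row, field = [], [], ""
    in_quotes = False
    i = 0
    while i < len(text):
        c = text[i]
        if c == "," and not in_quotes:
            row.append(field)
            field = ""
        elif text[i:i + 2] == "\r\n" and not in_quotes:
            row.append(field)
            rows.append(row)
            row, field = [], ""
            i += 1                      # skip the LF of the CRLF pair
        else:
            if c == '"':
                in_quotes = not in_quotes   # a doubled quote toggles twice
            field += c
        i += 1
    if field or row:                    # last record without trailing CRLF
        row.append(field)
        rows.append(row)
    return rows


Y = split_csv('"H1",H1,"N1"\r\n"Hello,world","green",3.0')
print(Y)
# [['"H1"', 'H1', '"N1"'], ['"Hello,world"', '"green"', '3.0']]
```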

Then, we define two functions, $\mathrm{strip}(y)$ and $\mathrm{extract}(y)$ , as follows.

\mathrm{strip}(y)=\left\{\begin{array}[]{ll}x&\quad\mathrm{if}~{}y=\texttt{"}\circ x\circ\texttt{"}\\ y&\quad\mathrm{otherwise}\end{array}\right.

(12)

\mathrm{extract}(y)=\left\{\begin{array}[]{ll}0&\quad\textrm{if}~{}\mathrm{strip}(y)=y\\ 1&\quad\textrm{else if}~{}\mathrm{noesc}(\mathrm{strip}(y))=1\\ \epsilon&\quad\mathrm{otherwise}\end{array}\right.

(13)

Here, $\mathrm{strip}(y)$ is a function that trims the enclosing double quotes. The function $\mathrm{extract}(y)$ is a function to extract the hidden bit from a field, which returns one of 0, 1, and $\epsilon$ . The value $\epsilon$ means that the field does not contain a hidden bit.

Using $\mathrm{extract}(y)$ , we can extract the hidden message from $Y$ by concatenating all of $\mathrm{extract}(y(1)),\ldots,\mathrm{extract}(y(K))$ and removing all $\epsilon$ from the sequence. We denote this sequence as $\mathrm{extract}(Y).$
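Eqs. (12) and (13) can be sketched in Python as below, using `None` to represent $\epsilon$. The function names and the nested-list representation of $Y$ are assumptions made for illustration; the example matrix corresponds to the file in Fig. 3.

```python
def noesc(x: str) -> int:
    """Eq. (2): 1 if the field needs no double quotes, else 0."""
    return 0 if any(c in x for c in (",", '"', "\r", "\n")) else 1


def strip(y: str) -> str:
    """Eq. (12): remove one pair of enclosing double quotes, if present."""
    if len(y) >= 2 and y[0] == '"' and y[-1] == '"':
        return y[1:-1]
    return y


def extract(y: str):
    """Eq. (13): hidden bit of one field, or None (epsilon) if it has none."""
    if strip(y) == y:
        return 0
    if noesc(strip(y)) == 1:
        return 1
    return None


def extract_all(Y):
    """Concatenate the bits of all fields in row order, dropping epsilon."""
    bits = (extract(y) for row in Y for y in row)
    return [b for b in bits if b is not None]


Y = [['"H1"', 'H2', '"N1"'],
     ['"Hello,world"', '"green"', '3.0'],
     ['Nice to meet you', '"apple"', '"8.0"']]
print(extract_all(Y))  # [1, 0, 1, 1, 0, 0, 1, 1]
```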

3.3 Insertion and validation of digital signature

First, we consider stripping all unnecessary double quotes from the CSV file:

\mathrm{strip}(Y)=\left[\begin{array}[]{ccc}\mathrm{strip}(y_{11})&\cdots&\mathrm{strip}(y_{1M})\\ \vdots&&\vdots\\ \mathrm{strip}(y_{N1})&\cdots&\mathrm{strip}(y_{NM})\\ \end{array}\right]

(14)

Applying strip to $Y$ removes all enclosing double quotes for all fields.

When calculating a signature, we need a pair of the public and private keys, denoted as $\kappa_{\mathrm{\scriptsize pub}}$ and $\kappa_{\mathrm{\scriptsize priv}}$ , respectively.

Then, we insert a signature and validate it as follows. Let a hash function be $H(s)$ that returns a fixed-sized bit sequence, where $s$ is a string. Let an encryption function be $E(\mathbf{b},\kappa)$ , which receives a bit sequence $\mathbf{b}$ and a key $\kappa$ , then returns a bit sequence with the same length as $\mathbf{b}.$ When we have data $Y$ from a CSV file, we calculate the signature $\mathbf{s}$ as

\mathbf{s}=E(H(J_{\mathrm{simp}}(\mathrm{strip}(Y))),\kappa_{\mathrm{priv}}).

(15)

Then, we generate the embedded CSV file as

CSV=J_{\mathrm{emb}}(\mathrm{strip}(Y),\mathbf{s})

(16)

Next, when we receive a signed CSV file, we calculate a matrix $Y^{\prime}$ by splitting the CSV file at CRLFs and commas. Then, we extract the embedded signature as

\mathbf{s}^{\prime}=\mathrm{extract}(Y^{\prime}).

(17)

Now we decrypt $\mathbf{s}^{\prime}$ and compare it to the hash computed from the stripped CSV file. If

E(\mathbf{s}^{\prime},\kappa_{\mathrm{pub}})=H(J_{\mathrm{simp}}(\mathrm{strip}(Y^{\prime}))),

(18)

then the CSV file is validated.
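The whole procedure of Eqs. (15) to (18) can be sketched end to end in Python as follows. This is an illustrative sketch under stated assumptions, not the R implementation used in the experiment: the signature primitive is unpadded textbook RSA with a toy 150-bit modulus built from two Mersenne primes (not secure), the hash is reduced below that modulus, and all function names as well as the generated test table are hypothetical.

```python
import hashlib

# Toy RSA key from two Mersenne primes: an illustrative assumption, NOT secure.
P, Q = 2**61 - 1, 2**89 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (Python 3.8+)
SIG_BITS = N.bit_length()              # signature length in bits

def noesc(x):
    return 0 if any(c in x for c in ',"\r\n') else 1

def quote(x):
    return '"' + x + '"'

def strip(y):
    return y[1:-1] if len(y) >= 2 and y[0] == y[-1] == '"' else y

def extract(y):
    if strip(y) == y:
        return 0
    return 1 if noesc(strip(y)) == 1 else None

def split_csv(text):
    """Split at CRLFs and commas outside double quotes, keeping the quotes."""
    rows, row, field, quoted, i = [], [], "", False, 0
    while i < len(text):
        c = text[i]
        if c == "," and not quoted:
            row.append(field)
            field = ""
        elif text[i:i + 2] == "\r\n" and not quoted:
            rows.append(row + [field])
            row, field, i = [], "", i + 1
        else:
            quoted ^= (c == '"')
            field += c
        i += 1
    if field or row:
        rows.append(row + [field])
    return rows

def j_simp(X):
    return "\r\n".join(",".join(x if noesc(x) else quote(x) for x in r) for r in X)

def j_emb(X, bits):
    it = iter(bits)                    # exhausted bits are zero-padded
    return "\r\n".join(
        ",".join(quote(x) if noesc(x) == 0 or next(it, 0) == 1 else x for x in r)
        for r in X)

def digest(X):
    h = hashlib.sha1(j_simp(X).encode("utf-8")).digest()
    return int.from_bytes(h, "big") % N

def sign_csv(text):
    X = [[strip(y) for y in r] for r in split_csv(text)]
    s = pow(digest(X), D, N)                                    # Eq. (15)
    bits = [(s >> i) & 1 for i in reversed(range(SIG_BITS))]
    return j_emb(X, bits)                                       # Eq. (16)

def validate_csv(text):
    Y = split_csv(text)
    bits = [b for r in Y for y in r if (b := extract(y)) is not None]  # Eq. (17)
    if len(bits) < SIG_BITS:
        return False                   # payload too small to hold a signature
    s = int("".join(map(str, bits[:SIG_BITS])), 2)
    X = [[strip(y) for y in r] for r in Y]
    return pow(s, E, N) == digest(X)                            # Eq. (18)

# A generated 20 x 8 table (160 payload bits) stands in for real open data.
plain = "\r\n".join(",".join(str(r * c) for c in range(8)) for r in range(20))
signed = sign_csv(plain)
print(validate_csv(signed))                 # signed file
print(validate_csv(plain))                  # quotes removed, contents unchanged
print(validate_csv(signed[:-3] + "134"))    # last field altered
```

Only the signed file should validate; removing the optional quotes destroys the embedded signature, and altering a field changes the hash.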

4 Experiment

I carried out a proof-of-concept experiment. The program was implemented in R 4.4.0 with the openssl package. Ten CSV files were downloaded from e-gov.go.jp and used for the experiment. All data were open data provided by the Japanese government. Table 1 shows the data. Note that we used only data with a payload larger than 512 bits because the signature size was 512 bits. We used an RSA key pair and the SHA1 hash function, which yielded a 512-bit signature. Since all payloads were larger than the signature size, we zero-padded the embedding bits to the payload length.

Table 1: Data used for the experiment. All data were downloaded from e-gov.go.jp.

Filename	File size (bytes)	Payload (bits)
081_AL_01s_2009.csv	1869	618
151_AB_01s_2006.csv	31081	5598
181_AB_01s_2009.csv	31911	5598
mc010000.csv	5864	1063
mc070000.csv	4781	821
mc110000.csv	12708	1764
s4-7.csv	7024	880
tk9003.csv	2164	581
tk9005.csv	3875	1181
tk9012.csv	3924	696

First, I confirmed that the embedded CSV files were validated correctly. As a result, all files were validated using the public key.

Next, I checked whether tampered files were detected. To do that, I opened the CSV files with LibreOffice and saved them again in CSV format. Although I did not change the contents of the files, loading and saving them removed the double quotes. As a result, none of the files were validated using the public key.

5 Conclusion

This paper proposed a new method for embedding digital signatures into CSV files. The proposed method uses a redundancy: We may or may not enclose the content of a field with double quotes as long as the content does not contain special characters.

There are still several problems in this framework, including the following points:

• Since the payload is limited, embedding a stronger signature, such as 1024 or 2048 bits, could be difficult.
• For the same reason, it could be difficult to embed other metadata, such as the signer, timestamp, and digital certificate.

These problems should be solved in future work.

References

[1] Description of digital signatures and code signing in workbooks in Excel. https://learn.microsoft.com/en-us/office/troubleshoot/excel/digital-signatures-code-signing, accessed: 2024-07-04
[2] Alkarkoukly, S., Kamal, M.M., Beyan, O.: Breaking Barriers for Interoperability: A Reference Implementation of CSV-FHIR Transformation Using Open-Source Tools, Caring is Sharing – Exploiting the Value in Data for Health and Innovation, vol. 302, pp. 43–47 (2023)
[3] Atenas, J., Havemann, L., Priego, E.: Open data as open educational resources: Towards transversal skills and global citizenship. Open Praxis 7(4), 377–389 (2015)
[4] Christodoulakis, C., Munson, E.B., Gabel, M., Brown, A.D., Miller, R.J.: Pytheas: pattern-based table discovery in CSV files. Proc. VLDB Endow. 13(12), 2075–2089 (jul 2020), https://doi.org/10.14778/3407790.3407810
[5] Crocker, D., Overell, P.: Augmented BNF for syntax specifications: ABNF. RFC 2234 (1997)
[6] Fasli, M., Owda, A.Y., Abbasi, T., Owda, M., Stergioulas, L., Neupane, B.: Open government data (OGD) framework for sustainable development. In: 2023 IEEE International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT). pp. 576–580. IEEE (2023)
[7] Jacobsen, A., de Miranda Azevedo, R., Juty, N., Batista, D., Coles, S., Cornet, R., Courtot, M., Crosas, M., Dumontier, M., Evelo, C.T., et al.: FAIR principles: interpretations and implementation considerations (2020)
[8] Lax, G., Buccafurri, F., Caminiti, G.: Digital document signing: Vulnerabilities and solutions. Information Security Journal: A Global Perspective 24(1-3), 1–14 (2015)
[9] van Loenen, B., Ubacht, J., Labots, W., Zuiderwijk, A.: Log file analytics for gaining insight into actual use of open data. In: Borges, V., Dias Rouco, J.C. (eds.) Proceedings of the 17th European Conference on Digital Government. pp. 238–246. Academic Conferences and Publishing International Limited, Lisbon (2017)
[10] Máchová, R., Hub, M., Lnenicka, M.: Usability evaluation of open data portals: Evaluating data discoverability, accessibility, and reusability from a stakeholders’ perspective. Aslib Journal of Information Management 70(3), 252–268 (2018)
[11] Mahmud, S.M.H., Hossin, M.A., Hasan, M.R., Jahan, H., Noori, S.R.H., Ahmed, M.R.: Publishing CSV data as linked data on the web. In: Singh, P.K., Panigrahi, B.K., Suryadevara, N.K., Sharma, S.K., Singh, A.P. (eds.) Proceedings of ICETIT 2019. pp. 805–817. Springer International Publishing, Cham (2020)
[12] Mahmud, S.M.H., Hossin, M.A., Jahan, H., Noori, S.R.H., Bhuiyan, T.: CSV-ANNOTATE: Generate annotated tables from CSV file. In: 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD). pp. 71–75 (2018)
[13] Mitlöhner, J., Neumaier, S., Umbrich, J., Polleres, A.: Characteristics of open data CSV files. In: 2016 2nd International Conference on Open and Big Data (OBD). pp. 72–79 (2016)
[14] Numajiri, H., Hayashi, T.: Analysis on open data as a foundation for data-driven research. Scientometrics (mar 2024), https://doi.org/10.1007/s11192-024-04956-x
[15] Okamoto, K.: What is being done with open government data? an exploratory analysis of public uses of new york city open data. Webology 13(1) (2016)
[16] Oliveira, M.I.S., de Oliveira, H.R., Oliveira, L.A., Lóscio, B.F.: Open government data portals analysis: The brazilian case. In: Proceedings of the 17th International Digital Government Research Conference on Digital Government Research. p. 415–424. dg.o ’16, Association for Computing Machinery, New York, NY, USA (2016), https://doi.org/10.1145/2912160.2912163
[17] Pop, F., Dobre, C., Popescu, D., Ciobanu, V., Cristea, V.: Digital certificate management for document workflows in e-government services. In: Electronic Government: 9th IFIP WG 8.5 International Conference, EGOV 2010, Lausanne, Switzerland, August 29-September 2, 2010. Proceedings 9. pp. 363–374. Springer (2010)
[18] Rivest, R.L., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM 21(2), 120–126 (1978)
[19] Shafranovich, Y.: Common format and MIME type for comma-separated values (CSV) files. RFC 4180 (2005)
[20] Washington, A.L., Morar, D.: Open government data and file formats: Constraints on collaboration. In: Proceedings of the 18th Annual International Conference on Digital Government Research. p. 155–159. dg.o ’17, Association for Computing Machinery, New York, NY, USA (2017), https://doi.org/10.1145/3085228.3085232
[21] Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E., et al.: The FAIR guiding principles for scientific data management and stewardship. Scientific data 3(1), 1–9 (2016)
[22] Wong, A., Liu, V., Caelli, W., Sahama, T.: An architecture for trustworthy open data services. In: Trust Management IX: 9th IFIP WG 11.11 International Conference, IFIPTM 2015, Hamburg, Germany, May 26-28, 2015, Proceedings 9. pp. 149–162. Springer (2015)
[23] Yi, M.: Exploring the quality of government open data. The Electronic Library 37(1), 35–48 (2019), https://doi.org/10.1108/EL-06-2018-0124
[24] Zuiderwijk, A., Spiers, H.: Sharing and re-using open data: A case study of motivations in astrophysics. International Journal of Information Management 49, 228–241 (2019)
